I just read Building and Exploring Web Corpora, which includes the Proceedings of the 3rd Web as Corpus Workshop (WAC3-2007), held at the University of Louvain-la-Neuve in September 2007. Several of the papers describe how computational linguists have been using Microsoft’s Live Search Application Programming Interface (API) to build and clean corpora for use in natural language processing.
One of the papers (Leturia et al.) describes CorpEus, a tool that the authors designed to build web corpora for the Basque language using the Live Search (LS) API.
Another very interesting paper, by William Fletcher, explains why that API met the linguists’ requirements for generating concordances for linguistic research. Let me quote Fletcher here:
· Of the Search Engines which provide free APIs to developers, Live Search is the most generous by far: it allows 10,000 queries per application id (AppID) per IP address per day; [TF: Fletcher mentions 10,000 queries per day, while Leturia et al. indicate that the API allows 25,000 queries per day. The latter figure is in fact the correct one, which makes the API even more generous]
· LS provides high-quality search results, with relatively few pages from link farms or “scraper sites”, which repeat content from or link to other pages merely for advertising revenue;
· It also supports search by location, i.e. by country or even latitude and longitude;
· Live Search is more responsive to changes on the Web: there is faster turnover in the top hits returned for a given query than with Google or Yahoo!, and documents in the cache tend to be “fresher”, i.e. updated more frequently;
· The LS cache provides quick, reliable access to the original texts. In documents retrieved from the cache, LS generally detects the character set encoding accurately and converts it to UTF-8, thereby eliminating a potential source of variability and errors;
· LS also converts Adobe Acrobat PDF documents to HTML which closely reflects the formatting of the original;
· The Live Search API provides direct links to the cache, and the site responds rapidly and at a high transfer rate, permitting very efficient data collection without delays, redirections or dead links.
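For readers who have not met the term, a concordance lists every occurrence of a word together with a window of surrounding text, often called keyword in context (KWIC). The sketch below only illustrates that idea on text that has already been downloaded. It is not Fletcher's implementation and it does not call the Live Search API; the function name `kwic`, the `width` parameter and the sample sentence are all assumptions made for this example.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Up to `width` characters of context on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Assumed sample text standing in for a page retrieved from the web.
sample = (
    "The web is a corpus. Linguists query the web to find examples, "
    "and the Web changes every day."
)

for line in kwic(sample, "web", width=20):
    print(line)
```

In a real pipeline the input would be the cleaned text of each retrieved page rather than a hard-coded string, and the window would usually be measured in words rather than characters.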
Here is the full reference for that paper, in case you want to read the whole story:
William H. Fletcher: Implementing a BNC-Compare-able Web Corpus, in Fairon, C., Naets, H., Kilgarriff, A., de Schryver, G-M (eds): Building and Exploring Web Corpora – Proceedings of the 3rd Web as Corpus Workshop, Incorporating Cleaneval (WAC3-2007, September 2007), UCL, Presses Universitaires de Louvain, 2007.
— Thierry Fontenelle (Program Manager)
This post is also published on the Office Natural Language Team blog, where you can read about the linguistic technologies we develop.