Let's think about the scale of different lexicons, in orders of magnitude:
- 1,000 - the day-to-day vocabulary of someone in the United States
- 10,000 - the number of different words in Moby Dick
- 100,000 - the number of words understood by a state-of-the-art speech recognition engine
- 1,000,000,000 - the number of words found on the world-wide web
A speech recognition engine knows not only the probability of encountering any given word in its lexicon, but also the probability of sequences of those words, and it still runs comfortably on any modern desktop computer. A web-scale lexicon, however, poses a much greater challenge because of the sheer size of the data. You probably also lack the resources to crawl the entire web just to find out what is in that billion-word lexicon!
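To make "the probability of a word" and "the probability of a word sequence" concrete, here is a minimal sketch in Python. It assumes a tiny toy corpus and plain relative-frequency counts with no smoothing; the function names are made up for illustration and are not part of any web service interface.

```python
from collections import Counter


def train_bigram_model(tokens):
    """Build word and word-pair probability estimates from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Every token except the last one starts a bigram.
    contexts = Counter(tokens[:-1])
    total = len(tokens)

    def word_prob(word):
        # P(word): relative frequency of the word in the corpus.
        return unigrams[word] / total if total else 0.0

    def next_word_prob(prev, word):
        # P(word | prev): how often `word` follows `prev`.
        if contexts[prev] == 0:
            return 0.0  # unseen context; a real model would smooth here
        return bigrams[(prev, word)] / contexts[prev]

    return word_prob, next_word_prob


# Toy corpus, purely illustrative.
corpus = "the whale the sea the whale".split()
word_prob, next_word_prob = train_bigram_model(corpus)

print(word_prob("the"))                # 3 of 6 tokens
print(next_word_prob("the", "whale"))  # 2 of the 3 times "the" has a successor
```

The count tables here fit in a few lines of memory. At web scale the same tables hold billions of entries, which is where the size problem comes from.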
Fortunately for you, with Microsoft Bing, we have the data! We're providing a web service, available in both SOAP and REST flavors, that serves you language model data. Over the coming weeks we'll share more details about this web service, but those who can't wait should check out the information here.
Our proposition is simple: we'll bring the data so you can focus on your research.