If there ever was an overloaded term in Computer Science, it’s “model.” For instance, my colleagues in the eXtreme Computing Group have this terrific ambition to model the entire world! What we’re talking about here is much simpler: a model is a representation of a particular corpus.
One of the key insights in studying how documents are composed on the Web is that there are subtle but nevertheless distinct styles used, even in a single document. You know this intuitively — when you form a query to your search engine, for example, you often string together keywords to form a phrase, but you wouldn’t use the same phraseology in prose. Even in prose, the style in which we compose paragraphs will differ from how we compose titles or reference other material.
In the Web N-Gram service, we have collected textual data from Bing, drawn from four different sources (sometimes referred to as ‘streams’): Body, Anchor, Title, and Query. To get the list of currently supported models, use this handy URL.
There are two different naming conventions used by our service: the URN form when using XML/SOAP, and the Path form for REST access.
In this context, order=1 is unigram, order=2 is bigram, and so forth.
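To make the idea of order concrete, here is a minimal sketch in Python. It does not call the Web N-Gram service. The helper name `ngrams` and the sample sentence are assumptions made up for illustration, and it only shows what the token sequences of a given order look like.

```python
def ngrams(tokens, order):
    """Return every contiguous run of `order` tokens as a tuple.

    Hypothetical helper for illustration; not part of the Web N-Gram service.
    """
    if order < 1:
        raise ValueError("order must be at least 1")
    # Slide a window of width `order` across the token list.
    return [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]


# Made-up sample text, split on whitespace.
tokens = "the quick brown fox".split()

unigrams = ngrams(tokens, 1)  # order=1: single tokens
bigrams = ngrams(tokens, 2)   # order=2: adjacent pairs
trigrams = ngrams(tokens, 3)  # order=3: adjacent triples
```

A model of order 2 is built from sequences like those in `bigrams`, a model of order 3 from sequences like those in `trigrams`, and so on.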
A number of papers have been published demonstrating computational tasks using different models. Here’s one that appeared in SIGIR: A Comparative Study of Bing Web N-gram Language Models for Web Search and Natural Language Processing.