Language Modeling 101

The Microsoft Web N-Gram service, at its core, is a data
service that returns conditional probabilities of words given a context.
 But what exactly does that mean?  Let me explain.

Conditional probability is usually expressed with a vertical
bar: P(w|c).  In plain English you would
say: what is the probability of w given
c?  In language modeling, w represents a
word, and c represents the context, which is a fancy way of saying the sequence
of words that come before w.  For example, P(wars|star) asks how likely the
word 'wars' is when the word before it is 'star'.

The number of words that the service will consider in a
query is known as the order, which is
the N in N-gram.  The order is
split in two: one for the word (w) itself, and N-1 for the context (c).  For a
1-gram, or unigram, there is no context at all; you simply get the probability of
a given word among all words.  Let's jump
ahead and show some real values using the Python library available here:

>>> import MicrosoftNgram
>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:1')
>>> s.GetConditionalProbability('chris')
-4.1125360000000004
>>> s.GetConditionalProbability('the')
-1.480988

For the moment, ignore the details about the instantiation
of the LookupService object (i.e. s), and treat it as a black box that can tell
you the unigram probability.  The return
values are negative because they are log values in base 10.  Because a probability value, in linear space,
will always be between 0 and 1, its log-space counterpart will always be between
negative infinity and 0.  So the odds of seeing my first
name are 1 : 1/P = 1 : 10^4.112536, or approximately 1:13,000.  Contrast this with the odds of seeing the
most common word in English on the web, the:
1:30.
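
You can sanity-check that arithmetic yourself; this is just
Python exponentiating the log10 value returned by the unigram lookup above,
no service call required:

>>> p = 10 ** -4.112536   # convert the log10 value back to linear space
>>> round(1 / p)          # the denominator of the roughly 1:13,000 odds
12958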

But language modeling is often more interesting at
higher orders of N: bigrams, trigrams, and so on.  Let's try some more examples:

>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:2')
>>> s.GetConditionalProbability('star wars')
-1.1905209999999999
>>> s.GetConditionalProbability('star map')
-3.7370559999999999

The key change from the earlier example is how s was
instantiated, namely the '2' at the very end of the model argument.  This indicates that we're interested in
bigrams.  There'll be more on models in
the near future.  Anyway, the queries show that given the
context of 'star', the word 'wars' is more than 100 times as likely as the
word 'map'.
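
In fact, you can recover the exact ratio from the two log
values with ordinary arithmetic (the numbers are the bigram results above):

>>> round(10 ** (-1.190521 + 3.737056))   # P('wars'|'star') / P('map'|'star')
352

And now for a trigram example: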

>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:3')
>>> s.GetConditionalProbability('i can has')
-2.4931369999999999
>>> s.GetConditionalProbability('i can have')
-2.277034

The ungrammatical 'i can has' scores only slightly below 'i can have'.  Perhaps the prevalence of LOLspeak on the Web should not be underestimated.
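
The same ratio trick puts a number on the gap (again, plain arithmetic
on the trigram log values above):

>>> round(10 ** (-2.277034 + 2.493137), 2)   # P('have'|'i can') / P('has'|'i can')
1.64

'have' wins, but only by a factor of about 1.6.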