The Microsoft Web N-Gram service, at its core, is a data service that returns conditional probabilities of words given a context. But what exactly does that mean? Let me explain.

Conditional probability is usually expressed with a vertical bar: P(w|c). In plain English you would say: what is the probability of w *given* c? In language modeling, w represents a word, and c represents the context, which is a fancy way of saying the sequence of words that come before w.
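
To make P(w|c) concrete, here is a minimal sketch that estimates it from raw counts in a tiny made-up corpus. The corpus, the `conditional_probability` helper, and the plain counting approach are all assumptions for illustration; how the service actually estimates and smooths its probabilities is not covered here.

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = "star wars is a star map of star wars fans".split()

# Count every adjacent (context, word) pair in the corpus.
bigram_counts = Counter(zip(corpus, corpus[1:]))


def conditional_probability(word, context):
    """Estimate P(word | context) for a one-word context by simple counting."""
    # Number of times the context is followed by any word at all.
    context_count = sum(
        count for (first, _), count in bigram_counts.items() if first == context
    )
    if context_count == 0:
        return 0.0
    return bigram_counts[(context, word)] / context_count


print(conditional_probability("wars", "star"))
print(conditional_probability("map", "star"))
```

In this toy corpus, 'star' is followed by 'wars' twice and by 'map' once, so the two estimates come out to two thirds and one third.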

The number of words that the service will consider in a query is known as the *order*, which is the N in N-gram. The order is split into two parts: one for the word (w) itself, and N-1 for the context (c). For a 1-gram, or unigram, there is no context at all, just the simple probability of a given word among all words. Let’s jump ahead and look at some real values using the Python library available here:

```
>>> import MicrosoftNgram
>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:1')
>>> s.GetConditionalProbability('chris')
-4.1125360000000004
>>> s.GetConditionalProbability('the')
-1.480988
```

For the moment, ignore the details of how the LookupService object (i.e. s) is instantiated, and treat it as a black box that can tell you the unigram probability. The return values are negative because they are base-10 logarithms. Because a probability value in linear space always lies between 0 and 1, its logarithm lies between negative infinity and 0. So the odds of seeing my first name are 1:(1/P), with 1/P = 1/10^(-4.112536), or approximately 1:13,000. Contrast this with the odds of seeing the most common word in English on the web, *the*: 1:30.
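
The conversion from log space back to linear space can be sketched in a few lines of plain Python. The helper names `log10_to_probability` and `one_in_n` are made up for this example and are not part of the MicrosoftNgram library; the inputs are the two values returned above.

```python
def log10_to_probability(log_prob):
    """Convert a base-10 log probability back to a linear probability."""
    return 10.0 ** log_prob


def one_in_n(log_prob):
    """Return N such that the word is seen roughly once in every N words."""
    return 1.0 / log10_to_probability(log_prob)


# Values copied from the unigram queries above.
for word, log_prob in [("chris", -4.112536), ("the", -1.480988)]:
    probability = log10_to_probability(log_prob)
    print(f"{word}: P = {probability:.6f}, about 1 in {one_in_n(log_prob):,.0f}")
```

Working in log space also means that comparing two words only requires a subtraction instead of dividing two very small numbers.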

But language modeling is often more interesting at higher orders of N: bigrams, trigrams, and so on. Let’s try some more examples:

```
>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:2')
>>> s.GetConditionalProbability('star wars')
-1.1905209999999999
>>> s.GetConditionalProbability('star map')
-3.7370559999999999
```

The key change from the earlier example is how s was instantiated, namely the ‘2’ at the very end of the model argument. This indicates that we’re interested in bigrams. There’ll be more on models in the near future. Anyway, the queries show that given the context of ‘star’, the word ‘wars’ is more than 100 times as likely to follow as ‘map’. And now for a trigram example:

```
>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:3')
>>> s.GetConditionalProbability('i can has')
-2.4931369999999999
>>> s.GetConditionalProbability('i can have')
-2.277034
```

Perhaps the prevalence of LOLSpeak on the Web should not be underestimated.
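
To see how far apart the bigram and trigram values really are, one can subtract the log probabilities and raise 10 to the difference. The sketch below does this offline with the numbers returned above; `likelihood_ratio` is a hypothetical helper, not something the library provides.

```python
def likelihood_ratio(log_prob_a, log_prob_b):
    """How many times more probable A is than B, given base-10 log probabilities."""
    return 10.0 ** (log_prob_a - log_prob_b)


# Values copied from the bigram and trigram queries above.
star_wars, star_map = -1.190521, -3.737056
i_can_have, i_can_has = -2.277034, -2.493137

print(f"'wars' vs 'map' after 'star': {likelihood_ratio(star_wars, star_map):.0f}x")
print(f"'have' vs 'has' after 'i can': {likelihood_ratio(i_can_have, i_can_has):.2f}x")
```

The first ratio comes out at a few hundred, while the second is well under 2, which is what makes the trigram result notable.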