Language Modeling 102

In last week's post, we covered the basics of conditional probabilities in language modeling.  Let's now have another quick math lesson on joint probabilities.

A joint probability is useful when you're interested in the probability of an entire sequence of words.  Here I can borrow an equation from Wikipedia:
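
P(w1,...,wm) = ∏(i=1..m) P(wi|w1,...,wi-1) ≈ ∏(i=1..m) P(wi|wi-(n-1),...,wi-1)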

The middle term is the true probability of observing the word sequence w1 through wm.  The right term is the approximation you get when you only have an order-n dataset.  Here's an example using our Python library:

>>> import MicrosoftNgram
>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:2')
>>> s.GetJointProbability('apple pie')
-6.170992
>>> s.GetConditionalProbability('apple') + s.GetConditionalProbability('apple pie')
-6.170992

As expected, the joint probability is the same as the product of the conditional probabilities [remember: those numbers are log values, and log(a*b) = log(a) + log(b)].
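
To make that concrete for longer sequences, here's a small sketch (not part of the library itself) that rebuilds a joint log probability by summing the conditional lookups over successive prefixes, then compares it with GetJointProbability.  The function name and the example phrase are made up for illustration; the two library calls are the same ones used above, and if the service computes the joint the same way for longer phrases, the two numbers should line up.

import MicrosoftNgram

# Same bigram model used in the session above; assumes the lookup service is reachable.
s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:2')

def joint_via_chain_rule(phrase):
    """Sum log P(wi | w1 ... wi-1) over successive prefixes of the phrase."""
    words = phrase.split()
    total = 0.0
    for i in range(len(words)):
        # GetConditionalProbability('w1 ... wi') returns the log conditional
        # probability of the last word given the words before it, as shown above.
        total += s.GetConditionalProbability(' '.join(words[:i + 1]))
    return total

phrase = 'apple pie recipe'
print(joint_via_chain_rule(phrase))   # should agree with the line below
print(s.GetJointProbability(phrase))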

But there's a hitch: what happens if a particular sequence of words has never been observed?  Even in a corpus as large as the web, this is going to happen.  Should the joint probability be 0?  Depending on your application, the answer might be yes, but for most scenarios, if the word sequence was at all plausible, an approximation would be preferable.  This is where smoothing comes into play.
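
You can see the effect by asking for a word sequence that almost certainly never appeared in the corpus; with smoothing, the service can still return a finite (if very small) log probability rather than negative infinity.  A quick check, using the same setup as above and a made-up phrase:

import MicrosoftNgram

s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:2')

# A deliberately improbable word sequence.  Without smoothing it would simply be
# unobserved (probability 0, i.e. log probability of negative infinity); with
# smoothing we can still get a finite, very small estimate back.
print(s.GetJointProbability('purple gerbil calculus'))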

A number of different smoothing techniques are used today, but in essence we assess what we call a backoff penalty for word combinations for which we have no data.  P(wm|wm-n,...,wm-1) is approximated as P(wm|wm-n+1,...,wm-1) * BO(wm-n,...,wm-1), i.e., we back off to a lower-order n-gram and assess a penalty that is a function of the context.
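
To illustrate the backoff idea itself (a toy sketch, not the particular smoothing our models use), here's a version over two made-up tables of log values, roughly the same two kinds of entries an ARPA-format model file stores: log conditional probabilities keyed by the full n-gram, and log backoff penalties keyed by the context.

# Toy tables of log values; the numbers are invented for illustration.
LOGPROB = {
    ('apple', 'pie'): -1.2,   # log P(pie | apple)
    ('pie',): -4.5,           # log P(pie)
    ('tart',): -5.1,          # log P(tart)
}
BACKOFF = {
    ('apple',): -0.7,         # log BO(apple)
}

def log_cond_prob(word, context):
    """Return log P(word | context), backing off to a shorter context when needed."""
    ngram = tuple(context) + (word,)
    if ngram in LOGPROB:
        return LOGPROB[ngram]
    if not context:
        # Nothing left to back off to; a real model would fall back to an
        # unknown-word estimate here.
        return float('-inf')
    # Back off: pay the (log) penalty for this context, then drop its earliest word.
    return BACKOFF.get(tuple(context), 0.0) + log_cond_prob(word, context[1:])

print(log_cond_prob('pie', ['apple']))    # found directly: -1.2
print(log_cond_prob('tart', ['apple']))   # backs off: -0.7 + -5.1 = -5.8

When the full n-gram is in the table the penalty never applies; when it isn't, the recursion drops one word of context at a time, accumulating the penalties as it goes.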