Well, do ya, P()?

Today we'll do a refresher on unigrams and the role of P(<UNK>).

As you recall, for unigrams, P(x) is simply the probability of encountering x irrespective of the words preceding it.  A naïve (and logical) way to compute this would be to take the number of times x is observed and divide it by the total number of words.  That works fine until you encounter a word you'd never seen before: what should the unigram probability of that term be?
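Roughly in code, that naive count-and-divide estimate looks like this (a minimal sketch; the name mle_unigram is mine, not from any particular toolkit):

```python
from collections import Counter

def mle_unigram(tokens):
    """Naive unigram estimate: P(x) = C(x) / C(*)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = mle_unigram("the cat sat on the mat".split())
print(probs["the"])           # 2/6
print(probs.get("dog", 0.0))  # OOV word: gets no probability at all
```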

One option is to say P(x) is 0 if x is out-of-vocabulary (OOV).  This would be the right answer if you were dealing with unsmoothed models.  What we have instead are smoothed models, which means we want to assign some nonzero probability to every word, however unlikely.  If you were computing a joint probability involving multiple terms, you likely don't want a single term to 'doom' the overall probability to 0 just because it was OOV.  So what we do here is give the unknown word a small probability mass, and, since the probabilities still need to sum to unity, we carve that mass out by adding one to the denominator.  If the count of a word x is written as C(x) and the count of all words as C(*), you'd have:

P(x) = C(x) / [1 + C(*)]
P(<UNK>) = 1 / [1 + C(*)]

What the equations above show is that we've given an unknown word the same probability as a word observed exactly once.
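A minimal sketch of that scheme (again, the class name SmoothedUnigram and its methods are just for illustration, under the assumptions above):

```python
from collections import Counter

class SmoothedUnigram:
    """Unigram model that reserves one 'count' of mass for unknown words."""

    def __init__(self, tokens):
        self.counts = Counter(tokens)
        self.total = sum(self.counts.values())  # C(*)

    def prob(self, word):
        # Known word:   P(x)     = C(x) / [1 + C(*)]
        # Unknown word: P(<UNK>) = 1    / [1 + C(*)]
        return self.counts.get(word, 1) / (1 + self.total)

model = SmoothedUnigram("the cat sat on the mat".split())
print(model.prob("the"))  # 2/7
print(model.prob("dog"))  # 1/7, same as a word observed once
```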

But what happens if you encounter multiple unknown terms while evaluating a joint probability?  Since that single probability mass stands in for every unknown term, it has to be shared among them: with n distinct unknown terms, you divide it n ways, and the contribution of the unknown terms to the overall joint probability is thus [P(<UNK>)/n]^n.
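Here's a sketch of how that plays out when you multiply unigram probabilities together (the helper joint_prob is hypothetical, and it assumes each unknown term appears once, as in the formula above):

```python
from collections import Counter

def joint_prob(query_tokens, counts, total):
    """Joint unigram probability where the <UNK> mass is split n ways
    across the n distinct unknown terms in the query."""
    unknowns = {w for w in query_tokens if w not in counts}
    n = len(unknowns)
    p_unk = 1.0 / (1 + total)                 # P(<UNK>) = 1 / [1 + C(*)]
    prob = 1.0
    for w in query_tokens:
        if w in counts:
            prob *= counts[w] / (1 + total)   # P(x) = C(x) / [1 + C(*)]
        else:
            prob *= p_unk / n                 # each unknown gets P(<UNK>) / n
    return prob

counts = Counter("the cat sat on the mat".split())
total = sum(counts.values())                  # C(*) = 6
# Two distinct unknowns ("dog", "ran") each contribute P(<UNK>)/2 = (1/7)/2
print(joint_prob(["the", "dog", "ran"], counts, total))
```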