## The dirty secret about large-vocabulary hashes

The first step in the n-gram probability lookup process is to convert the input into tokens, as discussed in an earlier post.  The second step is to convert each token into a numerical ID.  The data structure used here is a kind of hash. You might be asking, why not a trie?  The simple answer…
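As a rough sketch of the token-to-ID step (the names and structure here are my own, not the service's actual hash implementation), a vocabulary hash might look like:

```python
def build_vocab(tokens):
    """Assign a numerical ID to each distinct token, in first-seen order."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def to_ids(tokens, vocab, unk_id=-1):
    """Map tokens to IDs; tokens outside the vocabulary get a sentinel ID."""
    return [vocab.get(tok, unk_id) for tok in tokens]

vocab = build_vocab(["the", "quick", "brown", "fox", "the"])
ids = to_ids(["the", "fox", "jumps"], vocab)  # "jumps" is out of vocabulary
```

A Python dict is itself a hash table, so this illustrates the lookup behavior, though a large-vocabulary service would use a far more memory-conscious structure.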

## Well, do ya, P(<UNK>)?

Today we’ll do a refresher on unigrams and the role of P(&lt;UNK&gt;). As you recall, for unigrams, P(x) is simply the probability of encountering x, irrespective of the words preceding it.  A naïve (and logical) way to compute this would be to simply take the number of times x is observed and divide it by the number…
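The naïve count-and-divide estimate can be sketched in a few lines (a minimal maximum-likelihood version; the real service's smoothing and UNK handling are not shown in this excerpt):

```python
from collections import Counter

def unigram_probs(corpus_tokens):
    """Maximum-likelihood unigram estimates: count(x) / total token count."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

probs = unigram_probs(["a", "b", "a", "a"])
# probs["a"] == 0.75, probs["b"] == 0.25; any word never observed gets
# probability 0 under this estimate, which is the gap P(<UNK>) exists to fill.
```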

## The messy business of tokenization

So what exactly is a word, in the context of our N-Gram service?  The devil, it is said, is in the details. As noted in earlier blog entries, our data comes straight from Bing.  All tokens are case-folded and, with a few exceptions, all punctuation is stripped.  This means words like I’m or didn’t are treated as…
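A rough approximation of that normalization (case-folding plus punctuation stripping; the service's exact rules and exceptions differ) could look like:

```python
import re

def normalize(text):
    """Case-fold, then drop everything except word characters and spaces.
    A crude stand-in for the preprocessing described above."""
    text = text.lower()
    return re.sub(r"[^\w\s]", "", text)

normalize("I'm didn't")  # apostrophes removed: "im didnt"
```

Note how the apostrophes vanish, which is exactly why contractions become a tricky edge case.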

## Wordbreakingisacinchwithdata

For the task of word-breaking, many different approaches exist.  Today we’re writing about a purely data-driven approach, and it’s actually quite straightforward — all we do is consider every character boundary as a potential word boundary, and compare the relative joint probabilities, with no insertion penalty applied.  A data-driven approach is great…

## Generative-Mode API

In previous posts I wrote about how the Web N-Gram service answers the question: what is the probability of word w in the context c?  This is useful, but sometimes you want to know: what are some words {w} that could follow the context c?  This is where the Generative-Mode APIs come into play. Examples…
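A tiny in-memory stand-in makes the contrast concrete (the table and its probabilities are invented for illustration; the real API is a web service, not a local dict):

```python
# Hypothetical bigram table: context word -> {candidate next word: P(w | c)}.
BIGRAMS = {
    "new": {"york": 0.4, "jersey": 0.2, "year": 0.1},
}

def generate(context, top_k=3):
    """Generative mode: return candidate next words for a context,
    ranked by conditional probability, instead of scoring one word."""
    candidates = BIGRAMS.get(context, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:top_k]

generate("new")  # ["york", "jersey", "year"]
```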

## Language Modeling 102

In last week’s post, we covered the basics of conditional probabilities in language modeling.  Let’s now have another quick math lesson on joint probabilities. A joint probability is useful when you’re interested in the probability of an entire sequence of words.  Here I can borrow an equation from Wikipedia: The middle term is the…
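The chain rule behind that equation, P(w₁…wₙ) = ∏ P(wᵢ | w₁…wᵢ₋₁), can be sketched with a bigram approximation of the history (the conditional probabilities below are hypothetical values, not real model output):

```python
# Toy conditional probabilities P(w | previous word); "<s>" marks
# sentence start. Hypothetical values for illustration.
COND = {("<s>", "the"): 0.2, ("the", "cat"): 0.01, ("cat", "sat"): 0.05}

def joint_prob(words):
    """Chain rule: multiply P(w_i | history), here truncating the
    history to just the previous word (a bigram model)."""
    p = 1.0
    prev = "<s>"
    for w in words:
        p *= COND.get((prev, w), 0.0)
        prev = w
    return p

joint_prob(["the", "cat", "sat"])  # 0.2 * 0.01 * 0.05 = 1e-4
```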

## Language Modeling 101

The Microsoft Web N-Gram service, at its core, is a data service that returns conditional probabilities of words given a context.  But what does that exactly mean?  Let me explain. Conditional probability is usually expressed with a vertical bar: P(w|c).  In plain English you would say: what is the probability of w given c?  In…
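To make P(w|c) concrete, here is a minimal count-based estimate over a toy corpus (my own illustration; the service computes this at web scale with smoothing):

```python
from collections import Counter

# Estimate P(w | c) as count(c, w) / count(c) from a tiny corpus.
corpus = "the cat sat on the mat".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of (context, word) pairs
contexts = Counter(corpus[:-1])              # counts of each context word

def cond_prob(w, c):
    """P(w | c): how often w follows c, among all occurrences of c."""
    return bigrams[(c, w)] / contexts[c] if contexts[c] else 0.0

cond_prob("cat", "the")  # count("the cat") / count("the") = 1/2
```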