The dirty secret about large-vocabulary hashes

The first step in the n-gram probability lookup process is to convert the input into tokens, as discussed in an earlier post.  The second step is to convert each token into a numerical ID.  The data structure used for this mapping is a kind of hash.
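As a toy illustration of those two steps, here is a minimal sketch.  The whitespace tokenizer and the four-word vocabulary are hypothetical stand-ins, and the plain dictionary used for the mapping is exactly the kind of structure the rest of this post explains how to avoid.

    # Minimal sketch of the two lookup steps (hypothetical tokenizer and vocab).
    VOCAB = {"the": 0, "quick": 1, "brown": 2, "fox": 3}   # toy lexicon

    def tokenize(text: str) -> list[str]:
        return text.lower().split()                        # step 1: text -> tokens

    def to_ids(tokens: list[str]) -> list[int]:
        return [VOCAB[t] for t in tokens]                  # step 2: tokens -> IDs

    print(to_ids(tokenize("The quick brown fox")))         # [0, 1, 2, 3]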

You might be asking, why not a trie?  The simple answer is size.  If you have on the order of a billion words in your lexicon, a trie would generally be quite large, and large usually means slow, especially at this scale.  What we need instead is a data structure that maps tokens to IDs without storing the lexicon itself, which would take gigabytes of storage.

A number of techniques exist in this area, but a property many of them share, ours included, is the possibility of false positives.  That is, a known token will reliably map to a particular ID, but if you have an unknown token, there is no guarantee it can be detected as unknown: an in-vocabulary and an out-of-vocabulary token can map to the same ID.  This is the trade-off that gets the size down.  Folks who study these sorts of things often talk in terms of bits per token, and they often employ a secondary construct known as a fingerprint to help reduce the odds of a false positive.  Of course, the fingerprint also contributes to the bits per token, so it comes down to keeping the error rate acceptably low.
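To make the trade-off concrete, here is a rough sketch of the idea, not the production data structure: one hash picks a slot that yields the ID, and a few fingerprint bits from a second hash catch most, but not all, unknown tokens.  The table size, hash functions, and the handling of slot collisions are all assumptions made for illustration.

    import hashlib

    FINGERPRINT_BITS = 8     # assumed budget; more bits = fewer false positives
    NUM_SLOTS = 1 << 20      # assumed table size

    def _hash(token: str, salt: str) -> int:
        digest = hashlib.sha1((salt + token).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def slot(token: str) -> int:
        return _hash(token, "slot") % NUM_SLOTS

    def fingerprint(token: str) -> int:
        return _hash(token, "fp") % (1 << FINGERPRINT_BITS)

    # Build time: store only per-slot fingerprints and IDs, never the tokens.
    # (A real structure also has to resolve slot collisions between
    # in-vocabulary tokens, e.g. via a minimal perfect hash; glossed over here.)
    stored_fp, stored_id = {}, {}
    for i, token in enumerate(["the", "quick", "brown", "fox"]):   # toy lexicon
        s = slot(token)
        stored_fp[s], stored_id[s] = fingerprint(token), i

    def lookup(token: str):
        s = slot(token)
        # An unknown token slips through this check with probability about
        # 2**-FINGERPRINT_BITS and silently receives another token's ID:
        # a false positive.
        if s in stored_fp and stored_fp[s] == fingerprint(token):
            return stored_id[s]
        return None

With 8 fingerprint bits, an unknown token sneaks through roughly 1 time in 256; spending more bits per token buys a lower error rate, which is exactly the knob described above.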

Choosing a probabilistic hash instead of a data structure that embodies the lexicon also means that our service cannot do partial-word wildcard lookups (i.e., you cannot look up chris* and get the probabilities of chris, christopher, christmas, etc.).  It does, however, allow us to use a tree-like structure to facilitate backoff lookup (see this post for more on this) as well as our generative-mode API (post), so it isn't all bad.  Note that in order to produce the list of words in generative mode, we do have the ability to map IDs back to tokens.  In other words, by mapping a token to an ID and then back again, we could test the validity of the token via the round trip.  For performance reasons, we don't bother.
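For completeness, here is what that round-trip check could look like.  The tables below are toy stand-ins: in the real service only the probabilistic forward hash and the ID-to-token table needed for generative mode exist, and as noted above the check is skipped for performance.

    lexicon = ["the", "quick", "brown", "fox"]
    token_to_id = {t: i for i, t in enumerate(lexicon)}   # stand-in forward map
    id_to_token = lexicon                                 # reverse map used by generative mode

    def is_in_vocabulary(token: str) -> bool:
        # A false positive hands an unknown token some other token's ID;
        # mapping that ID back to a token and comparing catches it.
        token_id = token_to_id.get(token, hash(token) % len(lexicon))
        return id_to_token[token_id] == token

    print(is_in_vocabulary("fox"))        # True
    print(is_in_vocabulary("wolverine"))  # False: the round trip doesn't match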