The messy business of tokenization

11/29/2010

So what exactly is a word, in the context of our N-Gram service? The devil, it is said, is in the details.

As noted in earlier blog entries, our data comes straight from Bing. All tokens are case-folded and, with a few exceptions, all punctuation is stripped. This means words like I'm or didn't are treated as two tokens each, even though we might consider them to be single words. Here are the aforementioned exceptions in the bing-body:jun09 dataset:

.net, c++, c#, j#, g++, j++, com+, gdi+
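To make the rule concrete, here is a minimal sketch of a tokenizer that behaves as described: lowercase everything, keep the exception tokens intact, and drop all other punctuation. This is not the actual Bing tokenizer; the regular expression, the ASCII-only character class, and the names EXCEPTIONS and tokenize are assumptions made for illustration.

```python
import re

# Exception tokens listed for bing-body:jun09; other datasets may differ.
EXCEPTIONS = [".net", "c++", "c#", "j#", "g++", "j++", "com+", "gdi+"]

# Try the exceptions first (longest first), then fall back to alphanumeric runs.
# Restricting ordinary tokens to ASCII letters and digits is an assumption.
_TOKEN_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EXCEPTIONS, key=len, reverse=True))
    + r"|[a-z0-9]+"
)


def tokenize(text):
    """Case-fold, keep the exception tokens, and drop all other punctuation."""
    return _TOKEN_RE.findall(text.lower())


print(tokenize("Oh no, you didn't"))
print(tokenize("I'm learning C++ and .NET"))
```

With this sketch, didn't splits into didn and t, while c++ and .net survive as single tokens.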

The dataset qualification matters, because we aren't necessarily committed to keeping that exception set the same going forward. Splitting words into several tokens could be a problem for the GetConditionalProbability method, though, because the method is defined as returning the probability of the last word in the given context. Notice I used 'word' instead of 'token'. Therein lies a compromise: for the purpose of GetConditionalProbability we use whitespace as the word boundary. An example might help illustrate this point:

GetConditionalProbability("bing-body:apr10:3", "oh no, you didn't")

The input phrase has 5 tokens: oh, no, you, didn, and t. If we were simply using the token boundary, we would return the value of P(t|you didn) and ignore the earlier tokens, oh and no. But because we're using whitespace as the word boundary, the last word is didn't, so we return P(t|you didn)×P(didn|no you), which covers both of its tokens and leaves out only oh. Subtle, yes, but crucial if you want predictable results.
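The whitespace rule can be sketched in a few lines. The code below only works out which n-gram terms get multiplied for the last whitespace-delimited word; it does not call the service. The tokenizer is a simplified stand-in, and last_word_terms, conditional_log_prob, and the log_prob callable are hypothetical names, not part of the N-Gram API.

```python
import re


def tokenize(text):
    # Simplified stand-in for the service's tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def last_word_terms(phrase, order):
    """Return (token, context) pairs for each token of the last whitespace-delimited word."""
    words = phrase.split()
    if not words:
        return []
    tokens = tokenize(phrase)
    n_last = len(tokenize(words[-1]))  # how many tokens the last word splits into
    terms = []
    for i in range(len(tokens) - n_last, len(tokens)):
        # An order-n model conditions on at most n - 1 preceding tokens.
        context = tuple(tokens[max(0, i - (order - 1)):i])
        terms.append((tokens[i], context))
    return terms


def conditional_log_prob(phrase, order, log_prob):
    # log_prob is a caller-supplied (hypothetical) lookup: log_prob(token, context).
    # Multiplying probabilities is the same as adding their logarithms.
    return sum(log_prob(tok, ctx) for tok, ctx in last_word_terms(phrase, order))


for tok, ctx in last_word_terms("oh no, you didn't", order=3):
    print("P(%s|%s)" % (tok, " ".join(ctx)))
```

For the example phrase and an order-3 model, the terms are P(didn|no you) and P(t|you didn), matching the product above.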

We also have two special tokens, <s> and </s>, which mean beginning of segment and end of segment, respectively. They help you get probabilities at the segment boundaries. Here's an example:

>>> import MicrosoftNgram
>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:apr10:2')
>>> for t in s.Generate('<s>', maxgen=6): print t
...
('home', -1.3229439999999999)
('the', -1.389273)
('skip', -1.6477809999999999)
('search', -1.8886769999999999)
('a', -1.938218)
('this', -1.9436990000000001)

This shows that in the body stream, the most common first word is home, and so on.
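If you want to read those scores as plain probabilities, the sketch below converts them. It assumes the second element of each tuple is a base-10 log-probability, which this post does not state explicitly; the values are copied (rounded) from the session above, and to_probability is a hypothetical helper.

```python
# Scores copied from the session above, rounded to six decimals.
# Treating them as base-10 log-probabilities is an assumption.
scores = [
    ("home", -1.322944),
    ("the", -1.389273),
    ("skip", -1.647781),
    ("search", -1.888677),
    ("a", -1.938218),
    ("this", -1.943699),
]


def to_probability(log10_score):
    """Convert a base-10 log-probability to a probability."""
    return 10.0 ** log10_score


for word, score in scores:
    print("%-6s %.4f" % (word, to_probability(score)))
```

Under that assumption, home would start just under 5% of segments in this model.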
