The messy business of tokenization

So what exactly is a word in the context of our N-Gram service? The devil, it is said, is in the details. As noted in earlier blog entries, our data comes straight from Bing. All tokens are case-folded and, with a few exceptions, all punctuation is stripped. This means words like I’m or didn’t are treated as…
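
To make the normalization concrete, here is a minimal sketch of that pipeline in Python. The treatment of contractions is illustrative only, since the excerpt is cut off before spelling out the exceptions; the code assumes apostrophes are simply dropped, so didn’t becomes the single token didnt.

```python
import re

def tokenize(text):
    """Illustrative normalization: case-fold, then strip punctuation.

    Apostrophes are dropped in place rather than split on, so a
    contraction like "didn't" survives as one token ("didnt"). This
    is an assumption -- the service's actual exception list may differ.
    """
    text = text.lower()                    # case-fold
    text = re.sub(r"[’']", "", text)       # drop apostrophes inside contractions
    text = re.sub(r"[^\w\s]", " ", text)   # replace remaining punctuation with spaces
    return text.split()

print(tokenize("I’m sure she didn’t know."))
# ['im', 'sure', 'she', 'didnt', 'know']
```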

Wordbreakingisacinchwithdata

For the task of word-breaking, many different approaches exist. Today we’re writing about a purely data-driven approach, and it’s actually quite straightforward: all we do is consider every character boundary as a potential word boundary and compare the relative joint probabilities, with no insertion penalty applied. A data-driven approach is great…
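
The idea translates almost directly into a short dynamic program. The sketch below assumes a unigram model with log10 probabilities; the hard-coded table and the OOV floor stand in for lookups against the actual N-gram service.

```python
import math

# Toy log10 unigram probabilities; in practice each lookup would go
# against the N-gram service rather than a hard-coded table.
LOGPROB = {
    "word": -3.0, "breaking": -4.0, "is": -2.0, "a": -1.5,
    "cinch": -5.5, "with": -2.5, "data": -3.5,
}
OOV = -9.0  # assumed floor for unseen tokens

def unigram_logprob(token):
    return LOGPROB.get(token, OOV)

def wordbreak(s):
    """Treat every character boundary as a potential word boundary and
    keep the split with the highest joint (summed log) probability.
    No insertion penalty is applied, per the description above."""
    # best[i] = (score, segmentation) for the prefix s[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(i):
            score = best[j][0] + unigram_logprob(s[j:i])
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [s[j:i]])
    return best[len(s)][1]

print(wordbreak("wordbreakingisacinchwithdata"))
# ['word', 'breaking', 'is', 'a', 'cinch', 'with', 'data']
```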

The fluid language of the Web

We prepared, as we had for the earlier dataset, the top-100K word list for the body stream for Apr10. You can download it here. We also decided to take a closer look at how the top-100K lists changed between Jun09 and Apr10. Our findings are interesting: The union of the word set…
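
The comparison itself is a simple set computation once both lists are on disk. A sketch, assuming each snapshot is a plain text file with one word per line (the file names here are hypothetical):

```python
def load_top_words(path):
    """Read a one-word-per-line top-K list into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# Hypothetical file names for the two snapshots.
jun09 = load_top_words("top100k-bing-body-jun09.txt")
apr10 = load_top_words("top100k-bing-body-apr10.txt")

print("union:", len(jun09 | apr10))
print("shared:", len(jun09 & apr10))
print("new in Apr10:", len(apr10 - jun09))
print("dropped since Jun09:", len(jun09 - apr10))
```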

Using the MicrosoftNgram Python Module

Over the past few posts I’ve shown some samples of the MicrosoftNgram Python module. Writing documentation is not something the engineers I know enjoy doing; in fact, the only documentation available right now is through help(MicrosoftNgram). Here’s an attempt to rectify the situation. To get started, you’ll of course need to get the module, which you…
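
Until fuller documentation lands, here is a short usage sketch along the lines of those earlier samples. The token and model name below are placeholders, and the method names should be checked against help(MicrosoftNgram), which remains the authoritative reference.

```python
import MicrosoftNgram

# Both values below are placeholders: supply your own access token,
# and pick a model name from the service's catalog.
s = MicrosoftNgram.LookupService(token='YOUR-TOKEN-HERE',
                                 model='urn:ngram:bing-body:apr10:3')

# Log10 joint probability of the whole phrase under the chosen model.
print(s.GetJointProbability('the quick brown fox'))

# Log10 conditional probability of the last word given the preceding ones.
print(s.GetConditionalProbability('the quick brown fox'))
```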

Who doesn’t like models?

If there ever was an overloaded term in Computer Science, it’s models.  For instance, my colleagues in the eXtreme Computing Group have this terrific ambition to model the entire world!  What we’re talking about here is much simpler: it is a representation of a particular corpus. One of the key insights in studying how documents…
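
To keep that narrower sense of the word concrete: each model the service exposes is pinned to one corpus stream, one snapshot in time, and one N-gram order. The sketch below parses a model name of the assumed form urn:ngram:&lt;source&gt;-&lt;stream&gt;:&lt;snapshot&gt;:&lt;order&gt;; treat the exact format as an assumption.

```python
from collections import namedtuple

Model = namedtuple("Model", "source stream snapshot order")

def parse_model(urn):
    """Split an assumed 'urn:ngram:<source>-<stream>:<snapshot>:<order>'
    model name into its parts, making explicit that a model represents
    one corpus at one point in time, at one N-gram order."""
    _, _, corpus, snapshot, order = urn.split(":")
    source, stream = corpus.split("-", 1)
    return Model(source, stream, snapshot, int(order))

print(parse_model("urn:ngram:bing-body:apr10:3"))
# Model(source='bing', stream='body', snapshot='apr10', order=3)
```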
