The dirty secret about large-vocabulary hashes

The first step in the n-gram probability lookup process is to convert the input into tokens, as discussed in an earlier post.  The second step is to convert each token into a numerical ID.  The data structure used here is a kind of hash. You might be asking: why not a trie?  The simple answer…
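The two steps above can be sketched in a few lines of Python. This is a toy illustration only — the tokenizer and the vocabulary here are made up, and the reserved ID 0 for unknown tokens is an assumption, not the service's actual scheme:

```python
def tokenize(text):
    # Simplified stand-in for the tokenization discussed in the earlier post.
    return text.lower().split()

def tokens_to_ids(tokens, vocab):
    # Step two: hash each token to its numerical ID.
    # Unknown tokens map to a reserved <UNK> ID (0 here, by assumption).
    return [vocab.get(t, 0) for t in tokens]

vocab = {"<UNK>": 0, "the": 1, "quick": 2, "fox": 3}
ids = tokens_to_ids(tokenize("The quick fox"), vocab)
print(ids)  # [1, 2, 3]
```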


Did you mean…Schwarzenegger?

You launch your favorite search engine and enter a term or two.  Then, near the top, the search engine helpfully suggests a possible alternate query, at which point you scratch your head and say “wha?” (or perhaps something more vulgar). You think you can do a better job?  Now’s your chance to prove it —…


Well, do ya, P(<UNK>)?

Today we’ll do a refresher on unigrams and the role of P(<UNK>). As you recall, for unigrams, P(x) is simply the probability of encountering x irrespective of the words preceding it.  A naïve (and logical) way to compute this would be to simply take the number of times x is observed and divide it by the number…
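The naïve estimate described above — count(x) divided by the total token count — takes one line with a counter. A toy example (the corpus here is invented for illustration):

```python
from collections import Counter

# Naive unigram estimate: P(x) = count(x) / total number of tokens observed.
corpus = ["the", "cat", "sat", "on", "the", "mat"]
counts = Counter(corpus)
total = len(corpus)

p_the = counts["the"] / total
print(p_the)  # 2/6 ≈ 0.333
```

Note that this estimate assigns zero probability to any word never observed, which is exactly where P(<UNK>) enters the picture.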


Perf tips for using the N-Gram service with WCF

The support in Visual Studio for WCF makes writing a SOAP/XML application for the Web N-Gram service a pretty straightforward process.  If you’re new to this, the Quick Start guide might be helpful to you.  There are a few tweaks you can make, however, to improve the performance of your application if you intend to…


The messy business of tokenization

So what exactly is a word, in the context of our N-Gram service?  The devil, it is said, is in the details. As noted in earlier blog entries, our data comes straight from Bing.  All tokens are case-folded and, with a few exceptions, all punctuation is stripped.  This means words like I’m or didn’t are treated as…
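A rough sketch of that normalization — case-folding followed by punctuation stripping — looks like the following. This is not the service’s actual rule set (which, as noted, has exceptions); it just shows why a word like I’m splits apart:

```python
import re

def normalize(text):
    # Case-fold everything...
    text = text.lower()
    # ...then strip punctuation by replacing it with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

print(normalize("I'm sure it didn't!"))  # ['i', 'm', 'sure', 'it', 'didn', 't']
```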


Wordbreakingisacinchwithdata

For the task of word-breaking, many different approaches exist.  Today we’re writing about a purely data-driven approach, and it’s actually quite straightforward — all we do is consider every character boundary as a potential word boundary, and compare the relative joint probabilities, with no insertion penalty applied.  A data-driven approach is great…
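The approach above can be sketched with brute-force enumeration: every internal character boundary is either a break or not, and the winning segmentation is the one with the highest joint probability. The probabilities below are toy values, and scoring words as a product of unigram probabilities is a simplification for illustration:

```python
from itertools import product

# Toy unigram probabilities; unseen words get a tiny floor probability.
P = {"word": 0.02, "breaking": 0.01, "wor": 1e-6, "dbreaking": 1e-9}

def segmentations(s):
    # Each of the len(s)-1 internal boundaries is either a break (1) or not (0).
    for bits in product([0, 1], repeat=len(s) - 1):
        words, start = [], 0
        for i, b in enumerate(bits, 1):
            if b:
                words.append(s[start:i])
                start = i
        words.append(s[start:])
        yield words

def best_split(s):
    def score(words):
        p = 1.0
        for w in words:
            p *= P.get(w, 1e-12)
        return p
    # No insertion penalty: segmentations are compared purely on probability.
    return max(segmentations(s), key=score)

print(best_split("wordbreaking"))  # ['word', 'breaking']
```

Brute force is exponential in the string length, of course; a real word-breaker would use dynamic programming instead, but the scoring idea is the same.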


The fluid language of the Web

We prepared, as we had for the earlier dataset, the top-100K words list for the body stream for Apr10.  You can download it here. We decided to take a closer look at the dataset to see how the top 100K lists changed between Jun09 and Apr10.  Our findings are interesting: The union of the word set…


Using the MicrosoftNgram Python Module

Over the past few posts I’ve shown some samples of the MicrosoftNgram Python module.  Writing documentation is not something the engineers I know enjoy doing; in fact, the only documentation available right now is through help(MicrosoftNgram).  Here’s an attempt to rectify the situation. To get started, you’ll of course need to get the module, which you…


Who doesn’t like models?

If there ever was an overloaded term in Computer Science, it’s models.  For instance, my colleagues in the eXtreme Computing Group have this terrific ambition to model the entire world!  What we’re talking about here is much simpler: it is a representation of a particular corpus. One of the key insights in studying how documents…


UPDATE: Serving New Models

Today’s post was delayed slightly but we have good news — announcing the availability of additional language model datasets.  As always, the easiest way to get a list is to simply navigate to http://web-ngram.research.microsoft.com/rest/lookup.svc.  Shown below are the new items, in URN form:

urn:ngram:bing-title:apr10:1
urn:ngram:bing-title:apr10:2
urn:ngram:bing-title:apr10:3
urn:ngram:bing-title:apr10:4
urn:ngram:bing-title:apr10:5
urn:ngram:bing-anchor:apr10:1
urn:ngram:bing-anchor:apr10:2
urn:ngram:bing-anchor:apr10:3
urn:ngram:bing-anchor:apr10:4
urn:ngram:bing-anchor:apr10:5
urn:ngram:bing-body:apr10:1
urn:ngram:bing-body:apr10:2…
