The dirty secret about large-vocabulary hashes

The first step in the n-gram probability lookup process is to convert the input into tokens, as discussed in an earlier post.  The second step is to convert each token into a numerical ID.  The data structure used here is a kind of hash. You might be asking, why not a trie?  The simple answer…
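
The rest of that post is cut off above, but the underlying idea is easy to show. Here is a minimal sketch of a hash-based token-to-ID lookup in Python; it illustrates the concept only and is not the service's actual implementation.

```python
# Illustrative sketch only: a plain dict (a hash table) mapping tokens to IDs.
# The service's real structure is far more compact, but the lookup idea is the same.
vocab = {}

def token_to_id(token, grow=False):
    """Return the numerical ID for a token, assigning a new one if allowed."""
    if token in vocab:
        return vocab[token]
    if grow:
        vocab[token] = len(vocab)   # next free ID
        return vocab[token]
    return -1                       # unknown token; real systems map this to <UNK>

ids = [token_to_id(t, grow=True) for t in "the quick brown fox the".split()]
print(ids)   # [0, 1, 2, 3, 0] -- repeated tokens share an ID
```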

Well, do ya, P(<UNK>)?

Today we’ll do a refresher on unigrams and the role of P(<UNK>). As you recall, for unigrams, P(x) is simply the probability of encountering x irrespective of the words preceding it.  A naïve (and logical) way to compute this would be to simply take the number of times x is observed and divide it by the number…
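
As a toy illustration of that naïve estimate (count of x divided by the total number of tokens), with unseen words falling back to whatever mass we reserve for <UNK>; the corpus and numbers here are made up.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)
total = sum(counts.values())

def p_unigram(word, unk_mass=0.0):
    """Naive MLE: count(word) / total tokens; unseen words fall back to P(<UNK>)."""
    if word in counts:
        return counts[word] / total
    return unk_mass   # whatever probability we reserve for <UNK>

print(p_unigram("the"))                 # 3/9
print(p_unigram("dog", unk_mass=1e-6))  # unseen, so we get the <UNK> mass
```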

Perf tips for using the N-Gram service with WCF

The support in Visual Studio for WCF makes writing a SOAP/XML application for the Web N-Gram service a pretty straightforward process.  If you’re new to this, the Quick Start guide might be helpful to you.  There are a few tweaks you can make, however, to improve the performance of your application if you intend to…

The messy business of tokenization

So what exactly is a word, in the context of our N-Gram service?  The devil, it is said, is in the details. As noted in earlier blog entries, our data comes straight from Bing.  All tokens are case-folded and, with a few exceptions, all punctuation is stripped.  This means words like I’m or didn’t are treated as…
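
A rough approximation of those rules in Python, for intuition only; the service's exact tokenizer (including how contractions like I’m and didn’t actually come out) is not reproduced here.

```python
import re

def rough_tokenize(text):
    """Crude approximation: lowercase (case-fold) and strip most punctuation.
    The service's exact rules differ in the details."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    return text.split()

print(rough_tokenize("I'm sure you didn't mean THAT!"))
# ['i', 'm', 'sure', 'you', 'didn', 't', 'mean', 'that']
```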

Wordbreakingisacinchwithdata

For the task of word-breaking, many different approaches exist.  Today we’re writing about a purely data-driven approach, and it’s actually quite straightforward — all we do is consider every character boundary as a potential word boundary and compare the relative joint probabilities, with no insertion penalty applied.  A data-driven approach is great…
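
Here is a toy sketch of that idea: treat every character boundary as a candidate word boundary, score each segmentation by its joint probability, and keep the best. The scoring table below is a made-up stand-in for the service's real joint-probability lookup, and the exhaustive search is only practical for short strings (the real thing would use the n-gram models and something smarter, such as dynamic programming).

```python
from itertools import product

# Toy stand-in for a joint-probability lookup (log10 probabilities, made-up numbers).
logp = {"is": -1.5, "a": -1.2, "cinch": -4.0, "with": -1.8, "data": -2.5}
UNSEEN = -20.0   # harsh floor for chunks we have never seen

def score(words):
    # No insertion penalty, as in the post: just the sum of log probabilities.
    return sum(logp.get(w, UNSEEN) for w in words)

def wordbreak(s):
    """Treat every character boundary as a potential word boundary and keep
    the segmentation with the highest joint score (exhaustive, so keep s short)."""
    best_score, best_words = float("-inf"), [s]
    for cuts in product([False, True], repeat=len(s) - 1):
        words, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:
                words.append(s[start:i])
                start = i
        words.append(s[start:])
        sc = score(words)
        if sc > best_score:
            best_score, best_words = sc, words
    return best_words

print(wordbreak("isacinchwithdata"))   # ['is', 'a', 'cinch', 'with', 'data']
```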

The fluid language of the Web

We prepared, as we had for the earlier dataset, the top-100K words list for the body stream for Apr10.  You can download it here. We decided to take a closer look at the dataset to see how the top-100K lists changed between Jun09 and Apr10.  Our findings are interesting: The union of the word set…
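
The findings themselves are truncated above, but the comparison is easy to reproduce once both lists are downloaded. A sketch, assuming one word per line and hypothetical local filenames:

```python
def load_words(path):
    """Assumes one word per line; the real files may carry counts as well."""
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

jun09 = load_words("top100k-jun09.txt")   # hypothetical filenames
apr10 = load_words("top100k-apr10.txt")

print("union:        ", len(jun09 | apr10))
print("overlap:      ", len(jun09 & apr10))
print("new in Apr10: ", len(apr10 - jun09))
print("dropped:      ", len(jun09 - apr10))
```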

Using the MicrosoftNgram Python Module

Over the past few posts I’ve shown some samples of the MicrosoftNgram Python module.  Writing documentation is not something engineers I know enjoy doing; in fact the only available documentation right now is through help(MicrosoftNgram).  Here’s an attempt to rectify the situation. To get started, you’ll of course need to get the module, which you…
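
The excerpt is truncated, so only a minimal sketch follows. The import and help() call are exactly what the post mentions; the commented-out lookup is an assumed usage whose class and method names should be confirmed against that help output (and you need the module installed plus your own user token).

```python
import MicrosoftNgram

# The built-in docs the post mentions:
help(MicrosoftNgram)

# Assumed usage (names may differ; confirm with the help output above):
# s = MicrosoftNgram.LookupService(token='YOUR-TOKEN',
#                                  model='urn:ngram:bing-body:apr10:3')
# print(s.GetConditionalProbability('a cute little dog'))
```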

UPDATE: Serving New Models

Today’s post was delayed slightly but we have good news — announcing the availability of additional language model datasets.  As always, the easiest way to get a list is to simply navigate to http://web-ngram.research.microsoft.com/rest/lookup.svc.  Shown below are the new items, in URN form: urn:ngram:bing-title:apr10:1 urn:ngram:bing-title:apr10:2 urn:ngram:bing-title:apr10:3 urn:ngram:bing-title:apr10:4 urn:ngram:bing-title:apr10:5 urn:ngram:bing-anchor:apr10:1 urn:ngram:bing-anchor:apr10:2 urn:ngram:bing-anchor:apr10:3 urn:ngram:bing-anchor:apr10:4 urn:ngram:bing-anchor:apr10:5 urn:ngram:bing-body:apr10:1 urn:ngram:bing-body:apr10:2…
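
If you'd rather pull that list programmatically, a small sketch follows; it assumes the endpoint returns the URNs as plain text, so check the actual response format.

```python
from urllib.request import urlopen

# Fetch the model listing from the lookup endpoint mentioned above.
# The response format is assumed to be a plain-text list of URNs.
with urlopen("http://web-ngram.research.microsoft.com/rest/lookup.svc") as resp:
    body = resp.read().decode("utf-8", errors="replace")

for line in body.splitlines():
    if line.strip().startswith("urn:ngram:"):
        print(line.strip())
```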

Generative-Mode API

In previous posts I wrote about how the Web N-Gram service answers the question: what is the probability of word w in the context c?  This is useful, but sometimes you want to know: what are some words {w} that could follow the context c?  This is where the Generative-Mode APIs come into play. Examples…
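
The examples are cut off above; purely to illustrate the question being asked (not the service's actual interface), here is a toy version backed by a small in-memory table with made-up numbers.

```python
# Toy stand-in: conditional probabilities P(w | c) for one context (made-up numbers).
# The real Generative-Mode APIs return such candidates from the Bing-scale models.
cond_prob = {
    "new york": {"city": 0.21, "times": 0.13, "state": 0.07, "yankees": 0.04},
}

def words_that_could_follow(context, k=3):
    """Return the top-k candidate next words {w} for a context c."""
    candidates = cond_prob.get(context, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

print(words_that_could_follow("new york"))   # ['city', 'times', 'state']
```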

Language Modeling 102

In last week’s post, we covered the basics of conditional probabilities in language modeling.  Let’s now have another quick math lesson on joint probabilities. A joint probability is useful when you’re interested in the probability of an entire sequence of words.  Here I can borrow an equation from Wikipedia: The middle term is the…
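
The equation itself did not survive the excerpt; presumably it is the standard chain-rule expansion of a joint probability, which for a word sequence reads:

```latex
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```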
