Microsoft Web N-Gram

Bringing you web-scale language model data. Web N-Gram is a joint project between Microsoft Bing and Microsoft Research.

Microsoft Research Speller Challenge is open for business

After a few bumps here and there, we have the site up and running. If you prefer a write-up by a...

Author: cthrash99 Date: 01/20/2011

The dirty secret about large-vocabulary hashes

The first step in the n-gram probability lookup process is to convert the input into tokens as...

Author: cthrash99 Date: 12/27/2010

Did you mean...Schwarzenegger?

You launch your favorite search engine and enter a term or two. Then near the top, the search engine...

Author: cthrash99 Date: 12/15/2010

Well, do ya, P(<UNK>)?

Today we'll do a refresher on unigrams and the role of the P(<UNK>). As you recall, for...

Author: cthrash99 Date: 12/13/2010

Perf tips for using the N-Gram service with WCF

The support in Visual Studio for WCF makes writing a SOAP/XML application for the Web N-Gram service...

Author: cthrash99 Date: 12/06/2010

The messy business of tokenization

So what exactly is a word, in the context of our N-Gram service? The devil, it is said, is in the...

Author: cthrash99 Date: 11/29/2010

Wordbreakingisacinchwithdata

For the task of word-breaking, many different approaches exist. Today we're writing about a purely...

Author: cthrash99 Date: 11/22/2010

The fluid language of the Web

We prepared, as we had for the earlier dataset, the top-100K words list for the body stream for...

Author: cthrash99 Date: 11/15/2010

Using the MicrosoftNgram Python Module

Over the past few posts I've shown some samples of the MicrosoftNgram Python module. Writing...

Author: cthrash99 Date: 11/08/2010

Who doesn't like models?

If there ever was an overloaded term in Computer Science, it's models. For instance, my colleagues...

Author: cthrash99 Date: 11/01/2010

UPDATE: Serving New Models

Today's post was delayed slightly but we have good news — announcing the availability of...

Author: cthrash99 Date: 10/25/2010

Generative-Mode API

In previous posts I wrote how the Web N-Gram service answers the question: what is the probability...

Author: cthrash99 Date: 10/18/2010

Language Modeling 102

In last week's post, we covered the basics of conditional probabilities in language modeling. Let's...

Author: cthrash99 Date: 10/11/2010

Language Modeling 101

The Microsoft Web N-Gram service, at its core, is a data service that returns conditional...

Author: cthrash99 Date: 10/04/2010

What can data do for you?

Let's think of the scale of different lexicons, in terms of order of magnitude: 1,000 - the...

Author: cthrash99 Date: 09/27/2010