Using the MicrosoftNgram Python Module


Over the past few posts I've shown some samples of the
MicrosoftNgram Python module.  Writing documentation is not something the engineers
I know enjoy doing; in fact, the only documentation available right now is
through help(MicrosoftNgram).  Here's an attempt to rectify the situation.

To get started, you’ll of course need to get the module,
which you can download here.

The main class is named LookupService.  An instance of this
object encapsulates two crucial pieces of information: (a) the user token, and
(b) the language model of interest.  The user token is a GUID issued by
Microsoft Research.  This is something we use to track the amount of usage;
neither the phrases nor the models used are tracked, in the interest of protecting
users' privacy.  The language model is the dataset against which you can query probabilities.  Details on language models were covered in last week's post, but in a nutshell a model has three properties: source, version, and order.  The following instantiations of the constructor are all functionally equivalent, provided that (i) you use your actual GUID rather than the xx… placeholder, and (ii) for the first case, you've set an environment variable called NGRAM_TOKEN to your GUID.

>>> s = MicrosoftNgram.LookupService()
>>> s.GetModel()
'bing-body/jun09/3'
>>> s = MicrosoftNgram.LookupService(token='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx')
>>> s.GetModel()
'bing-body/jun09/3'
>>> s = MicrosoftNgram.LookupService('xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx')
>>> s.GetModel()
'bing-body/jun09/3'
>>> s = MicrosoftNgram.LookupService(token='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',model='bing-body/jun09/3')
>>> s.GetModel()
'bing-body/jun09/3'
>>> s = MicrosoftNgram.LookupService('xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx','bing-body/jun09/3')
>>> s.GetModel()
'bing-body/jun09/3'

 I prefer to set the environment variable, since I can’t be bothered to memorize a GUID.  Speaking of environment variables, note that MicrosoftNgram uses urllib under the covers, so if you need to specify a proxy, set HTTP_PROXY appropriately.
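
For completeness, here is a minimal sketch of doing both from inside a script rather than the shell; the GUID and proxy host below are placeholders, and since urllib reads the proxy settings from the environment, setting HTTP_PROXY in your shell before launching Python is the safer route.

import os

# Placeholder values; substitute your own GUID and, if needed, your proxy host.
os.environ['NGRAM_TOKEN'] = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
os.environ['HTTP_PROXY'] = 'http://proxy.example.com:8080'  # only needed behind a proxy

import MicrosoftNgram

# With NGRAM_TOKEN set, the no-argument constructor picks up the token.
s = MicrosoftNgram.LookupService()
print s.GetModel()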

Once you have a LookupService object, you can call the various methods. 

>>> s = MicrosoftNgram.LookupService(model='bing-body/apr10/5')
>>> s.GetConditionalProbability('happy cat is happy')
-0.93900499999999998
>>> s.GetConditionalProbability('happy cat is sad')
-4.2167089999999998
>>> s.GetJointProbability('kthxbai')
-7.6080370000000004
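
Since these are base-10 log probabilities, the ratio between the two conditional probabilities is easy to recover; a quick back-of-the-envelope sketch using the numbers above:

# The service returns log10 probabilities, so the likelihood ratio is
# 10 raised to the difference of the two values.
happy = -0.939005  # log10 P(happy | happy cat is)
sad = -4.216709    # log10 P(sad | happy cat is)
print 10 ** (happy - sad)  # roughly 1.9e3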

Well, it's good to know that happy cat is over a thousand times more likely to be happy than sad.  What else can happy cat be?

>>> for t in s.Generate('happy cat is', maxgen=5): print t
...
('always', -0.36325089999999999)
('a', -0.89422170000000001)
('happy', -0.93900499999999998)

So we know that happy cat is never sad (well, most likely anyway; bing-body/apr10 has a unigram cutoff of 10).  We can further infer that when computing the conditional probability above, we must have backed off to a lower-order n-gram.
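
One way to sanity-check that inference from the API itself, reusing the LookupService instance from above (a small sketch):

# Collect the continuations the model actually stores for this context.
continuations = [word for (word, logprob) in s.Generate('happy cat is', maxgen=5)]
print 'sad' in continuations  # False: 'sad' is not among the stored continuations
# Yet GetConditionalProbability('happy cat is sad') returned a finite value above,
# so that estimate must have come from a backed-off, lower-order n-gram.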

 

Comments (1)

  1. Viswanath says:

    We have one problem: we need all 5-grams that contain a particular word (say, for example, all 5-grams that contain sachin). Currently, with the generate service, we can only get all "one succeeding" words after a given context (phrase). Is there any way to get all 5-grams that contain a particular word?

    We figured it out this way:

         1. First, call the generative service, through which we get all "one succeeding" words.

         2. Now we have a context, say "sachin tendulkar", and we make one more Generative Service call, through which we get all "one succeeding" words after that context. Then we multiply both probabilities (joint probability).

    Now the problem is: through this procedure we can only get succeeding words, but we need the left context of a given word. If you have any solution, please let us know.
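
For reference, a rough sketch of the two-step chaining described in the comment above; the seed word, maxgen value, and model are illustrative, and since the service returns log10 probabilities the two scores are added rather than multiplied.

import MicrosoftNgram

# Illustrative only: extend a seed word to the right by chaining Generate calls.
s = MicrosoftNgram.LookupService(model='bing-body/apr10/5')
seed = 'sachin'
for (word1, logp1) in s.Generate(seed, maxgen=5):
    context = seed + ' ' + word1
    for (word2, logp2) in s.Generate(context, maxgen=5):
        # Log10 probabilities, so the combined score is a sum.
        print context + ' ' + word2, logp1 + logp2

As the comment points out, this only grows the phrase to the right; it does not enumerate the left contexts of a given word.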