The dirty secret about large-vocabulary hashes

The first step in the n-gram probability lookup process is to convert the input into tokens, as discussed in an earlier post.  The second step is to convert each token into a numerical ID.  The data structure used here is a kind of hash.  You might be asking: why not a trie?  The simple answer…
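The token-to-ID step described above can be sketched with a plain dictionary standing in for the hash. The vocabulary, the IDs, and the reserved unknown-token ID below are all hypothetical examples, not the service's actual data:

```python
# A minimal sketch of the token -> numerical ID lookup, using a Python
# dict as the hash. Vocabulary contents and ID values are assumptions.
UNK_ID = 0  # reserved ID for out-of-vocabulary tokens (hypothetical)

vocab = {"the": 1, "quick": 2, "brown": 3, "fox": 4}

def tokens_to_ids(tokens):
    """Map each token to its ID; tokens not in the vocabulary map to UNK_ID."""
    return [vocab.get(tok, UNK_ID) for tok in tokens]

ids = tokens_to_ids(["the", "quick", "red", "fox"])  # "red" is out of vocabulary
```

A dict gives average O(1) lookup per token regardless of vocabulary size, which is one reason a hash can be preferred over a trie for this step.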

Did you mean…Schwarzenegger?

You launch your favorite search engine and enter a term or two.  Then near the top, the search engine helpfully suggests a possible alternate query, at which point you scratch your head and say “wha?” (or perhaps something more vulgar).  You think you can do a better job?  Now’s your chance to prove it —…

Well, do ya, P(<UNK>)?

Today we’ll do a refresher on unigrams and the role of P(<UNK>).  As you recall, for unigrams, P(x) is simply the probability of encountering x, irrespective of the words preceding it.  A naïve (and logical) way to compute this would be to simply take the number of times x is observed and divide it by the number…
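The naïve estimate described above is just a count divided by a total. A minimal sketch, using a toy corpus of my own invention rather than any real training data:

```python
from collections import Counter

# Naive unigram estimate: P(x) = count(x) / total tokens observed.
# The corpus below is a toy example, not real training data.
corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
total = len(corpus)

def p_unigram(x):
    """Relative-frequency estimate of P(x)."""
    return counts[x] / total
```

Note that any word never observed in the corpus gets probability exactly zero under this estimate, which is precisely the problem that reserving mass for P(<UNK>) is meant to address.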

Perf tips for using the N-Gram service with WCF

The support in Visual Studio for WCF makes writing a SOAP/XML application for the Web N-Gram service a pretty straightforward process.  If you’re new to this, the Quick Start guide might be helpful to you.  There are a few tweaks you can make, however, to improve the performance of your application if you intend to…
