Microsoft Research Speller Challenge is open for business

After a few bumps here and there, we have the site up and running.  If you'd prefer a write-up on the subject by a professional writer, I'll refer you to this announcement.

Some of you may be wondering why certain design choices were made in the process of designing this challenge.  I hope to address some of these today.  Some additional FAQs are available on the official site.

Why a web service?

At heart, much as with the Web N-Gram project, we're believers in data-driven discovery.  We believe that the programmatically accessed Web is very much the present and the future.

We also don't want to be in the business of hosting your code.  Aside from the obvious security implications of us running third-party code here, we don't want to dictate what platform, tools, etc., that you can use.  You decide, you host.

Why REST?

In a word, expedience.  Because our input and output specifications are so simple, using a more formal structure such as SOAP would only serve to slow everything down, both in terms of development and runtime.
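To give a feel for how little code a plain REST endpoint needs, here is a minimal sketch using only the Python standard library. The query parameter name `q`, the port, and the one-line response format are assumptions made for illustration, not the official challenge specification, and `correct` is a hypothetical placeholder where a real speller would go.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs


def correct(query: str) -> str:
    """Hypothetical placeholder speller: returns the query unchanged."""
    return query


def build_response(path: str):
    """Turn a request path into (status, content type, body bytes).

    Assumption: the phrase arrives in a query parameter named 'q'.
    """
    params = parse_qs(urlparse(path).query)
    query = params.get("q", [""])[0]
    body = correct(query).encode("utf-8")
    # The body is served as text/plain, never text/html.
    return 200, "text/plain; charset=utf-8", body


class SpellerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, content_type, body = build_response(self.path)
        self.send_response(status)  # a direct answer, no 301/302 redirect
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the service until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), SpellerHandler).serve_forever()
```

Calling `serve()` starts the service locally; a request such as `/spell?q=helo+wrld` would then come back as plain text. Keeping the request handling in `build_response` makes it easy to test the logic without opening a socket.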

Why the restrictions on the web service?

There are two main restrictions on the web service: no redirects, and the MIME type must be text/plain.  Both are motivated by our desire to prevent a malicious participant from making us inadvertently spam an unsuspecting web site.  Requiring the MIME type to be text/plain prevents you from pointing our code at a regular Web site (these generally return text/html as the MIME type, as they should).  Similarly, if you could redirect the request with a 301/302 HTTP response, you could easily appear to be a proper site at the time you register your web service URI, but redirect us to an unsuspecting third-party URL when it comes time to actually evaluate the speller.  Granted, these are not air-tight defenses against malicious intent, but they are better than nothing.
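If you want to check your own service against these two restrictions before registering it, something along the following lines may help. This is only a sketch of the idea, not the code our evaluator actually runs; the function names are made up for illustration, and the choice to ignore parameters such as `charset` when comparing the MIME type is an assumption.

```python
import urllib.error
import urllib.request
from typing import Optional


def is_acceptable(status: int, content_type: str) -> bool:
    """Return True only for a non-redirect response whose type is text/plain."""
    if 300 <= status < 400:
        return False  # redirects (301/302 and friends) are rejected
    # Assumption: parameters such as "; charset=utf-8" are ignored.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def fetch_checked(url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch url without following redirects; return the text, or None if rejected."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            content_type = resp.headers.get("Content-Type", "")
            if not is_acceptable(resp.status, content_type):
                return None
            return resp.read().decode("utf-8")
    except urllib.error.URLError:
        # Covers HTTPError (3xx/4xx/5xx) as well as connection failures.
        return None
```

An ordinary web page fails the check because it reports text/html, and a service that answers with a 302 fails because the redirect is never followed.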

What about the test data?

We've prepared a human-judged corpus to get you started with the process.  There are ~6000 phrases in this dataset, and these are the phrases with which we will call your web service up until the final submission period.   For the final submission period we will obviously have a different corpus.  We caution against tuning your algorithm too closely to the test dataset (a subject for a future blog.)

We're also encouraging folks who have training data to share it with the community.  For intellectual property reasons we won't host these datasets, but we do provide a mechanism through which you can let the community know they are available.

Where can I host my service?

Naturally I would advise you to host your code in Windows Azure, but as of this writing there are no free options there unless you already have an MSDN account (which itself is not free).  For the more budget-challenged, Amazon and Google both offer some hosting for free.  (This is not an endorsement.)  There are surely many others.  Keep in mind that while we don't intend to ever over-access your site, you are ultimately responsible for any charges or other issues with the hosting service of your choice, including rate-limiting.

What if I have more questions?

E-mail is always an option, but we strongly encourage you to use one of the social media options, for two reasons: (a) someone else may have the same question and would benefit from a shared response, and (b) there are more of you than there are of us -- and one of you may already know the answer!  This is about community, folks!