An academic evaluation of the Office 2007 contextual spelling checker

A few days ago, I discovered an analysis of our Office 2007 contextual speller carried out by Prof. Graeme Hirst, from the University of Toronto: An Evaluation of the Contextual Spelling Checker of Microsoft Office Word 2007.

We have discussed this new context-sensitive speller on several occasions on this blog (as well as here) and it is nice to see that it is attracting the attention of researchers in the academic world.

It’s an interesting paper, which provides some food for thought, especially with respect to how “aggressive” we should be in our approach to recall.

His conclusion nicely sums up our trade-offs and dilemmas (emphasis mine):

In an evaluation on 1400 examples, it is found to have high precision but low recall — that is, it fails to find most errors, but when it does flag a possible error, it is almost always correct.


The contextual spelling corrector in Microsoft Office Word 2007 is a cautious (low recall) but believable (high precision) system. However, its overall performance, as measured by F, is much poorer than that of the trigram method of Mays et al (1991).

The trade-off between the two systems is a difficult one. In simple terms, better performance is better; but believability is an important attribute for a consumer-level system (“if Word says it’s wrong then it’s wrong”) and could well be considered worth sacrificing performance for.  The problem with this, however, is that as users become familiar with the system, their expectations will rise and believability will start to apply also to what Word fails to flag (“If Word says it’s right then it’s right”).

A system that is more visibly error-prone might actually serve users better.

The methodology used by Prof. Hirst and his colleagues to evaluate the system deserves a few comments:

·         They automatically induced real-word errors by replacing words with any spelling variation found in the lexicon of the ispell spelling checker, limiting the manipulation to an edit distance of 1. These errors are therefore not natural mistakes.

·         They did not consider “malapropisms” (real-word mistakes) involving closed-class words and words formed by the insertion or deletion of an apostrophe or by splitting a word: this means they exclude pairs which we have found to be extremely frequent in real texts (then/than; your/you’re; its/it’s; everyday/every day; to/too; their/there/they’re…). These pairs feature prominently in any analysis of real mistakes, especially in the literature devoted to English as a Second Language. Everyone knows that many native speakers of English have a lot of difficulty mastering these confusables, which is why we decided to specifically target them.

·         They did not include phonetic confusables such as cymbal/symbol, principle/principal, pear/pair, there/their, which have an edit distance > 1.
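To make the first and third points concrete, here is a minimal Python sketch of how real-word errors at edit distance 1 could be induced from a lexicon. It is an illustration only, not the procedure used in the paper: the tiny `LEXICON` stands in for the ispell word list, the function names `edits1` and `real_word_variants` are hypothetical, and I am assuming that one "edit" means a single insertion, deletion or substitution (whether transpositions were also counted is not something this post states). The exclusion of closed-class words is not modeled.

```python
import string

# Toy lexicon standing in for the ispell word list (an assumption for illustration).
LEXICON = {"money", "monkey", "honey", "to", "too", "top",
           "their", "there", "pear", "pair"}


def edits1(word):
    """Return every string one insertion, deletion or substitution away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    substitutes = {left + c + right[1:]
                   for left, right in splits if right
                   for c in letters}
    inserts = {left + c + right for left, right in splits for c in letters}
    return (deletes | substitutes | inserts) - {word}


def real_word_variants(word, lexicon=LEXICON):
    """Lexicon entries at edit distance 1 from word, i.e. candidate real-word errors."""
    return sorted(edits1(word) & lexicon)


print(real_word_variants("money"))  # ['honey', 'monkey']
print(real_word_variants("pear"))   # [] because 'pair' is two edits away
print(real_word_variants("their"))  # [] because 'there' is two edits away
```

With this kind of generator, "money" can turn into "monkey", while pairs such as pear/pair or their/there can never be produced, which is why they are absent from a test set built this way.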

The categories they did not include in their tests are precisely those which we focused on, because flagging these real and frequent mistakes is very useful for users of Office and Word. So assessing the “performance” of a system by ignoring these may be a bit unfair, at least if one equates “performance” and “usefulness”: will users find the system more useful if we flag “have not lost monkey” (→ money), a rare and unnatural mistake, or if we flag “it is to expensive”, a mistake our data shows is very frequent and which we seem to be good at flagging?

Recall would be a lot higher if pairs involving closed-class words and the standard phonetic confusables above were taken into account: our own metrics, based on a large corpus of real mistakes, show that our recall is in fact higher than the 20-25% found by Hirst, and is around 40%. The alternative methods which he proposes have even higher recall (50%), but their precision (50%) is way lower than our system’s (96%).

Hirst clearly favors recall-based performance. His question is: do people want to use a system like Microsoft’s, which only spots one mistake out of 5 (our metrics show it’s in fact closer to 2 out of 5, i.e. 40%) and is right nearly all the time? Our question is: would users really want a system based on the trigram method advocated by Prof. Hirst, which flags 50% of the mistakes but is wrong in 50% of the cases? The feedback we generally get indicates that our users tend to prefer unobtrusive tools and switch off a tool which they consider unreliable.
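For readers who want to see how these figures combine, here is a small Python sketch of the F-measure. It assumes that the "F" in the quoted conclusion is the usual weighted harmonic mean of precision and recall, and it uses only the round percentages quoted in this post, not the detailed results from the paper. The function name `f_measure` and the `beta` weighting parameter are my own choices for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more heavily.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Round figures quoted in this post (assumptions, not the paper's tables).
word_hirst_recall = f_measure(0.96, 0.20)  # Word, recall as measured by Hirst
word_our_recall = f_measure(0.96, 0.40)    # Word, recall from our own corpus
trigram = f_measure(0.50, 0.50)            # trigram method

print(round(word_hirst_recall, 2))  # 0.33
print(round(word_our_recall, 2))    # 0.56
print(round(trigram, 2))            # 0.5
```

With a balanced F, the comparison flips depending on which recall figure is used, and changing `beta` moves the result further toward whichever of precision or recall one values more. That is the trade-off in numerical form.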

Interesting debate, isn’t it? I am really grateful to Prof. Hirst for making this discussion possible.

So, what do you think? We are interested in hearing your opinion. Do you prefer a tool which casts the net as wide as possible and catches many mistakes, at the risk of being frequently wrong and of creating many false flags (false positives), or do you prefer a tool which does not catch all possible mistakes, but which you can trust when it does catch one? Do not hesitate to leave your comments below…

Thierry Fontenelle – Program Manager


Comments (8)

  1. Mark Sowul says:

    Well, I’ll bite – although the "if Word thinks it’s right then it’s right" concern is quite valid, I think accuracy is more important, otherwise just from a visual perspective, if there is a significant number of blue lines present, each of them loses its significance, especially if half of them are meaningless.  Just as it’s hard to spot the real spelling errors when lots of names (or technical terms) are flagged.  One has to comb through and ignore the invalid ones in order for the real ones to stand out.

  2. Mark Sowul says:

    Unrelated question as I proofread my comment after the fact: why isn’t a speller/grammar checker made available system-wide (i.e. in text boxes)?  Even if it’s just a very basic one by default that can be replaced by more sophisticated ones (e.g. if you install Office, the Office speller can be used).

  3. Katie says:

    This is fascinating. As an ESL tutor, I encounter tons of these exact errors. Most of my students, when faced with a zig-zag underline, automatically accept Word’s recommendations; they assume the program knows more than they do about this foreign language. However, they remain aware that there may be mistakes even after Word has made all its recommendations, and they usually have other students or a tutor look over their work. In my experience, the above is almost universal. So, from my perspective, your high precision/low recall method is far more useful than that proposed by Hirst.

  4. Terry says:

    I have to agree that fewer matches and higher accuracy are desirable. I’ve developed an online grammar checker, and my goal is to help reduce errors, not find every single one (since the false positive cost is too high). I prefer to have users know that our site will help them, but it’s up to humans (e.g. their teachers) to take them to the next level.


  5. Iain says:

    I tend to batch-spell-check after I’m done writing, which I find less distracting.  I almost always know how to spell the word, so I’d rather have a few wrong results (which I can easily discount) and catch more of my errors.  Is the algorithm so designed that users could tune it (with a slider) to their confidence in the language?

  6. Z.Y. Niu says:

    I prefer the high-recall but low-precision system. I often used Word to check English paper written by myself and expect it can spot all possible errors since I am not a native English speaker.

  7. Alex says:

    I usually prefer to escape Office and Word spell checking as it makes me to forget English. As I am not a native speaker I must always be aware of proper spelling.  In this case I think is better  to check entire text at the end of writing using online services such as http://www.spellchecker/ Here I am learning on my own mistakes and do not rely on Office and Word spell check programs which are not perfect.
