Why a really large dictionary is not a good thing


Sometimes you'll see somebody brag about how many words are in their spell-checking dictionary. It turns out that having too many words in a spell checker's dictionary is worse than having too few.

Suppose you had a spell checker whose dictionary contained every word in the Oxford English Dictionary. Then you hand it this sentence:

Therf werre eyght bokes.

That sentence would pass with flying colors, because every word in it is a valid English word, though most people would be hard-pressed to provide definitions.

The English language has so many words that if you included them all, common typographical errors would often match a valid English word by coincidence and therefore go undetected. That would defeat the whole point of a spell checker, which is to catch spelling errors.
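
The effect can be sketched in a few lines of Python. This is a minimal illustration, not how any real spell checker is built: the two word lists are tiny hypothetical stand-ins, and `flag_misspellings` is a name invented for the example.

```python
# Hypothetical word lists for illustration only; not real dictionary data.
COMMON_WORDS = {"there", "were", "eight", "books"}
# Obscure entries that a very large dictionary might also accept (assumed).
RARE_WORDS = {"therf", "werre", "eyght", "bokes"}


def flag_misspellings(text, dictionary):
    """Return the words in text that are not found in dictionary."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [w for w in words if w and w not in dictionary]


sentence = "Therf werre eyght bokes."

# Small dictionary: every word in the sentence is flagged.
small_result = flag_misspellings(sentence, COMMON_WORDS)

# Huge dictionary: the same sentence passes untouched.
large_result = flag_misspellings(sentence, COMMON_WORDS | RARE_WORDS)

print(small_result)
print(large_result)
```

With the small list, all four words are reported; once the obscure entries are merged in, nothing is.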

So be glad that your spell checker doesn't have the largest dictionary possible. If it did, it would end up doing a worse job.

After I wrote this article, I found a nice discussion of the subject of spell check dictionary size on the Wintertree Software web site.

[Raymond is currently on vacation; this message was pre-recorded.]

Comments (42)
  1. Of course, this is only true if you have a "binary" spell check. An alternate algorithm could use a Bayesian algorithm to determine the likelihood of word misuse based upon the user’s habits. More:

    http://www.tallent.us/CommentView.aspx?guid=55ca45e6-f5bc-4ccf-9d15-49f3d20a57ba

  2. mike says:

    Using the OED as an example isn’t quite fair, though — it is by definition a historical dictionary, precisely not one that attempts to canonize the contemporary lexicon. As a bad analogy, it would be like having syntax checking in VB.NET that allowed any syntax since Dartmouth BASIC. :-)

  3. (6) says:

    Huh? Sorry, the Oxford English Dictionary is a historical dictionary? This is untrue, since it is still used heavily in England and contains many of our modern colloquialisms.

    It is true that it does contain historical vocabulary, but I believe it to be incorrect to classify it as a historical dictionary.

  4. David S says:

    Speaking of dictionaries, on my way home from work yesterday I was thinking about how lame many program’s spelling is (in particular, Photoshop). Then I got to thinking: what if the OS provided a dictionary that programs could tie into? Wouldn’t that be grand? Then, as you add words to your dictionary, those are available to any applications that query the dictionary. Of course, controlling the addition of words may get a bit hairy, but what does everyone think?

  5. Cooney says:

    Congratulations, you’ve just invented ispell! where would you like your copy of /usr/dict/words delivered?

  6. Derek says:

    Agreed with Richard about your implied argument of the antiquity of binary spell checks. However, I’m not certain a user would want to have to train their spell checker. But I’m not discounting your approach.

    The article (but not the author of this blog entry) made mention of using multiple dictionaries. A medical researcher might wish to include a medical dictionary. From the user’s perspective, this would be the same as using one huge dictionary where each word has metadata (this word is a medical term, this word is a general-purpose term, etc.) and the user can choose which subset(s) of words to spell-check against (English, general-purpose, American spellings only).

    Yet another possibility would be to categorize words and spellings by usage frequency and words below a certain threshold would be identified as misspelled, but this approach shouldn’t be used alone.
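
A rough sketch of the two ideas above (one big word list with per-word metadata, plus a usage-frequency cutoff) might look like this. The entries, tags, and frequency numbers are all invented for illustration, and `is_accepted` is a hypothetical helper, not part of any real spell-checking API.

```python
# Hypothetical entries: word -> (set of tags, usage frequency per million words).
# The tags and the numbers are made up for this example.
LEXICON = {
    "color": ({"american"}, 120.0),
    "colour": ({"british"}, 110.0),
    "patient": ({"general"}, 90.0),
    "stent": ({"medical"}, 2.0),
    "bokes": ({"general"}, 0.01),
}


def is_accepted(word, enabled_tags, min_frequency=0.0):
    """Accept a word only if it belongs to an enabled subset
    and is used at least as often as the chosen threshold."""
    entry = LEXICON.get(word.lower())
    if entry is None:
        return False
    tags, frequency = entry
    return bool(tags & enabled_tags) and frequency >= min_frequency


# A general-purpose American English user: "stent" and "colour" are flagged.
general_us = {"general", "american"}
print(is_accepted("stent", general_us), is_accepted("colour", general_us))

# A medical researcher enables the medical subset as well.
print(is_accepted("stent", general_us | {"medical"}))

# The frequency cutoff weeds out rare words even within an enabled subset.
print(is_accepted("bokes", {"general"}, min_frequency=1.0))
```

As the comment says, the frequency cutoff probably should not be used alone: a rare but legitimate technical term would be flagged just like a typo.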

  7. A regular viewer says:

    Better still, lets all learn to spell properly.

    Seriously, most of the time, I turn it off. It’s annoying and close to a third of my vocabulary is not in the database, not to mention the horror of having all proper nouns flagged.

  8. keithmo [exmsft] says:

    Back when Intel had just introducted the Pentium processor, "Pentium" was not in MS Word’s dictionary. The suggested replacement? "Penis".

  9. Claw says:

    > what if the OS provided a dictionary that programs could tie into? Wouldn’t that be grand? Then, as you add words to your dictionary, those are available to any applications that query the dictionary.

    …and then wait for all your competitors to sue you for taking advantage of your monopoly position by bundling.

  10. Dave says:

    in another life, i worked on the spell checker for the last DOS version of microsoft works — ver 3.0, i think it was. my favorite bug was, in the german version, if you ran the spellchecker on the compound word "Feuer-Betriebsunterbrechungs-Versicherungsbedingungen" it would crash. now THAT’s a word. :-)

  11. Mat says:

    Well, I’ve always been a proponent of two dictionaries… common vs. all. Shreaking red lined words if they match neither, but a more gentle "not common" purple or something.
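
The two-dictionary idea could be sketched roughly as follows. The word sets are tiny hypothetical samples, and the names `Verdict`, `classify`, and `review` are made up for the example; how the two verdicts are drawn (red versus a gentler color) is left to the editor.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"              # in the common dictionary: no marking
    UNCOMMON = "uncommon"  # only in the full dictionary: gentle hint
    UNKNOWN = "unknown"    # in neither dictionary: loud red underline


# Hypothetical sample data, for illustration only.
COMMON = {"there", "were", "eight", "books"}
ALL = COMMON | {"therf", "werre", "eyght", "bokes"}


def classify(word):
    """Look a word up in the common list first, then in the full list."""
    w = word.lower()
    if w in COMMON:
        return Verdict.OK
    if w in ALL:
        return Verdict.UNCOMMON
    return Verdict.UNKNOWN


def review(text):
    """Return (word, verdict) pairs for every word that needs attention."""
    words = [w.strip(".,;:!?") for w in text.split()]
    return [(w, classify(w)) for w in words if w and classify(w) is not Verdict.OK]


print(review("There were eyght bookz."))
```

Here "eyght" would get the gentle marking and "bookz" the loud one, while the common words pass silently.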

  12. Vigor says:

    I guess that’s a special problem of the English language. German doesn’t have so many obscure words that are similar to common words. I’m sure this has to do with the fact that German words are generally longer than English ones. Also, a recent edition of the German Duden contains only about 120,000 entries, which includes many compound words.

    Speaking of compound words: I’m sure the way we form words by joining several existing ones (like "Luftfahrtgesellschaft"(*), meaning "air line") drives authors of German spellchecking software insane.

    (*) "Fluggesellschaft", meaning the same thing, is much more common.

  13. josh says:

    A small dictionary doesn’t completely solve the problem, since you still have common words that are a few typos away from other common words: of vs. off, effect vs. affect, you vs. your vs. yours, etc. Even humans have trouble spell checking some of these.

  14. Henk Devos says:

    David S:

    Speaking of dictionaries, on my way home from work yesterday I was thinking about how lame many program’s spelling is (in particular, Photoshop). Then I got to thinking: what if the OS provided a dictionary that programs could tie into? Wouldn’t that be grand? Then, as you add words to your dictionary, those are available to any applications that query the dictionary

    That’s how it is on Mac. If you type something in TextEdit (the equivalent of Notepad) you get the same spelling check as in Mail. There’s also a general Text To Speech that all applications can use.

    Claw:

    …and then wait for all your competitors to sue you for taking advantage of your monopoly position by bundling.

    Only if Microsoft makes sure competing spell checkers can’t work properly.

    But to get back to the original subject:

    The real problem is that spelling checkers are dumb. They have no context. They don’t know when to use of and when to use off. As long as this is not solved spelling checkers will never work great, although they are still useful.

    I wonder how spelling checkers handle languages like Turkish, where you construct sentences by adding suffixes to words…

  15. J. Edward Sanchez says:

    Here’s another example of something that doesn’t get flagged as an error, for the same reason: "spell check". Everyone* I know, including every single person in this thread, uses this phrase instead of "spelling check".

    *Wizards excluded. They always seem to know better, for some reason.

  16. Claw:

    > …and then wait for all your competitors to sue you for taking advantage of your monopoly position by bundling.

    > Only if Microsoft makes sure competing spell checkers can’t work properly.

    Microsoft didn’t prevent Netscape, Quicktime or RealPlayer from working properly – they all worked fine – yet look at the antitrust trial and the recent European Union decision.

    (In fact, the cases where it was claimed in the trials that MS sabotaged other apps were bogus. See the whole Realplayer G5 Beta fiasco – it ended up being a problem with Real’s installer being badly written).

    So yes, they would be sued by their competitors for bundling. Even if their competitors were working properly. Because that’s what just happened, and they got fined $500MM for the privilege.

  17. "Better still, lets all learn to spell properly."

    Yes, lets. Then well have no trouble reading each others writing. And we wont need spilling checkers!

  18. SI says:

    speaking of bundling, does this mean MS will be sued for its firewall and antivirus in SP2?

    will asus be sued for adding hardware firewall to its mainboards? where will this insanity end?

  19. Mike says:

    I notice two people here have misspelled let’s as lets e.g. "Better still, lets all learn to spell properly." Plus in the above posts we have "intoducted", and "shreaking".

    Word does support additional specialist add-in dictionaries for Medical, Legal use etc and has done so for many, many versions. Since Word 2002 it has been possible to change the squiggle colour (but not on a per-lexicon basis): http://support.microsoft.com/?scid=kb;es-es;E284845 .

    "The real problem is that spelling checkers are dumb. They have no context." This is because Word does not give the spell-checker context, not because the spell-checker can’t handle it.

    "not to mention the horror of having all proper nouns flagged. "

    I wonder which spell-checker does this?

  20. "I notice two people here have misspelled let’s as lets…"

    Two? I’d hoped it was obvious that my comment was satirical. Sometimes throwing in a smileyface feels too much like laughing at your own joke.

    For that matter, it’s quite possible that the previous post that misspelled "let’s" was also satirical. After all, it was a spelling error that a spelling checker would not catch. Just like the four–count ’em–four spelling errors in my post.

    In any case, here’s the missing smileyface… :-)

  21. A regular viewer says:

    Agreed.

    Yet, how much will it break functionality when checks are added to ignore words capitalised within sentences or when enclosed within quotation marks? Being a word processor, all emphases are added using by changing font display, bold, italics, etc. Therefore, with sentences like,

    In continuation of our discussions with your Mr. Srinivasan Raghavan, we herewith provide our most quotation for two of our Sudharshan Nagar Residential Plots. Please ensure to receive the <i>patta</i> before making the payment.

    "Srinivasan Raghavan", "Sudharshan Nagar" and "patta" are not shown as spelling mistakes.

  22. A regular viewer says:

    "all emphases are added using by changing" –> "all emphases are added by changing"

    "most quotation" –> "most attractive rates".

    note: The above is from an actual document, with just the names changed.

  23. A regular viewer says:

    Apologies. Using a text editor to point out issues with a word processor is very confusing!

    The sentence quoted above should read

    In continuation of our discussions with your Mr. Srinivasan Raghavan, we herewith provide our most quotation for two of our Sudharshan Nagar Residential Plots. Please ensure to receive the "patta" <i>before</i> making the payment.

  24. David S says:

    I don’t think MS would get sued for offering a common dictionary for others to use. In fact, if they just opened up the Office dictionary, that’d be great. Not very antitrust.

    Now, if it was setup so that MS apps would access the dictionary faster… that’d be a problem.

    Shared dictionaries make sense (notice: not necessarily the same spell checker, just the dictionary) as other OSes are implementing it. There should be the common dictionary plus a custom one for each user.

  25. redvamp128 says:

    Go to any language translation pages and/or programs. For example, type up a paper about 1 page long with clear sentence structure and proper grammar, and spell check it. Then take that page and have it translated into, say, French. Then take that French page, have it translated back into English, and run it through a spell checker and grammar checker. I can tell you it does not look pretty.

  26. just me. says:

    You can access the Office spell checker using COM. I’ve done it, so I know it’s possible.

  27. redvamp128 says:

    http://blogs.msdn.com/oldnewthing/archive/2004/04/01/105582.aspx#105617

    Case in point- anyone else read the book.

    "THE WIZARD OF OUNCES." ?"Where the reader sees the word as being an abbreviation when it is a proper name of a place in the book. Oz because it has a period behind it.

  28. A regular viewer says:

    No, the "lets" in my post was not deliberate. I made a mistake. But the issue(s) remains.

    1) The "lets" is not flagged ’cause it is a valid word.

    2) As mentioned above, homonyms can really **ss one OF (sic).

    3) It is extremely annoying to see red line all over the document, when one works with fonts for other languages. Due to the really messy situation of a standardised Tamil Keyboard Layout, despite honourable efforts by the UNICODE team, almost 95% of us use a TrueType Font, to key in Tamil. Further 75% of all the correspondence is bi-lingual, i.e., English in Roman script interspersed with Tamil in Dravidian script. So turning the spelling off is generally not preferred. I know this is not the fault of the Spelling Check feature, but the annoyance remains.

    4) For the life of me, I cannot get MS Word to use British spelling. I know it is possible, but somehow I am unable to achieve it. The "S"-s & "U"-s are very important to me.

    5) I live in India, work in India for Indians. Cities, People, Organisations, all have un-Anglo-Saxon names, to coin a term. At least an option that skips checking for spelling mistakes, when one capitalises a word, within a sentence, will go a long way in reducing the discomfort ("horror" was an overstatement, I admit) of seeing red lines all over my documents.

    Regards

    Kaushik Janardhanan

  29. n4cer says:

    Built-in spelling/grammar checking will be available in Longhorn via the System.NaturalLanguageServices namespace.

    http://longhorn.msdn.microsoft.com/lhsdk/ref/system.naturallanguageservices.aspx

    Background Spelling for Avalon TextBoxes

    http://wesnerm.blogs.com/net_undocumented/2003/11/background_spel.html

    Spell Checked Text Box (includes screenshot)

    http://www.longhornblogs.com/rdawson/archive/2004/02/09/2451.aspx

  30. A regular viewer says:

    Interesting. Looking forward to that.

    Registry, User Information, Word Lists. Any other system wide dbases that are/can-be exposed?

  31. Mike says:

    1) that’s where a grammar checker comes in handy

    2) yes, but you can force common homonyms to be flagged so that you can manually check for them

    3-5) UK spelling is easy (as is Australian & Canada). Make sure your computer is properly set up with UK English for your keyboard language. Unfortunately almost every OEM who sells in the English speaking world sets up the computer with English US settings, and Windows, Office etc follow that. Word tags each word in a document according to the keyboard language.

    You can also select any text range and mark it to skip spelling and grammar checking. Use the Set Language dialog. Or you could simply mark that text as Tamil etc so that the English proofing tools skip over them. Setting up multiple keyboard languages that can be toggled is useful for multilingual users. Note that you do not have to change the keyboard LAYOUT, only the keyboard LANGUAGE.

    Once a document is "finished" I would recommend tagging it all to skip spell-checking. That means that you and subsequent readers will not have to endure red squiggle horrors.

    ***********

    What I object to, is Word listing several dozen "languages" that it does not provide any support for, or does not actually differentiate from other "languages". It actually confuses "languages" with "locales" in its UI as do other Microsoft products and many pages on http://www.microsoft.com .

  32. n4cer says:

    I can’t think of any other DBs currently.

    Contacts and Identity would be covered under user information.

    WinFS, of course, will allow access to different file types (to varying degrees depending on the ISV).

    Speech is also to be enhanced and made more usable for systemwide command/control in addition to dictation.

    http://longhorn.msdn.microsoft.com/lhsdk/speech/speechconcepts.aspx

    There’s also a standard command API so that localized keybindings and different input methods are automatically supported for standard Windows commands.

    http://longhorn.msdn.microsoft.com/lhsdk/ref/msavalon.windows.commands.aspx

  33. Mike says:

    Being a word processor, all emphases are added using by changing font display, bold, italics, etc.

    You can use Word styles to create a text style that is (for example) italicised and which has the skip-spell-checking flag on, or which is tagged as a different language.

    Word could implement options to ignore all capitalized words just as it has options to ignore words in CAPS etc. However people generally forget these are on and then they complain that Word doesn’t pick up their errors. Proofing tools are great for a quick check, but they are not the final word in document review.

    >I don’t think MS would get sued for offering a common dictionary for others to use. In fact, if they just opened up the Office dictionary, that’d be great.

    Microsoft licenses most of its proofing tools from other companies, so it does not have the rights to open them up in many cases. There are literally hundreds of lexicons involved: spelling, grammar, thesaurus, speech, handwriting, hyphenation etc multiplied by over three dozen languages.

    >You can access the Office spell checker using COM. I’ve done it, so I know it’s possible.

    Certainly, and you can do it as documented via VBA Word automation, but unless you have a special license you can’t use the API to access the contents directly (for the reason cited above).

  34. Marc Wallace says:

    A regular viewer suggests: "Better still, lets all learn to spell properly."

    Bravo!

    Spell checkers didn’t really exist even just a few decades ago… yet many books were published with only a few spelling and grammar errors.

    Today, with the prevalent dependence on the word processor to check both, I see many, many more errors.

    I did not even spell check on my resume. If there was a mistake, it was an honest one, and a reflection of my true writing style. (and yes, I make mistakes) I did, of course, ask people with good eyes to review it. ;-)

    Besides, a good third (if not more) of my writing is technical: pseudocode or documentation. I’m a programmer. Even with the formal quoting mechanisms I use, there’s no way to tell a word processor "these styles represent content in a separate language, don’t check them".

  35. Craig says:

    Marc wrote: "Besides, a good third (if not more) of my writing is technical: pseudocode or documentation. I’m a programmer. Even with the formal quoting mechanisms I use, there’s no way to tell a word processor "these styles represent content in a separate language, don’t check them". "

    In MS Word, I create paragraph styles for pseudocode. In the style definition dialog, I select ‘Language’ and set the ‘Do not check spelling or grammar’ option. This really cuts down on the false positives in my documents.

  36. Mat says:

    OK, have to vent about the EU crap.

    I can’t think of anything more competitive than media players. No one has a majority market position, and most users need to install multiple players to use all of the content they run into.

    And unless Real was the *only* player on the planet, I won’t install that trash. RN really need to figure that out.

    Yet here’s the EU, grabbing $617M so they can level the pitch. And line their pockets. At one time, this was called /extortion/.

    I really wish MS would simply stop selling and/or supporting nations involved with the EU. Can you imagine the grinding halt this would accomplish? Get your $617M worth out of it, Mr. Gates!

    Listen, I know MS hasn’t been perfect about this. But last time I checked, it’s the US that’s leading the innovation here. If the US gov’t is ok with how the media player sitch looks, why does the EU care?

    Oh ya — money.

  37. A regular viewer says:

    Remaining off-topic, It makes no difference, MS will still win. At least, so says http://www.pbs.org/cringely/pulpit/pulpit20040401.html

    The conclusion

    <quote>

    Justice is blind, slow, and unequal. What makes this possible is a legal system designed for the late 18th century and operated by a government that effectively believes that while antitrust matters to individuals and companies its effect on nations cancels out. Only it doesn’t if the companies involved are as big and powerful as Microsoft or Intel or Wal-Mart — companies of near-infinite resources and near-total fixation on executing a global strategy.

    There are only two ways for a society to address such taking advantage of a legal system. One way is to drag that legal system into the 21st century, which isn’t going to happen in America. The other way is to dramatically simplify the legal system along the lines of nomadic justice where there are no prisons nor even capability for collecting damages, so all correction comes down to death or maiming. That isn’t going to happen, either, so Microsoft wins.

    </quote>

  38. Mike says:

    Marc wrote: there’s no way to tell a word processor "these styles represent content in a separate language, don’t check them"

    Interesting claim when the post immediately before described how to do it. This is of course quite common on newsgroups, where someone posts a question just after the post with the answer. Everyone thinks their problem is unique I guess.

  39. Joku says:

    Is there a keyboard shortcut for toggling the Word spellchecking or language quickly when writing, without resorting to mouse at any point? Wouldn’t it be fuckin* cool if ‘language services’ had its own sort of Intellisense that would (optionally) popup that small window with different spellings of a word and their meaning etc while writing ;-)

  40. Martha says:

    "Is there a keyboard shortcut for toggling the Word spellchecking or language quickly when writing…?" Yes: use the afore-mentioned keyboard-layout switching mechanism. I.e. add a keyboard layout, and assign it a language that doesn’t have a spelling dictionary available. Assign your desired shortcut for switching keyboards (I use left Alt + Shift), and turn on the systray icon (so you know which language you’re using).

  41. Mike says:

    Wouldn’t it be fuckin* cool if ‘language services’ had its own sort of Intellisense that would (optionally) popup that small window with different spellings of a word and their meaning etc while writing ;-)

    The Research Pane in Word 2003 allows something similar. The pane can be set to a thesaurus or encyclopaedia or some other plug-in so that it displays relevant material for words in your document.

  42. Mike says:

    >Is there a keyboard shortcut for toggling the Word spellchecking or language quickly when writing, without resorting to mouse at any point?

    The answer depends on the effect you’re trying to achieve:

    * to hide the squiggles: Create a macro that toggles Tools > Options > Spelling & Grammar > Check Spelling as you type (and the grammar option).

    * to mark the upcoming text to skip spell-checking – either use Martha’s trick OR create a macro to set the language flag to skip spell-checking OR use a macro to switch to a different text style that has the skip spell-checking attribute OR use a variant of one of these ideas

Comments are closed.
