... a modest goal ...

07/14/2008

... or (for those who understand the cultural reference here): all your errors are not belong to us.

Well, all of us non-native speakers of English (myself, a native speaker of German, included) would like a tool that could just take what we write and turn it into grammatical and fluent English. Come on, how hard can it be?

Bear with me while I try to explain why it's anything BUT simple. Note that this will be a non-technical post: if you have some knowledge of natural language processing you'll be better off reading this technical paper.

Now, first of all, in order to be able to offer perfect correction, we need to have some computer understanding of human language. We require some magic algorithm that truly understands what you want to say in the first place, and then puts it into nice prose. But despite the claims you often find (mostly in commercial applications), no such thing as language understanding by computers currently exists. Not even for well-formed and well-structured English - unless you define "language understanding" as something that has little to do with the common use of the phrase.

So what about just targeting sentences with a single, simple error? Again, we're in very difficult territory. What is a mistake, how many types of mistakes are there, and how do you detect them? One frustrated user of our service observed that "This sentence lots of mistakes contains" does not trigger any suggestion. The mistake in this example is a mix-up in word order: the verb "contains" appears at the end of the sentence, but it should occur after the subject "this sentence". We could target this kind of mistake by looking for misplaced verbs and then checking whether the sentence gets "better" according to an automatic score if we move the verb. But why should we target this kind of mistake? This is an artificial example that has little to do with the errors that non-native speakers really make. As you can imagine, you have to be fairly selective about the errors you try to fix; otherwise just about any word-order change, word insertion, or word deletion, and any combination thereof, would need to be considered as a viable alternative for every sentence. Sorry, but not all errors can be dealt with. But some of them can, and here is how we try to deal with them.

First, we identified a list of common and typical errors that actually occur in written English produced by non-native speakers (starting with speakers whose native language is East Asian). We did that by reading through error analyses produced by other researchers, and by doing our own analysis of some real-life data.

Second, we investigated the different error types to see what kind of technical solution works best for each: rules, machine learning, or a mix of the two. We then designed different error modules geared towards the different errors.

Here are two examples to illustrate the different techniques and levels of complexity. Some non-native writers of English have trouble with English verb morphology. After all, if it is "I kicked the ball" it should also be "I hitted the ball", right? All we need to do to fix this type of error is to look up incorrectly inflected irregular verbs in a relatively small list, and suggest replacing them with the correct irregular form. A small rule is enough to detect this error and make a good suggestion.
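To make the idea concrete, here is a minimal sketch of such a lookup rule. It is only an illustration, not the actual module: the word list is a tiny made-up sample, and the names OVERREGULARIZED and suggest_verb_fixes are hypothetical.

```python
import re

# Hypothetical, tiny sample of wrongly regularized verb forms and their
# correct irregular forms. A real list would be longer, but still small.
OVERREGULARIZED = {
    "hitted": "hit",
    "goed": "went",
    "eated": "ate",
    "buyed": "bought",
    "teached": "taught",
}


def suggest_verb_fixes(sentence):
    """Return (wrong_form, suggestion) pairs for every listed error found."""
    suggestions = []
    for token in re.findall(r"[A-Za-z']+", sentence):
        fix = OVERREGULARIZED.get(token.lower())
        if fix is not None:
            suggestions.append((token, fix))
    return suggestions


print(suggest_verb_fixes("I hitted the ball"))
print(suggest_verb_fixes("I kicked the ball"))
```

The first call reports "hitted" with the suggestion "hit"; the second finds nothing, because "kicked" is a perfectly regular past tense and is not in the list.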

On the other hand, there are errors such as the use of determiners ("I am teacher from city" versus "I am a teacher from the city") and the use of prepositions ("in the other hand" versus "on the other hand"), where it becomes impossible to list all possible errors and corrections. For this type of error we decided to use machine-learning techniques. We feed the machine millions of sentences, along with the contexts in which determiners and prepositions occur, and let it figure out the patterns by itself. At the beginning of every noun phrase, the machine extracts several words and part-of-speech tags (verb, noun, adjective, etc.) to the right and to the left, and from these many millions of data points it produces statistical generalizations. For example, it learns that if you start a sentence with a preposition followed by "the other hand", that preposition is most likely to be "On". But if you enter "I hold an ace in the other hand", the probability shifts, and the preposition "in" becomes more likely. Statistical models like these have the very nice property that they can discover the patterns present in a large collection of text without being explicitly told what to look for. The downside is that they are only as good as the data they have seen. If confronted with unknown words, misspellings, and unusual language, they start to make mistakes. That's why you might sometimes see a suggestion appear or disappear when you make a seemingly innocuous and unrelated change in the sentence.
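As a rough illustration of how context can shift the prediction, here is a toy sketch that simply counts which preposition was seen with which neighboring words. It is an assumption-laden simplification: it uses only words (no part-of-speech tags), four training sentences instead of millions, plain vote counting instead of a real statistical classifier, and all names (PREPOSITIONS, context_features, train, predict) are made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical, very small set of prepositions to model.
PREPOSITIONS = {"in", "on", "at", "of", "for", "with"}


def context_features(tokens, i, window=2):
    """Words within `window` positions of index i, paired with their offset."""
    feats = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        word = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
        feats.append((offset, word))
    return feats


def train(sentences):
    """Count how often each preposition occurs with each context feature."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token.lower() in PREPOSITIONS:
                for feat in context_features(tokens, i):
                    counts[feat][token.lower()] += 1
    return counts


def predict(counts, tokens, i):
    """Return the preposition with the most votes from the context of slot i."""
    votes = Counter()
    for feat in context_features(tokens, i):
        votes.update(counts.get(feat, {}))
    return votes.most_common(1)[0][0] if votes else None


corpus = [
    "On the other hand it works",
    "On the other hand we failed",
    "I hold an ace in the other hand",
    "She keeps a coin in the other hand",
]
model = train(corpus)
print(predict(model, "In the other hand it rains".split(), 0))
print(predict(model, "I hold an ace on the other hand".split(), 4))
```

With this toy data, the sentence-initial slot gets "on" and the slot after "an ace" gets "in", because the surrounding words vote differently in the two positions. It also shows the downside: a context made up entirely of unseen words yields no votes at all.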

Finally, we know from proofing tool studies that there is nothing worse than flooding the user with incorrect suggestions. In order to cut back as much as possible on these so-called "false flags", we also use a language model as a filter. Think of a language model as a large table of words and word sequences, together with their counts or probabilities. A language model is "trained" on a large collection of sentences (containing billions of words in our case). When you present it with a new sentence, it can assign a "goodness" score to that sentence, based on all the words and word sequences it has seen in the training data. We show a suggestion to the user only if the language model scores the correction (much) higher than the original. Unfortunately, this also means that we sometimes suppress perfectly good suggestions just because the system is not quite sure enough.
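The filtering step can be sketched in a few lines. This is a toy stand-in, not the production model: it assumes a bigram table with add-one smoothing trained on three sentences, and the names BigramModel, should_show, and margin are invented for the example.

```python
import math
from collections import Counter


class BigramModel:
    """Toy language model: a table of word and word-pair counts."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def score(self, sentence):
        """Log-probability "goodness" score, with add-one smoothing."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def should_show(model, original, correction, margin=1.0):
    """Show a suggestion only if it beats the original by at least `margin`."""
    return model.score(correction) - model.score(original) > margin


lm = BigramModel([
    "on the other hand it works",
    "on the other hand we failed",
    "the book is on the table",
])
original = "in the other hand it works"
correction = "on the other hand it works"
print(should_show(lm, original, correction))
print(should_show(lm, original, correction, margin=5.0))
```

With the default margin the correction is shown; with a stricter margin of 5.0 the very same good correction is suppressed, which is exactly the trade-off described above. Note that a raw log-probability like this favors shorter sentences, so comparing an original and a correction of different lengths would need some form of length normalization.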

We can't do magic, but maybe we can still be of some help with a set of common errors that we non-native speakers frequently make. A modest goal, but it's a start.
