Localization & misspellings

So Swedish XP SP2 has been available from the Download Center since Monday. On Monday I proudly announced the URL as soon as I saw that it was live. I felt really good about this release. I've spent a lot of time on it and tried hard to get the Swedish version to look good and read well.

Five hours later, Svante said (my translation): "First thing I see after installing XP SP2, rebooting and logging on is a spelling mistake!"

Indeed, at the bottom of the Security Center:

See how it says "sekretessspolicy" - three 's' in a row. Ouch. I don't feel as cocky anymore.

How did that happen? I've spell checked all new and changed resources at least twice. I've been running on SP2 since at least March. I've tried to view all UI at runtime, and I know I've looked at this dialog box probably a hundred times. People at Microsoft in Stockholm have run SP2, filed bugs on translations in the exact same dialog without noticing - and I have fixed those bugs without noticing this problem.

I guess one simply goes word blind after a while, looking at the same strings and the same dialogs time and again...

So what do I do now? I can try and get the string fixed in SP3, but I'm not sure I'll succeed. Also, doing so only addresses the symptom. I need to fix the underlying cause - prevent spelling mistakes from getting into the product at all, or at least catch them before they get into a build.

I'm not sure exactly how to do this yet, but here are a few things I've thought of over the last few days.

First problem, my uncoordinated fingers. I'm not hopeful about fixing this. I've tried changing, but I still write "anvnädare" instead of "användare" and "urringning" instead of "utringning"...

Second problem, spell checking several hundred thousand words is error prone and tedious.
Right now I spell check like so:
1) First copy all the strings I want to spell check into Word.
2) Then search and replace to remove a lot of gunk - like change "\r\n" to "^l", change "\t" to "^t", get rid of HTML markup etc.
3) Start spell checking.
4) For any error found, fix in my localization tool.

This is tedious as there's no way I can remove all the gunk I should. Because of this Word stumbles on a lot of things that are OK, and so it's easy to overlook an actual misspelling.

For my next project I'll try a few things -
Create a script that dumps out all strings into a text file, cleans up by removing as much gunk as possible, and writes out just a list of unique words. I'll then start by spell checking only this list. This should cut down the number of words I need to spell check initially. Also, if I'm clever, I can make it remember which of the individual words were false positives and which were genuine misspellings. Next time around I can then exclude the false positives from the word list, and I can find the known misspellings without even having to fire up Word.
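A rough sketch of what such a script might look like, in Python. Everything here is an assumption made for illustration: the function names (clean, unique_words, triage), the idea that the strings arrive as a plain list, and the very crude notion of "gunk" (HTML-style tags and literal \r, \n, \t sequences). A real localization tool's export format would need its own cleanup rules.

```python
import re
import unicodedata

# Hypothetical cleanup rules; real resource files will need more than this.
TAG_RE = re.compile(r"<[^>]+>")      # crude HTML-style tag matcher
ESCAPE_RE = re.compile(r"\\[rnt]")   # literal \r, \n, \t sequences in the text
WORD_RE = re.compile(r"[^\W\d_]+")   # runs of letters, including å, ä, ö


def clean(text):
    """Strip markup and escape sequences, replacing them with spaces."""
    text = unicodedata.normalize("NFC", text)  # keep ä as one character
    text = TAG_RE.sub(" ", text)
    return ESCAPE_RE.sub(" ", text)


def unique_words(strings):
    """Return the set of distinct lower-cased words found in the strings."""
    words = set()
    for s in strings:
        words.update(w.lower() for w in WORD_RE.findall(clean(s)))
    return words


def triage(strings, known_good, known_bad):
    """Split the words into known misspellings and words not yet reviewed.

    known_good: words already accepted (including earlier false positives).
    known_bad:  words already identified as genuine misspellings.
    """
    words = unique_words(strings)
    misspelled = sorted(words & known_bad)
    unchecked = sorted(words - known_good - known_bad)
    return misspelled, unchecked
```

For example, calling triage on the two strings r"<b>Visa</b> sekretessspolicy\r\n" and r"Visa användare\tNamn", with {"visa", "namn"} as known good and {"sekretessspolicy"} as known bad, reports "sekretessspolicy" as a known misspelling and leaves only "användare" to be spell checked by hand.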

Another approach is to scan this word list for illegal character combinations. For instance, there's no Swedish word with three 's' in a row. If I had done this during SP2, I would have caught the error Svante found. The only problem with this kind of rule is that it'll give false positives, but I could probably make provisions for that. (Related to this kind of test is scanning for sentences that start with two capital letters, words that occur twice in a row and other such easy-to-make mistakes.)
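These rule-based checks could be sketched with a handful of regular expressions, again in Python. The rule names, the check function and the "allowed" exception list are all made up for illustration. The triple-letter rule also generalizes from "three 's' in a row" to any letter repeated three times, which is an assumption and will need exceptions (things like "www").

```python
import re

# Each rule is (name, compiled pattern). All of these are illustrative guesses.
RULES = [
    # Same letter three times in a row, e.g. the "sss" in "sekretessspolicy".
    ("triple-letter", re.compile(r"([^\W\d_])\1\1", re.IGNORECASE)),
    # The same word twice in a row, separated only by whitespace.
    ("doubled-word", re.compile(r"\b([^\W\d_]+)\s+\1\b", re.IGNORECASE)),
    # Two capitals followed by a lower-case letter at the start of a line
    # or right after sentence-ending punctuation.
    ("two-capitals",
     re.compile(r"(?:^|(?<=[.!?]\s))[A-ZÅÄÖ]{2}[a-zåäö]", re.MULTILINE)),
]


def check(text, allowed=frozenset()):
    """Return (rule name, matched text) pairs for every suspicious spot.

    allowed holds lower-cased matches that have been reviewed and accepted
    as false positives, so they are not reported again.
    """
    findings = []
    for name, pattern in RULES:
        for match in pattern.finditer(text):
            hit = match.group(0)
            if hit.lower() not in allowed:
                findings.append((name, hit))
    return findings
```

Run against "Visa sekretessspolicy" this flags the "sss", and a string like "www.example.com" is flagged too until "www" is added to the allowed set, which is one way of making provisions for the false positives.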

A variation on the word list script would be to create a sentence list. This would allow me to benefit from the grammar check in Word as well, and coupled with a known good/known bad list could help us improve consistency on a sentence level.
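The sentence list could be produced in much the same way. This is only a sketch: the function name unique_sentences is invented, and splitting on ".", "!" or "?" followed by whitespace is a naive assumption that will stumble on abbreviations and on strings without end punctuation.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def unique_sentences(strings):
    """Return the distinct sentences in first-seen order."""
    seen = {}
    for s in strings:
        for part in SENTENCE_END_RE.split(s.strip()):
            sentence = " ".join(part.split())  # collapse runs of whitespace
            if sentence:
                seen.setdefault(sentence, None)
    return list(seen)
```

Given "Klicka på OK.  Starta om datorn." and "Starta om datorn.", this yields two sentences rather than three, so each one only has to be reviewed (and marked known good or known bad) once.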

Third problem is that I looked at the same dialog a hundred times without seeing the misspelling. Again, I'm not sure I can fix my eyes. I guess we need to look more into getting more people involved in running the builds before release. There are beta programs for some languages, but they typically don't give much linguistic feedback. I suppose that could be sorted though, if we managed to give builds to the right people.

Then again, it could be that I'm overstating the problem just because this one misspelling happens to be in such a visible place. I know that we've improved dramatically since NT4 and Win9x. But the only way to know how bad the situation is, is to try and find out what else I've overlooked...

I've got to think some more about this topic. I'll be back...