Questionable Character

In Vista, we're putting a huge emphasis on gauging and improving the functional, cosmetic and linguistic quality of the localized versions. We're using several different strategies to find and fix old mistakes and prevent new ones.

Right now, I'm playing around with extracting strings that contain unexpected characters. The idea goes something like this: if we know which characters are expected for a language (e.g. a-z for Dutch; a-zåäö for Swedish, as well as numbers and symbols), then we can scan all translations to find those that contain any character that is neither in the expected list nor in the source text (allowing characters from the source text lowers the number of false positives). We then log those strings and browse through them to see what the story is.
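
Here's a minimal sketch of that scan in Python. It is not the actual tool, just an illustration of the idea: the names unexpected_chars, scan and EXPECTED_RU are made up for this example, and the expected set for Russian (the 33 Cyrillic letters in both cases plus ASCII digits, punctuation and whitespace) is my own assumption about what "expected" might mean.

```python
import string

# Assumed expected set for Russian: а-я plus ё, in both cases,
# together with ASCII digits, punctuation and whitespace.
RUSSIAN_LOWER = "".join(chr(c) for c in range(ord("а"), ord("я") + 1)) + "ё"
EXPECTED_RU = set(
    RUSSIAN_LOWER
    + RUSSIAN_LOWER.upper()
    + string.digits
    + string.punctuation
    + string.whitespace
)


def unexpected_chars(source, translation, expected):
    """Return the sorted characters of `translation` that are neither
    in `expected` nor anywhere in `source`."""
    # Characters that already appear in the source text are tolerated,
    # which cuts down on false positives (product names, placeholders).
    allowed = expected | set(source)
    return sorted(set(translation) - allowed)


def scan(rows, expected):
    """Yield (resource_id, unexpected characters) for each suspicious row.

    `rows` is an iterable of (resource_id, source, translation) tuples;
    this layout is hypothetical.
    """
    for resource_id, source, translation in rows:
        found = unexpected_chars(source, translation, expected)
        if found:
            yield resource_id, found


if __name__ == "__main__":
    # Hypothetical rows, invented purely for illustration.
    rows = [
        ("IDS_OK", "Open file", "Открыть файл"),
        ("IDS_SUSPECT", "Network settings", "Параметры AppleTalk"),
    ]
    for resource_id, chars in scan(rows, EXPECTED_RU):
        print(resource_id, "".join(chars))
```

Note that the comparison is case-sensitive and works on individual characters, not words, so a Latin word in the translation is only flagged if it uses at least one character the source text doesn't.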

It's a simple idea, and as such it suffers from several flaws. For one, there's no built-in way in Windows to see whether a letter belongs to a certain language (for good reason), nor can you assume that the translation of a certain string should actually match the source text (localizing software is hard).

Still, I just tried out this simple idea on Russian, based on the most commonly used letters, and I did get decent results. Many false positives were logged (gots to tweak the code a little), but a few really nice ones were caught too. One example is a resource that's used when dumping your netsh configuration to a script:

English text:  # End of Remote Access NBF configuration.
Translation:   # Конец конфигурации удаленного доступа AppleTalk

The translation talks about AppleTalk where the English text says NBF. Hm, I wonder what caused this. Maybe fuzzy recycling, maybe manual copy/pasting, maybe something else...?

This is just one simple technique we're toying with, and something we'll probably only use for a few languages. I bet it's pointless to try and run this on Swedish, since the Swedish and English alphabets are so similar, and I bet it'll be very hard to come up with a list of expected characters for some Asian languages. Still, I'll give it a go on Greek and Turkish next, just to see what'll pop out...


This posting is provided "AS IS" with no warranties, and confers no rights.