Localization & Consistency

Our localization tool has a feature for performing inconsistency checks. It's not so good.

It works roughly like this - it enumerates all English strings, and for each string it checks if there's more than one translation. If so - there's an inconsistency. It is then my job to go through this list of inconsistent translations, find out which are by design, and fix the rest by picking the best translation.
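To make the shape of that check concrete, here is a minimal sketch in Python. It is not the tool's actual code; the function name and the (English, translation) pair format are assumptions made for illustration only.

```python
from collections import defaultdict


def find_inconsistencies(pairs):
    """Return every English string that has more than one distinct translation.

    `pairs` is an iterable of (english, translation) tuples. This input
    format is an assumption for the sketch, not the real tool's data model.
    """
    translations = defaultdict(set)
    for english, translation in pairs:
        translations[english].add(translation)
    # Keep only the English strings with two or more distinct translations.
    return {
        english: sorted(found)
        for english, found in translations.items()
        if len(found) > 1
    }
```

Note that nothing in this check knows why two translations differ, which is exactly why the resulting list has to be reviewed by hand.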

The problem with this approach is that most of the short strings that are inconsistent are so for a reason - something about grammar or semantics. And most of the long strings that are inconsistent, well, who cares? As long as the right message is conveyed, it typically doesn't matter.

To make matters worse, blindly consistifying strings will cause bugs in Windows. For example, to support upgrades we have plenty of resources that represent legacy time zones, folder names and Start menu entries. If I give these consistent translations, all kinds of things will break.

This whole way of doing inconsistency checks is flawed. It's focusing on what's easy to find instead of what's important to find. So, to work around this we have a random set of other scripts & tools, each focusing on their own aspect of consistency.

For instance, certain files often have references to other strings within the same file. ADM files (group policy descriptions), MFL files (descriptions of classes, methods, properties etc. you can use through WMI) and Performance Counters are good examples of this. Many policy descriptions contain the text "see also the setting X", where X is its own string somewhere else in the same file. So - one classy localizer wrote a script that enumerates all resources in one of these files, sorts them by length, and for each string checks 1) whether the English text of the shorter string is contained in any longer string and 2) if so, whether the translation of the shorter string can be found in the translation of the longer string. If 1 is true but 2 isn't, then we probably have an inconsistency. You will see some false positives, but not too many. You might miss out on some references though, if something is misspelled in the source.
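I don't have that script at hand, but the idea can be sketched in a few lines of Python. Everything here (the function name, the pair format, exact case-sensitive matching) is an assumption for illustration, not the original implementation.

```python
def find_broken_references(resources):
    """Flag short strings whose translation is missing from a longer string.

    `resources` is a list of (english, translation) tuples from ONE file
    (an assumed format). Returns (short_english, long_english) pairs where
    the English text is contained but the translation is not.
    """
    # Sort by English length so each string is only compared to longer ones.
    by_length = sorted(resources, key=lambda r: len(r[0]))
    suspects = []
    for i, (short_en, short_tr) in enumerate(by_length):
        if not short_en.strip():
            continue  # an empty string is contained in everything
        for long_en, long_tr in by_length[i + 1:]:
            if long_en == short_en:
                continue  # a duplicate string, not a reference
            # Check 1: English contained. Check 2: translation contained.
            if short_en in long_en and short_tr not in long_tr:
                suspects.append((short_en, long_en))
    return suspects
```

The comparison is quadratic in the number of strings, which is fine for a single file and is one more reason it does not scale to a whole product.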

This check is pretty handy, but it doesn't scale over an entire product. Even though you have similar strings elsewhere ("Open Control Panel, click Add or Remove Programs, and then click Add/Remove Windows Components"), running the same check across all files gives way too many false positives. What's the solution here? Well, so far we've been relying very much on each localizer knowing their product well enough to know what Control Panel is actually called...

This isn't perfect, granted. My memory is poor, and my spelling is weak. One approach that can be better is to maintain a terminology database which says "Control Panel" should always be translated into "Kontrollpanelen" or "My computer" can never be translated into "Denna datorn". You then iterate through all strings in a project, check all of these rules, and log the items that fail. This scales nicely, as you can cover far more than just UI references and it doesn't cost too much to maintain (unless you change terminology a lot). There are some problems though; you need to figure out what to do with false positives for those cases where you've deliberately broken a rule. Ideally you don't want to see them again. Also, I'm not sure that this will scale for all languages -- if you have a high number of possible translations for a term depending on context, it might be hard to write good rules. And of course, it's only as good as the source data -- if "Cnotrol Panel" is misspelled, your rules might not help. Still, it's a good deal better than nothing.
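A rough sketch of such a rule check in Python might look like this. The rule format, the resource IDs and the suppression set (one possible way to avoid seeing deliberate exceptions again) are all assumptions for illustration, as is the plain case-insensitive substring matching.

```python
def check_terminology(strings, required, forbidden, suppressed=frozenset()):
    """Check translated strings against simple terminology rules.

    strings:    iterable of (resource_id, english, translation) tuples
    required:   {english_term: translation that must appear}
    forbidden:  {english_term: translation that must never appear}
    suppressed: set of (resource_id, english_term) pairs already reviewed
                and accepted as deliberate exceptions
    All of these shapes are hypothetical.
    """
    failures = []
    for res_id, english, translation in strings:
        en, tr = english.lower(), translation.lower()
        for term, expected in required.items():
            if (res_id, term) in suppressed:
                continue
            if term.lower() in en and expected.lower() not in tr:
                failures.append((res_id, term, "required translation missing"))
        for term, banned in forbidden.items():
            if (res_id, term) in suppressed:
                continue
            if term.lower() in en and banned.lower() in tr:
                failures.append((res_id, term, "forbidden translation used"))
    return failures
```

Substring matching like this is about as naive as it gets; a language with heavy inflection would probably need smarter rules, which is the scaling worry mentioned above.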

Another approach is to hunt down, for each such string, which resource it actually points back to, document this somewhere handy, and then run a cross check to make sure the references are correct. What's good is that as long as the references are correct, it will give a very accurate result. What's bad is that this has a higher cost to maintain; finding and keeping references correct is expensive in the long run. Also, it won't necessarily help with general terminology management the way the terminology check above can. Because of the drawbacks, we mostly use this for strings that have to match for functional reasons.
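Once the references are documented, the cross check itself is simple. Here is a minimal sketch in Python, assuming the references are stored as (referring ID, target ID) pairs and the translations as a dictionary keyed by resource ID; both formats and all names are made up for the example.

```python
def check_documented_references(translations, references):
    """Verify that each referring string contains its target's translation.

    translations: {resource_id: translated text}
    references:   list of (referrer_id, target_id) pairs, maintained by hand
    Both structures are hypothetical.
    """
    problems = []
    for referrer, target in references:
        if referrer not in translations or target not in translations:
            # A stale reference: one of the resources no longer exists.
            problems.append((referrer, target, "unknown resource ID"))
        elif translations[target] not in translations[referrer]:
            problems.append((referrer, target, "target translation not found"))
    return problems
```

The "unknown resource ID" case is where the maintenance cost shows up: every renamed or removed resource means the reference list has to be updated too.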

What I'd like to see some day is for the developers to solve the UI reference problem for me. Instead of having strings contain the names of UI elements, look up the actual name used at runtime. It's already done for things where the name is unknown until runtime ("File %s not found"), so it can of course be done for other strings too. That could be handy for MUI/LIP scenarios too, where only a part of the software is translated. ...I don't think this will happen any time soon though...
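Just to illustrate what I mean, here is a toy sketch in Python. The {ref:ID} placeholder syntax and the resource IDs are entirely invented; nothing like this exists in the product as far as I know.

```python
import re

# Hypothetical placeholder syntax: {ref:RESOURCE_ID}
_REF = re.compile(r"\{ref:([A-Za-z0-9_]+)\}")


def resolve(resource_id, table, _seen=()):
    """Return the text of a resource with all {ref:...} placeholders expanded.

    `table` maps resource IDs to the strings actually loaded at runtime,
    so a reference always shows the name the user really sees.
    """
    if resource_id in _seen:
        raise ValueError(f"circular reference via {resource_id}")
    text = table[resource_id]
    return _REF.sub(
        lambda m: resolve(m.group(1), table, _seen + (resource_id,)),
        text,
    )


table = {
    "IDS_CONTROL_PANEL": "Kontrollpanelen",
    "IDS_HELP": "Öppna {ref:IDS_CONTROL_PANEL}.",
}
print(resolve("IDS_HELP", table))
```

If the Control Panel name happened to be untranslated in a partially localized build, the help text would pick up the English name automatically, which is the MUI/LIP benefit mentioned above.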