Recycling: The problem with auto-translation


Benefits of auto-translation

When we start localizing a new version of Windows, we obviously don’t start from scratch. Instead, we try to recycle as much as we can from previous versions and other projects.

There are several benefits of recycling:
– You can get better consistency in your localized product
– If you’re outsourcing the localization work, you can avoid paying for work that has already been done
– If you’re me, there’s only so many times you can translate “Click next to continue” before you go bonkers…

Recycling is done in several different ways, and one of those ways is through auto-translation. Auto-translation works something like this: For each string that needs to be localized in the new product, a tool tries to find a matching string in a set of glossaries. If a match is found, the translation in the glossary is copied into the new product. As you can see, auto-translation is not the same as machine translation; it’s simply a way to reuse previous translation work.
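As a rough illustration, the lookup could be sketched like this in Python. The glossary layout (a plain dict from English string to translation), the function name and the sample entries are all assumptions made for the example; this is not how any particular localization tool works.

```python
from typing import Dict, List, Optional


def auto_translate(
    strings: List[str], glossaries: List[Dict[str, str]]
) -> Dict[str, Optional[str]]:
    """Copy a translation from the first glossary with an exact match."""
    result: Dict[str, Optional[str]] = {}
    for source in strings:
        result[source] = None  # None means "still needs a human translator"
        for glossary in glossaries:
            if source in glossary:
                result[source] = glossary[source]
                break
    return result


# Hypothetical glossary from a previous version (English -> Swedish).
previous_version = {"Cancel": "Avbryt", "Back": "Bakåt"}
new_strings = ["Cancel", "Back", "Activate Windows"]
translated = auto_translate(new_strings, [previous_version])
```

Strings that come back as None are the ones left for manual translation. Note that nothing is generated here: a translation is only ever copied from earlier work.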

In addition to the benefits above, auto-translation brings some more to the table. Auto-translation works well for recycling across files/projects — you can for instance use our glossaries as a base for auto-translation. Also, auto-translation can be, well, automated. It’s therefore quite tempting to try and auto-translate every project from all kinds of sources, to get as much “for free” as possible.

Bad idea. There are problems with auto-translation.

Problems

The first problem is ambiguous sources. Whatever glossary you use, it’s likely that it contains inconsistent translations. Some of these will be intentional, some not. How will the auto-translation algorithm know which item to pick if several possible matches are found? Will it pick the most common translation? Will it skip this item? And after auto-translation is done, how will the result be reviewed? If the algorithm just picks one item, will the reviewer have to look up other possible translations? Doing so undermines the cost savings of auto-translating. If the algorithm skipped a string, how will the reviewer know which of the remaining strings could actually have been found in the glossaries? If the reviewer doesn’t know that a term was skipped because there were two inconsistent, but equally valid, translations, how can you avoid introducing a third inconsistency to the mix?
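One way to keep the reviewer informed is to have the tool report ambiguous matches separately instead of silently picking or skipping. The sketch below assumes a glossary given as (source, translation) pairs; the function names and the report layout are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_index(entries: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect every distinct translation seen for each source string."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for source, target in entries:
        index[source].add(target)
    return dict(index)


def classify(strings: List[str], index: Dict[str, Set[str]]) -> dict:
    """Sort strings into translated, ambiguous and unmatched buckets."""
    report = {"translated": {}, "ambiguous": {}, "unmatched": []}
    for source in strings:
        candidates = index.get(source, set())
        if len(candidates) == 1:
            report["translated"][source] = next(iter(candidates))
        elif candidates:
            # Keep all candidates so the reviewer sees why nothing was picked.
            report["ambiguous"][source] = sorted(candidates)
        else:
            report["unmatched"].append(source)
    return report


# Sample glossary entries; a real glossary would be loaded from a file.
entries = [
    ("Space", "Rymd"),
    ("Space", "Blanksteg"),
    ("Traditional", "Traditionell"),
]
report = classify(["Space", "Traditional", "Volume"], build_index(entries))
```

With this split, the reviewer can tell a string that was never in the glossary from one that was skipped because the glossary disagrees with itself.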

The second problem is context. Regardless of the glossaries you use, your project probably contains strings that can have more than one translation depending on context. Is “volume” related to sound or to disks? Does “female” refer to the gender of the user or to what a plug looks like? Even if your glossaries only contain “known good” translations, you can’t predict how new strings are being used. Also, it’s likely that some UI elements need different translations depending on what type of control they’re being displayed in.

The third problem is accuracy. Your auto-translation tool may have settings that affect the result, such as how close a match needs to be, whether capitalization is important, whether decorations such as ellipses and hotkeys are stripped out when matching, and whether the match should take resource type into account. If your auto-translation tool assigns hotkeys arbitrarily, you can expect to have duplicates all over the place after auto-translating. This will take time to clean up. And if your tool has problems handling nulls, you can even end up seeing random crashes in your localized build.
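To show what such settings might do, here is a small Python sketch of decoration stripping and a duplicate-hotkey check. It assumes the Win32 resource convention where "&" marks the hotkey character (HTML AccessKey values are stored differently). The function names and the sample labels are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional


def normalize(text: str, ignore_case: bool = True) -> str:
    """Reduce a UI string to a form suitable for glossary matching."""
    # Drop hotkey markers: "&Browse" -> "Browse"; "&&" becomes a literal "&".
    text = re.sub(r"&(.)", r"\1", text)
    # Drop a trailing ellipsis, written as three dots or as one character.
    text = re.sub(r"(\.\.\.|…)$", "", text.strip())
    return text.casefold() if ignore_case else text


def hotkey_of(label: str) -> Optional[str]:
    """Return the hotkey character of a label, or None if it has none."""
    match = re.search(r"&([^&])", label.replace("&&", ""))
    return match.group(1).casefold() if match else None


def find_duplicate_hotkeys(labels: List[str]) -> Dict[str, List[str]]:
    """Group labels that share a hotkey within one dialog."""
    by_key: Dict[str, List[str]] = defaultdict(list)
    for label in labels:
        key = hotkey_of(label)
        if key is not None:
            by_key[key].append(label)
    return {key: group for key, group in by_key.items() if len(group) > 1}


# Hypothetical labels from one translated dialog.
dialog_labels = ["&Avbryt", "&Aktivera", "&Bakåt"]
duplicates = find_duplicate_hotkeys(dialog_labels)
```

Running a check like `find_duplicate_hotkeys` on every dialog after auto-translating is cheap, and it catches clashes before a tester has to.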

Examples

Here’s an example that’s pretty embarrassing. This is from Swedish XP, the dialog where you can activate your installation over the phone.

This was not caused by me not knowing my ABC. This was caused by auto-translation. The problem is that we have a lot of HTML pages with hotkeys. HTML hotkeys are created by using the AccessKey attribute on e.g. a button. The value of this attribute is included in my project so that I can change the hotkey to a character that’s actually used in my translation. Unfortunately, this string “F” was auto-translated into the hotkey for some other element. Less than stellar.

Here’s another ugly one that was probably caused by auto-translation. This dialog box is taken from Swedish Windows 2000, from the Regional Settings control panel applet. When you use the Spanish locale, you can pick a sort order:

In this dialog box, “Traditional” has been translated into “Traditionell”, which is correct. “International” has been translated into “International phone calls”. Whoops.

Finally, here’s a classic mistake from Windows 98, something I was recently reminded of in a Swedish forum.

The theme “Space” has mostly been correctly translated as “Rymd” (outer space). Except of course for where it got translated into “Blanksteg” – the space key…

Tips

To try and avoid issues like these, I recommend a few things:

Don’t auto-translate short strings. In my experience, short strings make up the bulk of the resource count, but only a small part of the overall word count. The longer a string is, the more accurate auto-translation will be (context and semantics are less of an issue). Therefore, simply do not auto-translate strings that are shorter than, say, 20 characters. Sure, there will be a bit more to localize, but you’ll save time on reviewing auto-translation and you run less risk of overlooking mistakes.
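A length cutoff like this is easy to apply as a pre-filter before the glossary lookup. A minimal sketch in Python, where the function name and sample strings are invented for the example and the threshold of 20 is only the ballpark figure from the paragraph above:

```python
from typing import List, Tuple

MIN_LENGTH = 20  # ballpark threshold; tune it for your own project


def split_by_length(
    strings: List[str], min_length: int = MIN_LENGTH
) -> Tuple[List[str], List[str]]:
    """Separate strings long enough to auto-translate from the rest."""
    auto: List[str] = []
    manual: List[str] = []
    for text in strings:
        if len(text) >= min_length:
            auto.append(text)
        else:
            manual.append(text)  # too short: context matters too much
    return auto, manual


auto, manual = split_by_length(["Space", "Volume", "Click Next to continue"])
```

Only the strings in `auto` would be passed on to the glossary lookup. The short ones in `manual` go straight to a human translator.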

Maintain a standard UI glossary. If you have a glossary with the standard strings, like “Browse…”, “Back” or “Add…” and always use this to auto-translate UI only, you can save yourself a lot of time reviewing strings later.

Use cleaner glossaries. If you control the glossary you use as a source, then you can exclude ambiguous strings. It’s not always easy to predict what’s ambiguous though, and the glossaries might not be under your control.

Take time to review auto-translations, and make sure to have native speakers test your product to uncover linguistic mistakes. If you’re localizing software and UA (user assistance, such as help files) separately, it might be worth letting your UA people hack away at the software for a few days before they start on UA. This can help bring the two teams closer, it can help you find a lot of bugs that would otherwise impact UA, and it can also give those who will localize your help files a chance to get to know the product up front. Result: everyone’s happy.

Finally, please please please make time to fix up old mistakes before shipping the next version. Your customers will love you for it.

Comments (12)

  1. Joe says:

    Reminds me of a French calendar I saw once:

    janvier

    février

    mars

    avril

    peut

    juin

  2. The Blanksteg screensaver is a real classic 🙂

  3. Hey, you should see the examples I didn’t include… 😉

  4. Andreas says:

    I installed NIS2004 for a customer today, and noticed this nice box "Do not visa den här dialogrutan igen". Swenglish. 😉

  5. Great blog Jesper. I look forward to hearing more from you.

    We have just started translating our application, and one of the major "challenges" that we see in the future is with recycling as much as we can for future versions of our software. We haven’t looked into auto-translation yet, but we are keeping it in mind.

    I have a few issues that I would very much like your comment on. Perhaps as a topic for a future blog.

    We have been looking into using Winres.exe to translate our Windows forms. But we are very concerned as to how to handle future changes to our forms. If we, say, add a new button to an existing form, we would like to just translate this new button, and not the entire form again. Is that possible with Winres? How do you handle these problems at Microsoft? Do you translate the entire form again?

    As far as I can see, Visual Studio .NET can handle translations of new versions of the form, but then we would have to ship our source code to the translators, and we are not interested in doing that.

    Do you know of any other tools that can handle this issue with recycling as much as possible from old forms?

    Regards

    Anders K. Olsen

    Denmark

  6. I don’t think winres helps you with recycling or managing update cycles at all. Winres is great for previewing and sizing dialogs, but I’d be hard pressed to consider it a full localization tool. There’s simply too much missing. (That said, it can make a great addition to a localization tool.)

    We obviously don’t hold off localization until the product is ready, since we want to ship all languages as soon as possible. Since we start localizing before the product is done, we see changes like what you describe – a button added or removed, a label moved or resized, various text changes… How to handle this, well it’s very much up to the tool you use.

    The tool we use internally works pretty well for this type of churn. It can distinguish between resources that have been translated and resources that have been sized; it can keep track of what checks I’ve run on resources (spell check, terminology review etc); and when I update to a new build level, it can show me exactly which resources have changed on the English side so that I can make the same changes in the Swedish resources. This way I rarely have to translate the same form twice. Typically I only see the delta. Even if all resource IDs in a file were to change, we can recover the translation work, although I might have to resort to auto-translation.

    There are still some holes in the process — e.g. it could be easier to ensure that only "approved" resources are changed by the localizer (this is important for Service Packs, or when you’re close to release) — but overall it works well.

    I don’t know how much you’re willing to invest in the engineering process, but here’s one possibility to solve change management: Say that your localizers work on resx files. If you save source resx files somewhere handy every time you send out an update, you could write a tool to diff these resx files and produce a log. The same tool could take the changes in the source files and merge them into the translated files as appropriate (this would take care of adding new controls, removing gone controls, and possible "pre-size").

    You then send the "hacked" translated files out to your localizers, together with the log file that describes exactly which resources have changed. Your localizer changes text as appropriate, sizes again, etc. and sends back the files to you.

    When receiving files, you can diff these translated files with the ones you received last. The changes should roughly match the changes you saw when diffing the source files.

    What’s cool is that the resource manager in .NET Framework makes this entirely doable. What’s uncool is that you’d have to create this yourself 🙂

    As for other tools, I really don’t know of any. The only localization tool I know well is the one I use at work, and unfortunately that’s not freely available. In the future though, I hope that we can do more to support .NET localization. In fact, I’ve been toying with the idea of writing a li’l loc tool in my spare time and putting it on Gotdotnet for anyone to use. If I actually get any spare time to do this, I promise I’ll blog about it.

  7. Thanks for the answer Jesper.

    We had been considering something along the lines of what you suggested. We also know that winres is not a full localization tool, but since our project is nowhere near the size of Windows or Office, I think we will start off with winres. At least we can get some experience with localization, and get some idea of what features we may need if we buy a full localization tool in the future.

    Since I started looking into localization, I have also been toying with the idea of writing my own tool. So if you ever get around to writing your tool, let me know if you need some feedback or help (akol_dk@hotmail.com).

    Regards

    Anders

  8. Anders, sure I’ll keep you in the loop. I started doing some initial investigation this weekend, and it seems perfectly doable. Be warned though, I’m not trying to make anything that’s better than any commercial tool out there. I’m just trying to make something workable, that I can use as a reference when blogging about problems/ideas for localization. I’ll probably place it on gotdotnet for anyone to download and modify.

  9. Your description of a tool for reusing translations across minor changes (e.g., resizing only) reminded me of RLTools. It worked on PE32 resources (compiled .rc’s), and it did a decent job even on combo box initialization data (that other tools missed). Most versions were internal to Microsoft but one version did come out as a free unsupported tool (http://www.google.com/search?q=rltools+localization). The intermediate files that it created were easier to parse than rc files and that made it very useful as a building block for my homebrew rc handling tool.

    I’d be curious to know if there is some descendant of RLTools still being used/developed inside Microsoft.

  10. Eusebio, I can’t find the tool you mention. The closest I got was a reference to KB Article Q110894, but that one doesn’t exist any more. I also found a dead link to the download.

    I can say though that the tool we use happily works on compiled code — no RC files needed.