printf-style placeholders, #2

It's been a while, but in the last post, I showed what a printf format specifier is, how the localizer can spot it and what can happen if it gets messed up during localization. Today, I'll talk more about what the effect can be if specifiers change during localization and how developers can help ensure it doesn't happen.

What actually happens if you change a printf format specifier while localizing depends on exactly what the change is. Last time, I showed that it can cause an application to crash. Pretty bad, but that's not always the effect.

Take a look at https://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclib/html/_crt_format_specification_fields_.2d_.printf_and_wprintf_functions.asp. This page lists all the data types that printf supports. You can divide them into four groups - int, double, pointer and string. If you accidentally substitute one type of int specifier for another, the effect isn't dramatic - the only difference is how the value is displayed. Imagine for instance this piece of code:
char message[50] = "Please enter a value between %d and %d.";
printf(message, 100, 400);

This would print out the following to the console:
Please enter a value between 100 and 400.

Now, imagine that the message is localized into the following, where the format specifiers are using a different data type:
char message[50] = "Please enter a value between %x and %x.";
printf(message, 100, 400);

This would now print the following instead:
Please enter a value between 64 and 190.

As you can see, the message displayed to the user is misleading - the values are printed as hexadecimal numbers instead of decimal integers. Not great, but hey, at least the application didn't crash.
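
If you'd like to try the same-group substitution yourself, here is a minimal sketch. The helper name format_range is made up for this illustration; it formats into a caller-supplied buffer with snprintf so the two results can be compared without reading console output.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper for this illustration only: formats two ints using
   a message that is assumed to contain exactly two int-group specifiers
   (such as %d or %x). Returns whatever snprintf returns, i.e. the length
   of the formatted text. */
int format_range(char *buf, size_t size, const char *message, int low, int high)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, message, low, high);
}
```

Calling it with the %d message and the values 100 and 400 fills the buffer with the decimal text shown earlier; calling it with the %x message yields the hexadecimal variant, 64 and 190.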

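Of course, as we saw last time, if you substitute across groups, you're a fair bit more likely to crash the application. The code below passes an int where printf expects a pointer, which is undefined behavior in C. With %p you may only get a garbage value printed, but a %s in that position would almost certainly get the process to go belly-up:
char message[50] = "Please enter a value between %d and %p.";
printf(message, 100, 400);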

By now I think we can agree that mismatches are generally bad. Intentional mismatches are very rare, although there are some cases where a placeholder might be dropped. German wizards, for instance, don't say "Welcome to the [wizard name] wizard", they simply say "Willkommen". And the Help menu in Dutch Windows would just say "Info" instead of "About [application name]". Apart from these rare exceptions, mismatches should just not occur. So why am I making a big deal about it, and why am I even writing this?

Well, it turns out unintentional mismatches aren't uncommon enough. There are several ways mistakes can appear during the localization process - maybe you're auto-translating from poor sources or you allow less-than-perfect matches when auto-translating. Maybe the localizer simply mistypes a placeholder. Maybe a translation memory application doesn't understand placeholders. Maybe the localizer isn't experienced enough to understand what the placeholder is. This is especially likely if the placeholder looks slightly unusual, such as in "%s's Documents".

Another very common cause is that the localizer switches the placeholder order for linguistic reasons. Their need to do so may be genuine, but unfortunately it may lead to bad bugs. For instance, the string "Property (%s) has Value (%d), which is out of the legal range for this property." might become "La valeur (%d) de la propriété (%s) est en dehors des valeurs possibles pour cette propriété." in French. Language-wise it might be splendid. Functionality-wise, not as great: the code still passes the arguments in the original order, so %d is now paired with the string and %s with the integer.

So, if you're authoring strings that will need to be localized, please keep these risks in mind. You have the power to prevent a lot of issues, simply by making the string localization-friendly from the start. The more placeholders you include in a sentence, the greater the risk that it'll break for some language. Consider how you can bulletproof the string from the start - maybe change the example above to "The property %s has an invalid value. Value: %d".

In my team, we typically treat any format specifier mismatch as a high severity bug. The cause and the potential effects are well understood, as are the risks of taking such a fix. During the localization process, we run checks that scan through all resources and compare the source and the translation to find any mismatched printf format specifiers. Any unintended mismatch will be fixed before release.
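
To give a feel for what such a check can look like, here is a rough sketch. It is not the tool my team runs; the names next_spec and specifiers_match are invented for this example. It assumes plain C-style specifiers only and does not understand vendor-specific extensions or positional arguments. It walks the source and the translation side by side and reports a mismatch as soon as the specifiers differ in type, order or count.

```c
#include <stdio.h>
#include <string.h>

#define MAX_SPEC 32

/* Copies the next conversion specification in *cursor (for example "%d"
   or "%-8.3lf") into out and advances *cursor past it. "%%" is skipped
   because it consumes no argument. Returns 1 if a specification was
   found, 0 if the rest of the string contains none. */
int next_spec(const char **cursor, char *out, size_t out_size)
{
    const char *p = *cursor;
    while ((p = strchr(p, '%')) != NULL) {
        if (p[1] == '%') {              /* literal percent sign */
            p += 2;
            continue;
        }
        size_t len = 1;
        /* flags, width, precision and length modifiers */
        while (p[len] != '\0' && strchr("-+ #0123456789.*hlLjzt", p[len]) != NULL)
            len++;
        if (p[len] != '\0')
            len++;                      /* the conversion character itself */
        size_t copy = len < out_size ? len : out_size - 1;
        memcpy(out, p, copy);
        out[copy] = '\0';
        *cursor = p + len;
        return 1;
    }
    return 0;
}

/* Returns 1 if both strings contain the same specifications in the same
   order, 0 otherwise. */
int specifiers_match(const char *source, const char *translation)
{
    char a[MAX_SPEC], b[MAX_SPEC];
    for (;;) {
        int has_a = next_spec(&source, a, sizeof a);
        int has_b = next_spec(&translation, b, sizeof b);
        if (!has_a || !has_b)
            return has_a == has_b;      /* both must run out together */
        if (strcmp(a, b) != 0)
            return 0;
    }
}
```

With this sketch, the %d to %x substitution and the reordered French string from earlier both come back as mismatches. So does an intentional drop like "Willkommen", which is why a human still needs to review what the check flags.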

That's it for printf. Next time I'll step it up a bit with the FormatMessage function.


This posting is provided "AS IS" with no warranties, and confers no rights.