Keep your eye on the code page, practical exam


The instructions that came with my new monitor are printed in several languages. One of those languages is Polish, or at least it would be Polish if... well, you'll see.

UWAGA: Szczegó³owe informacje dla u¿ytkownika znajduj¹ siê na do³¹czonej p³ycie CD.

This is garbage. What they meant to write was

UWAGA: Szczegółowe informacje dla użytkownika znajdują się na dołączonej płycie CD.

(It means, "Note: Detailed user information is included on the CD.")

What went wrong? Why did all but one of the accented characters come out corrupted?

The company responsible for the instructions failed to keep their eye on the code page. The text provided to them by their Polish translator was in code page 1250 (Central Europe). If you hold up that code page diagram next to code page 1252 (Latin I), you can see where the problem is. The company took that string in code page 1250 and printed it in code page 1252. For example, the "ł" character is at position B3 in code page 1250; the character at that position in code page 1252 is "³".
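
Here is a minimal sketch of that mix-up. It assumes Python's built-in cp1250 and cp1252 codecs are a fair stand-in for the two code page diagrams; the helper name misread is made up for this illustration and is not anything the company actually ran.

```python
# The sentence the translator intended.
intended = ("UWAGA: Szczegółowe informacje dla użytkownika "
            "znajdują się na dołączonej płycie CD.")


def misread(text, written_as, read_as):
    """Encode text in one code page, then decode the same bytes in another.

    Hypothetical helper for illustration only.
    """
    return text.encode(written_as).decode(read_as)


# Written in code page 1250, printed as if it were code page 1252.
garbled = misread(intended, "cp1250", "cp1252")
print(garbled)

# The byte that code page 1250 uses for "ł" ...
l_stroke_byte = "ł".encode("cp1250")
print(l_stroke_byte.hex())

# ... and what code page 1252 displays for that same byte.
print(l_stroke_byte.decode("cp1252"))
```

The "ó" survives only because both code pages happen to put it at the same position (F3).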

I just hope their Polish customers can figure out what the text is supposed to say.

Okay, that was their practical exam. Here's yours:

I am running on a Spanish install of Windows 2003 and calling

WideCharToMultiByte(CP_OEMCP, WC_NO_BEST_FIT_CHARS,
                    pwzStr, -1, pszStr, nRet, NULL, pbUsedDefault);

When I pass this Unicode string

D:\Documents and Settings\ABC\Configuración local\ 

it returns the multibyte string below and doesn't set pbUsedDefault:

D:\Documents and Settings\ABC\Configuraci¢n local\

Why isn't pbUsedDefault set?

In fact, I'm going to add two bonus questions: (1) How should this customer change their code to get the path name converted correctly? (2) Why? The answer to the second question is the important one.

Comments (24)
  1. Anonymous says:

    He should call SHGetSpecialFolderPath with CSIDL_APPDATA as the folder parameter. Why? Because in different language installations the directory is called something different. Even when you think "ah, it will only be used in Spain, inside this very company" there will be someone who has their computer installed with English language Windows or something and your program may well not work correctly.

  2. Anonymous says:

    The conversion succeeded: oacute (U+00F3) translates to 0xA2 in code page 850 (OEM Latin 1).

    bonus 1: The problem is that he uses the wrong code page: CP_OEMCP instead of CP_ACP.

    bonus 2: When he passes the string to a Windows ANSI API, Windows converts back to Unicode using the current ANSI codepage (default = 1252 on a Spanish box) -> 0xA2 in codepage 1252 = cent character, as your display shows.

    If he wants to pass the string to a Windows API, he should use CP_ACP, which will convert to code page

  3. Anonymous says:

    The trick is in the phrase "it returns the multibyte string below" : it does not make sense to talk about a "multibyte string" without mentioning in what codepage it is encoded.

    What your customer probably meant to say is "when I pass the output of WideCharToMultiByte to MessageBoxA(), here’s what I see on my screen".

    The fact that it does not match what they see when calling MessageBoxW() with the original Unicode string is caused by the fact that MessageBoxA expects ANSI-encoded text, but this particular call to WideCharToMultiByte produces OEM-encoded text.

    So to answer the questions :

    (0) pbUsedDefault is not set because the conversion went fine

    (1) they should not change their conversion code, if what they really want is an OEM string!

    (2) they got confused because they did not keep their eye on the code page :-)

  4. Anonymous says:

    Here’s my theory:

    Windows multibyte APIs (the ones with an ‘A’ suffix) assume the system’s ANSI CP (CP_ACP) to be used for strings. The given code converts the unicode encoded path to CP_OEMCP, the system’s OEM codepage. OEM codepages are those used by DOS and with the FAT filesystem (I guess NTFS is unicode?).

    (0) pbUsedDefault is not set, because there is an "’o" (sorry, german keyboard. nodeadkeys) character in the OEM codepage and it was correctly converted to it. However, I guess you are using an ANSI API to output the converted string.

    (1) The customer should pass CP_ACP to WideCharToMultiByte,

    (2) because then he’ll get an ANSI string, which is the correct encoding for Win32 multibyte APIs.

  5. Anonymous says:

    Answer: pbUsedDefault isn’t set because no default characters were used, every character in the input string was available in the target encoding.

    1a. They /should/ switch to writing Unicode software, use UTF-8 and probably abandon Win32.

    1b. However they’re more likely to change CP_OEMCP to CP_ACP and recompile.

    2a. It’s 2006 already. Who wants to still debug code page problems a decade after they became irrelevant?

    2b. They’ve told the hopelessly overloaded WideCharToMultiByte function to convert to an OEM character encoding, in this case probably OEM 850 or OEM 858. But their actual display encoding is probably ANSI 1252, so the resulting string is "correct" but it’s useless and appears wrong. CP_ACP tells WideCharToMultiByte to use the ANSI codepage. This part of the function is not too buggy so it will probably work.

  6. Anonymous says:

    "D:Documents and SettingsABCConfiguración local"

    looks more like a Spanglish installation =oP

  7. Centaur says:

    I hope that monitor instruction doesn’t have a section in Russian…

  8. Anonymous says:

    [I don’t want to look like I’m feeding a troll. But…]

    Nick,

    2a: If nobody wants to debug such problems, why did you debug this one ? ;-)

    Now, some people don’t really have a choice. Believe it or not, not everyone writes software that talks to Windows only. I’ve been told that even in 2006, there are still quite a few devices out there which are not Unicode aware (and will not be in the next decade). And there are people who write Windows software who need to talk to such devices (I’m in that crowd). And if you ask them, they will tell you: OEM code page issues are anything but irrelevant.

  9. michkap says:

    It looks like the flag(s) to actually do something with those default character parameters is not passed? :-)

  10. Anonymous says:

    Michael,

    The docs are not very clear about this:

    <quote>

    lpUsedDefaultChar:

    Points to a flag that indicates whether a default character was used. The flag is set to TRUE if one or more wide characters in the source string cannot be represented in the specified code page.

    </quote>

    These 2 sentences contradict each other if there are chars that can’t be translated but WC_DEFAULTCHAR is not specified.

  11. Anonymous says:

    Serge:

    I didn’t debug this code, I just wrote out the answers to Raymond’s questions. The person quoted by Raymond is debugging, and most likely they shouldn’t be, because as we’ve seen they’re trying to use this string with the so-called ANSI Windows APIs.

    I’m well aware that most software doesn’t talk to Windows. I’ve never taken a job where I wrote Windows software, and on every occasion that I’ve had to program for Win32 (e.g. to help a friend) I’ve found the experience unpleasant and not to be recommended. To forestall your most likely next question: Raymond’s articles are generally interesting regardless of whether I like Windows.

    Michael:

    When lpDefaultChar is NULL and lpUsedDefaultChar is not, the system default character is used. The WC_NO_BEST_FIT_CHARS flag is passed in the example.

  12. Anonymous says:

    WriteConsoleA would have printed the string correctly.

  13. Anonymous says:

    8-bit characters?  How quaint.

  14. Anonymous says:

    Reminds me of this story:

    http://community.livejournal.com/velik_moguch/242083.html

    Short summary: a Russian girl asked her French friend to send her a book and wrote the address in an email. The French, not in the least surprised that Russian uses mostly accented Latin vowels, carefully wrote it down on the envelope.

  15. macbirdie says:

    A few wrong characters because of a bad codepage is not even merely bad compared to what I’m sometimes seeing in Polish translations of hardware manuals. ;)

    [Not that many of them are worth reading anyway]

  16. Dean Harding says:

    Nick: You’re right in that they should be writing Unicode software, but why abandon Win32? Those are two totally orthogonal suggestions, and the second has nothing to do with this discussion.

  17. Anonymous says:

    > I just hope their Polish customers can figure out what the text is supposed to say.

    Well, speaking for myself, I got used to this ³<->ł ¹<->ą mixup and my internal parser makes the "best match" automatically. The problem is:

    1) It is slow ("Detail" should be "Szczegół", not "Szczegó³", although this scores 7/8, so is quite easy)

    2) It shouldn’t be used at all -> this kinda mixup only shows me that I don’t want to buy anything from this company, but…

    3) Companies don’t really care (which is wrong) and we are used to this (e.g. nobody protests -> and *that* is the mistake)

  18. Anonymous says:

    Is there a technical reason for why there is no (SetACP) function in the winapi to set the active codepage on a per process (or thread) basis?

  19. Anonymous says:

    The fun starts when you are writing a console-mode application.

    – You need to call SetFileApisToOem() for the file APIs.

    – You need to call setlocale(LC_ALL, ".OCP") to set the OEM codepage for the locale functions.

    – You need to call _setmbcp(_MB_CP_LOCALE) to adjust the multi-byte string functions (_mbschr, etc) for the OEM code page.

    – Then you need to work around the bug that mangles the command line argv arguments.  (CharToAnsi is wrongly called on them.)

    I wrote a standard library routine that does all the above for my console apps: OemCodePageHell().

    Microsoft wrote their own workaround for their console apps and put it in ULIB.DLL.

  20. Anonymous says:

    Serge Wautier won.

  21. Dean Harding says:

    asdf: You can sort of do it with SetThreadLocale, which actually changes CP_THREAD_ACP. However, it’s pretty ugly.

    CP_ACP is the *system* code page for a reason: changing the code page affects a lot more than just what WideCharToMultiByte would do. It also affects things like resource loading (and then what happens if you load a resource, change the thread locale, then load another resource? They don’t match anymore! And it’s even weirder with the way Windows caches various resources [like dialogs]).

    Besides, CP_ACP is based off a user’s preferences – they’ve said "I understand Polish, and I like my interface to be in Polish, please." You shouldn’t go around trumping their preference.

  22. Anonymous says:

    Gideon, is your library open sourced? If so, where can I find it?

    I only knew about SetFileApisToOem and SetFileApisToAnsi, and didn’t know setlocale is in msvcrt.

    Can’t find much info on ulib atm. I’ll look into it if I find anything.

  23. Anonymous says:

    The library is closed source (sorry).  However I implemented the same OEM code page tricks for my Windows ‘ls’ console utility, http://utools.com/msls.htm.  It is open source under the GNU GPL license.

  24. Anonymous says:

    Thank you very much!
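
The explanation several commenters settle on (the conversion to the OEM code page succeeded, and the resulting bytes were then read as ANSI) can be sketched as follows. This is only an illustration under stated assumptions: Python's codecs stand in for WideCharToMultiByte, and code pages 850 and 1252 are the OEM and ANSI code pages the commenters guess for a Spanish system, not values confirmed by the customer.

```python
path = "D:\\Documents and Settings\\ABC\\Configuración local\\"

# Assumed code pages for a Spanish install (a guess taken from the comments).
OEM_CODE_PAGE = "cp850"
ANSI_CODE_PAGE = "cp1252"

# Roughly what converting with CP_OEMCP does. Python's strict encoder would
# raise an error if any character were missing from the code page; it does
# not, which mirrors the "used default" flag staying unset.
oem_bytes = path.encode(OEM_CODE_PAGE)
o_acute_byte = oem_bytes[path.index("ó")]
print(hex(o_acute_byte))

# Reading those OEM bytes as if they were ANSI, as an A-suffixed API would.
shown_as_ansi = oem_bytes.decode(ANSI_CODE_PAGE)
print(shown_as_ansi)

# Converting to the ANSI code page instead survives the round trip.
ansi_bytes = path.encode(ANSI_CODE_PAGE)
print(ansi_bytes.decode(ANSI_CODE_PAGE) == path)
```

Both byte strings are valid conversions; which one is right depends on which code page the consumer of the string expects.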
