Why does misinterpreting UTF-16LE Unicode text as ANSI tend to show up as just one character?


If you misinterpret ANSI text as Unicode, you usually get nonsense Chinese text. If you misinterpret Unicode text as ANSI, why do you usually just get the first character?

Okay, this one is a lot easier.

The letters of the Latin alphabet fit in the range U+0041 through U+007A. If you're using the UTF-16LE encoding (which is what Unicode means in the context of Windows), then the first byte will be the correct character, and the second byte will be zero, which an ANSI string function treats as the string terminator.

For example, (char*)L"Abc" will act like "A".

I remember looking at the registry and finding a registry key directly under HKEY_CURRENT_USER called simply S. In other words, the program stored its settings under HKEY_CURRENT_USER\S.

This bugged me enough that I dove in to figure out how this happened.

The program in question had a Windows 95 version and a Windows NT version. They compiled both versions from the same code base by using the TCHAR-style functions, so that when compiled for Windows 95, it was an ANSI program, and when compiled for Windows NT, it was Unicode.
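As a sketch of the pattern (the function name and registry path here are invented for illustration), TCHAR-style code looks like this. The same source compiles to ANSI calls when UNICODE is not defined and to Unicode calls when it is:

    #include <windows.h>
    #include <tchar.h>

    void SaveSettings()
    {
        HKEY key;
        // TEXT() expands to "..." in an ANSI build and L"..." in a
        // Unicode build; RegCreateKeyEx expands to RegCreateKeyExA
        // or RegCreateKeyExW to match.
        if (RegCreateKeyEx(HKEY_CURRENT_USER, TEXT("Software\\Contoso"),
                           0, nullptr, 0, KEY_WRITE, nullptr,
                           &key, nullptr) == ERROR_SUCCESS)
        {
            RegCloseKey(key);
        }
    }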

The program came with a helper DLL, which was also compiled as ANSI for Windows 95 and as Unicode for Windows NT. The name of the DLL was not inside an #ifdef, so even though the code was compiled twice, both versions of the DLL had the same name.

Furthermore, the .def file and the helper library's header file did not contain any #ifdefs either. So the Windows 95 version of HELPER.DLL had an exported function called CreateRegistryKey (say), which accepted an ANSI string. And the Windows NT version of HELPER.DLL also had an exported function called CreateRegistryKey, but which accepted a Unicode string.
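A sketch of what the shared header plausibly looked like (the function name comes from the article's "say"; the signature is my invention). There is no #ifdef anywhere, so both builds export the same undecorated name with incompatible string types:

    #include <windows.h>
    #include <tchar.h>

    // In the ANSI build, LPCTSTR is const char*.
    // In the Unicode build, LPCTSTR is const wchar_t*.
    // Either way, the .def file exports the bare name
    // CreateRegistryKey, so the two DLLs look interchangeable
    // as far as the loader is concerned.
    extern "C" BOOL WINAPI CreateRegistryKey(LPCTSTR keyName);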

The problem was that their Windows NT product shipped with the Windows 95 version of the helper DLL!

Since the DLL name was the same, and the function names were the same, the operating system happily loaded the DLL and imported the function name successfully, even though it was the wrong function.

As a result, the Windows NT version passed a Unicode string to a function that interpreted it as an ANSI string, and the registry key name Software became misinterpreted as just S.

There are a few ways of avoiding the problem.

The obvious one is to abandon the Windows 95 version of the product. Because c'mon now.

Okay, but let's go back in time to a period when supporting Windows 95 was still a reasonable thing to do.

One option is to give the Windows 95 and Windows NT versions of the DLL different names, say, HELPERA.DLL and HELPERW.DLL. That way, when a program linked to HELPERW.DLL but you accidentally put HELPERA.DLL in the product directory, you would get a "DLL not found" error instead of running ahead with the wrong DLL.

Mind you, this solution catches the problem only if it occurs at packaging time. But suppose the problem was that the code linked together some object files compiled in ANSI mode and some compiled in Unicode mode, say because you used the wrong version of a static library. Then the error would go undetected: both sets of object files look for a function named CreateRegistryKey, and if the module was linked with (say) HELPERW.LIB, then both sets of object files link to HELPERW.DLL, even though half of them thought they were linking to HELPERA.DLL.

What they should have done was change the names of the exports. Export two functions CreateRegistryKeyA and CreateRegistryKeyW. Use an inline helper function or a macro in the header file so that ANSI clients are directed to CreateRegistryKeyA and Unicode clients are directed to CreateRegistryKeyW. The implementation of the helper DLL need only implement the versions of the functions corresponding to the desired character set. In other words, HELPERA.DLL implements CreateRegistryKeyA and HELPERW.DLL implements CreateRegistryKeyW. (If you use macros, then this happens automatically when you implement CreateRegistryKey.)
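A sketch of what that header could look like, following the same convention that windows.h itself uses (again, the names and signature are illustrative):

    #include <windows.h>

    extern "C" BOOL WINAPI CreateRegistryKeyA(LPCSTR keyName);
    extern "C" BOOL WINAPI CreateRegistryKeyW(LPCWSTR keyName);

    #ifdef UNICODE
    #define CreateRegistryKey CreateRegistryKeyW
    #else
    #define CreateRegistryKey CreateRegistryKeyA
    #endif

    // HELPERA.DLL implements only CreateRegistryKeyA, and
    // HELPERW.DLL implements only CreateRegistryKeyW. Because of
    // the macro, implementation files that define
    // CreateRegistryKey automatically define the right one.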

This design solves a few problems.

  • If you package the wrong DLL, the file names will not match and you'll get an error at load time.

  • If you have a mix of object files, you will get a linker error because HELPERA.LIB won't have entries for the Unicode versions, and vice versa.

  • If you really needed to support the mixed version, you could link to both HELPERA.LIB and HELPERW.LIB. Each object file will pull the function it needs from the appropriate import library and will bind to the corresponding DLL at runtime.

  • In the future, you might decide to merge the helper libraries into a single helper library that supports both character sets. Giving the functions distinct names allows this to happen. (This is what most of Windows does. For example, kernel32.dll contains both ANSI and Unicode implementations of many functions, distinguished by function name.)

Moral of the story: If two functions are different, give them different names. (If you use mangled names, then the names will already be different due to different mangling.)
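For example (a sketch; the exact decorated names vary by compiler and target), C++ linkage alone would have caught the mix-up, because the parameter type participates in the mangled name:

    #include <windows.h>

    // Without extern "C", MSVC decorates these two declarations
    // differently (roughly ?CreateRegistryKey@@YAHPBD@Z versus
    // ?CreateRegistryKey@@YAHPB_W@Z on x86), so an object file
    // compiled against one can never bind to the other by accident.
    BOOL CreateRegistryKey(const char* keyName);    // ANSI build
    BOOL CreateRegistryKey(const wchar_t* keyName); // Unicode build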

Related: What is __wchar_t (with the leading double underscores) and why am I getting errors about it?

Comments (7)
  1. skSdnW says:

    “This is what most of Windows does”: parts of kernel32 and shell32 are weird in that they export A and W functions but also a non-suffixed version that is the same as the A version. GetProcAddress compatibility for ported 16-bit apps?

  2. Brian says:

    Ah, you didn’t mention UNICOWS (aka “Microsoft Layer for Unicode” or MSLU). A great solution to a problem that had mostly gone away by the time it shipped (it was released about the time that WinXP put a nail through the heart of Win9x).

  3. Antonio Rodríguez says:

    Joel, of Joel on Software, wrote an article back in 2003 (15 years ago!) titled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” ( https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ ).

    I completely agree with him. But I would like to add one point: every developer *should* be able to recognize, at least, the most common types of mojibake (UTF-8 to/from ANSI, Unicode [UTF-16] to/from ANSI, and ANSI to/from OEM). Each of the first four is very distinctive, and recognizing it helps your psychic powers tell you in which part of the code the bug lies. The last two, involving the OEM character sets/code pages, are a bit more difficult to spot, but recognizing them is useful if you need to deal with the command line or console applications.

    1. French Guy says:

      ANSI isn’t a single character set, but multiple 8-bit character sets with significant overlap. Treating any other encoding as UTF-8 is overwhelmingly likely to produce invalid data. Treating UTF-8 as an ANSI encoding (or OEM, which is also an ASCII superset) will garble non-ASCII characters into sequences of non-ASCII characters, making the result harder to read (with how much harder depending on the language). Treating UTF-16 as UTF-8 will clip the string to 0 or 1 characters (depending on endianness and assuming NUL as the string terminator) if the string only uses ASCII (otherwise, it won’t be valid). Treating UTF-8 as UTF-16 will produce completely unrelated characters (usually Chinese). And using the wrong 8-bit ASCII superset will substitute non-ASCII characters for others.

      1. Antonio Rodríguez says:

        By definition, treating a string with encoding X as encoding Y is incorrect, no matter what X and Y are, and is thus a bug. It doesn’t help to say “ANSI treated as UTF-8 will likely produce illegal byte sequences”: when that happens, the code is buggy or something has gone horribly wrong, and being told “it’s illegal” is of no use.

        About the confusion over ANSI: right, technically there is no such thing as an “ANSI encoding”. But in informal talk, “ANSI” almost always refers to Windows-1252, itself a superset of ISO 8859-1 (a.k.a. Latin-1). In much the same way, “OEM” usually refers to DOS code page 437 (or one of its variations, such as 850, if you are outside the USA).

        The nearest things to an “ANSI encoding” are the ISO 8859-x encodings and the Windows-125x code pages, which are related (but not equivalent) to several of the ISO 8859-x encodings. But if there are several Windows code pages, why is “ANSI” a synonym for Windows-1252? I can only speculate, but Windows-1252 has been used since the first version of Windows (1985), while the Eastern European variants were introduced in the 90s, when Windows NT was about to bring Unicode support, a better solution to the problem. So those variants never became as popular as Windows-1252.

Comments are closed.
