Converting to Unicode usually involves, you know, some sort of conversion


A colleague was investigating a problem with a third-party application and found an unusual window class name: L"整瑳整瑳". He remarked, "This looks quite odd and could be some problem with the application."

The string is nonsense in Chinese, but I immediately recognized what was up.

Here's a hint: Rewrite the string as

L"\x6574" L"\x7473" L"\x6574" L"\x7473"

Still don't see it? How about looking at the byte sequence, remembering that Windows uses UTF-16LE.

0x74 0x65 0x73 0x74 0x74 0x65 0x73 0x74

Okay, maybe you don't have your ASCII table memorized.

0x74 0x65 0x73 0x74 0x74 0x65 0x73 0x74
t e s t t e s t

That's right, the application took the ASCII string "testtest" and just treated it as a Unicode string without actually converting it to Unicode. When the compiler complained "Cannot convert char * to wchar_t *", they just stuck in a cast to make the compiler shut up.

// This code is wrong
WNDCLASSW wc;
wc.lpszClassName = (LPWSTR)"testtest";

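They were lucky that the compiler happened to put two null bytes at the end of the "testtest" string. A char string literal guarantees only a single null byte, but a wide string needs a two-byte null terminator.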
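To make the difference between casting and converting concrete, here is a minimal portable sketch. The function names are made up for illustration, and the widening routine assumes 7-bit ASCII input; on Windows, a real conversion would normally go through something like MultiByteToWideChar instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What the cast effectively does: view each pair of bytes as one
   little-endian 16-bit code unit. Returns the number of units written. */
size_t reinterpret_as_utf16le(const char *bytes, size_t byte_count,
                              uint16_t *out, size_t out_capacity)
{
    size_t units = 0;
    for (size_t i = 0; i + 1 < byte_count && units < out_capacity; i += 2) {
        uint16_t low  = (unsigned char)bytes[i];
        uint16_t high = (unsigned char)bytes[i + 1];
        out[units++] = (uint16_t)(low | (high << 8));
    }
    return units;
}

/* What a conversion does for ASCII input: each byte becomes its own
   code unit with the same numeric value. Assumption: ASCII only. */
size_t widen_ascii(const char *bytes, size_t byte_count,
                   uint16_t *out, size_t out_capacity)
{
    size_t units = 0;
    for (size_t i = 0; i < byte_count && units < out_capacity; i++) {
        assert((unsigned char)bytes[i] < 0x80); /* not valid for non-ASCII */
        out[units++] = (unsigned char)bytes[i];
    }
    return units;
}
```

Fed the eight bytes of "testtest", the first function produces the four code units 0x6574 0x7473 0x6574 0x7473 (the mystery class name), while the second produces eight code units starting 0x0074 0x0065 0x0073 0x0074.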

Bonus psychic powers: Actually, I have a theory as to how this happened that doesn't involve maliciousness. (This is generally a good mindset to maintain, since most of the time, when people cause a problem, it's not willful; it's accidental.) Consider a library with the following interface header file:

// mylib.h

#ifdef __cplusplus
extern "C" {
#endif

BOOL RegisterWindowClass(LPCTSTR pszClassName);

#ifdef __cplusplus
} // extern "C"
#endif

Somebody uses this header file like this:

#include <mylib.h>

BOOL Initialize()
{
    return RegisterWindowClass(TEXT("testtest"));
}

So far so good.

Meanwhile, the library implementation goes like this:

#define UNICODE
#define _UNICODE

#include <mylib.h>

LRESULT CALLBACK StandardWndProc(HWND, UINT, WPARAM, LPARAM);

BOOL RegisterWindowClass(LPCTSTR pszClassName)
{
    WNDCLASS wc = { 0, StandardWndProc, 0, 0, g_hInstance,
                    LoadIcon(NULL, IDI_APPLICATION),
                    LoadCursor(NULL, IDC_ARROW),
                    (HBRUSH)(COLOR_WINDOW + 1),
                    NULL, pszClassName };
    return RegisterClass(&wc);
}

The two files both compile successfully, and they even link together. Unfortunately, one of them was compiled with Unicode disabled, and the other was compiled with Unicode enabled. Since the header file uses LPCTSTR, the actual declaration of RegisterWindowClass changes depending on whether the code that includes the header file is compiled as Unicode or ANSI.

Result: If one file is compiled as ANSI and the other is compiled as Unicode, then one will pass an ANSI string, which the other will receive and treat as Unicode.
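Here is a rough sketch of why the same header line can mean two different things. SKETCH_TCHAR, SKETCH_TEXT, SKETCH_UNICODE, and CountCharacters are simplified, hypothetical stand-ins for the real TCHAR machinery, not the actual Windows definitions.

```c
#include <stddef.h>
#include <wchar.h>

/* Simplified stand-ins for TCHAR and TEXT(); the names are hypothetical.
   Define SKETCH_UNICODE before this point to get the wide flavor. */
#ifdef SKETCH_UNICODE
typedef wchar_t SKETCH_TCHAR;
#define SKETCH_TEXT(s) L##s
#else
typedef char SKETCH_TCHAR;
#define SKETCH_TEXT(s) s
#endif

/* The same source line declares a different signature depending on the
   switch: const char * in one build, const wchar_t * in the other. */
size_t CountCharacters(const SKETCH_TCHAR *psz)
{
    size_t n = 0;
    while (psz[n] != 0) {
        n++;
    }
    return n;
}
```

Within a single build, the caller and the function agree on the character type. The trouble starts only when the caller and the implementation are compiled with different settings, and with extern "C" linkage nothing in the function's name records which one was used.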

This is why functions in Windows which are dependent on whether the caller is compiled as ANSI or Unicode are really two functions, one with the A suffix (for ANSI) and another with the W suffix (for Wnicode?), and the generic name is really a macro that forwards to one or the other. It prevents TCHARs from sneaking past the compiler and ending up being interpreted differently by the two sides.
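Applied to the hypothetical library above, the same pattern might look something like the following sketch. RegisterWindowClassA, RegisterWindowClassW, g_registeredName, and MAX_CLASS_NAME are invented for illustration; the "registration" just stores the name, and the narrow version assumes 7-bit ASCII rather than doing a full code page conversion.

```c
#include <stddef.h>
#include <wchar.h>

#define MAX_CLASS_NAME 64

/* Hypothetical storage standing in for the real registration. */
static wchar_t g_registeredName[MAX_CLASS_NAME];

/* Wide entry point: the parameter type no longer depends on UNICODE. */
int RegisterWindowClassW(const wchar_t *pszClassName)
{
    size_t len = wcslen(pszClassName);
    if (len == 0 || len >= MAX_CLASS_NAME) {
        return 0;
    }
    wmemcpy(g_registeredName, pszClassName, len + 1);
    return 1;
}

/* Narrow entry point: performs an actual conversion, then forwards.
   Assumption: 7-bit ASCII input; other encodings need a real converter. */
int RegisterWindowClassA(const char *pszClassName)
{
    wchar_t wide[MAX_CLASS_NAME];
    size_t i = 0;
    for (; pszClassName[i] != '\0'; i++) {
        if (i + 1 >= MAX_CLASS_NAME ||
            (unsigned char)pszClassName[i] >= 0x80) {
            return 0;
        }
        wide[i] = (wchar_t)pszClassName[i];
    }
    wide[i] = L'\0';
    return RegisterWindowClassW(wide);
}

/* The generic name is only a macro; each caller binds to an explicitly
   typed function, so a mismatch becomes a compile error, not garbage. */
#ifdef UNICODE
#define RegisterWindowClass RegisterWindowClassW
#else
#define RegisterWindowClass RegisterWindowClassA
#endif
```

With this shape, an ANSI caller resolves to RegisterWindowClassA and a Unicode caller to RegisterWindowClassW, regardless of how the library itself was compiled.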

Comments (36)
  1. Anonymous says:

    What kind of self-respecting programmer doesn't have the ASCII table memorized?</snark>

  2. Anonymous says:

    Raymond's bonus insight is, for me, the bigger message to remember: don't use LPCTSTR and its variants in your interface headers unless your library can really handle whatever T means for the library user, or you find a way to report an error when the library's compiled code and the library user's compiled code disagree.

    Thanks Raymond, I've never seen such cases even though they surely happen and I think should be covered.

  3. Anonymous says:

    The saying goes "never attribute to malice that which can be adequately explained by stupidity." aka Hanlon's Razor.

  4. Anonymous says:

    W => WCHAR => Wide Char, a common name for UTF-16 chars (which Windows often calls Unicode).

    Yes, the naming situation could be better.

    In general, I tend to put only LPWSTR interface, unless I'm still targeting Win98, which few people do these days.

  5. Anonymous says:

    With a nod to Michael Kaplan, I suggest that "Wnicode" be pronounced "Double Secret Unicode".

    I've always assumed the 'W' in those functions stood for 'Wide', like stdlib's wchar_t.

  6. Anonymous says:

    Intel's IC-86 cross-compiler embedded type information in its object files. If you made this type of error (#define something differently, include a file that gets interpreted two different ways due to the defines, and otherwise no compilation error) then your project would fail to link. That compiler caught just this situation for me, where I had:

    #define W 6

    in one file, and

    #define W 4

    in another.

    and a function foo(int x[W]);

    and the linker refused to link a call to foo(int[6]) to a definition of foo(int[4]).

    C++ compilers embed type information to resolve overloads, but IC86 was the only compiler I know of that did this for ordinary C code. Perhaps we could take a lesson from that.

  7. Anonymous says:

    Google Translate manages to make perfect sense of "整瑳", translating it as "The whole luster of gems"

  8. Anonymous says:

    Google Translate manages to make perfect sense of "整瑳", translating it as "The whole luster of gems"

    It also translates "Henri le cambrioleur" as "Henry the burglar", is this random string day?

  9. Did Henry the burglar steal the whole luster of gems?

  10. Did Henry the burglar steal the whole luster of gems?

  11. Anonymous says:

    I can't help but feel that the Unicode implementation in Windows has ended up being rather unfortunate. At the time, I think it looked like 16 bits would be enough for any character, so going for UCS-2 made some sense. Now however, you still can't assume that any character fits into two bytes. In retrospect, a way to switch the existing ANSI functions to UTF-8 would have been much easier. I suspect though that Windows' Unicode implementation dates from before UTF-8…

    [You can use Wikipedia to support your suspicion. -Raymond]
  12. Raymond, you're wrong. The line was:

    wc.lpszClassName = reinterpret_cast<LPWSTR>("testtest");

  13. Anonymous says:

    Maybe the author just thought that s/he was dealing with UTF-8, in which case, no, there isn't a conversion: valid ASCII is valid UTF-8, you're done. The author may have been used to writing for the web, where UTF-8 is preferred by the standard and used by about two-thirds of web pages ( w3techs.com/…/all ), or may have been writing for Linux, Android, or anything Apple, which also use UTF-8. So I'd recommend maybe adding a caveat to your title, e.g., "Converting to Unicode usually involves, you know, some sort of conversion (where by `usually' I mean when writing for Windows)"

  14. Anonymous says:

    Did this Unicode cast bug have anything to do with the problem the third-party app had? I assume not.

    Ah, the Unicode mess. When will they fix it correctly without making it even worse for the developers trying to use it?

    Reminds me of when Microsoft tried to fix Side-by-side Assemblies but in the end just made it even worse for the developers. Having to support even more confusing and complicated rules depending on what OS it is and the time of day + whether you had coffee the other day with one or two sugars.

    Come to think of it, they are at it again, "fixing" things. This time it's Windows 8.

  15. Anonymous says:

    @WilliamF

    Does that mean microsoft isn't evil, "just" morbidly stupid ?

    Perhaps evil comes from stupidity that would make a bit of sense.

    (It would at least explain why christians do such horrible things in the name of their imaginary friend)

  16. @b: Just because you can often get away with it with UTF-8 (at least in the West), doesn't make it right.

  17. Anonymous says:

    "Just because you can often get away with it with UTF-8 (at least in the West), doesn't make it right"

    No, by definition, all ASCII text is valid and equivalent in UTF-8. An explicit design goal of UTF-8 is that you can always do that, and it's right.

  18. Anonymous says:

    Aneurin Price: I believe the point is that the only character encoding that doesn't have to be converted to UTF-8 is ASCII. This works out fine for Americans who rarely need any letters other than the standard 26 unaccented Latin characters, but not for the vast majority of the rest of the world.

  19. Anonymous says:

    The best is when your Unicode application is using an API that has LPCTSTRs. Except the library is really LPCSTR only and they must have used the T because it was the type to use. I guess in hindsight I could have undefined (_)UNICODE before #including their city API and then just called everything without reinterpreting, and then let someone put a capital L in front of everything if it was fixed one day (or delete the casts if the API header was fixed and the compiler complained about WSTRs going into a STR). Oh well, 7 years ago, and it was more fun to complain in the comments and code at the time I guess.

  20. Ah, hence the 'extern "C"' which presumably removes all that goop that the linker would otherwise be able to use to distinguish between overloads of RegisterWindowClass.

    [A more likely possibility is that the function is exported by one DLL and consumed by another. The linker doesn't do inter-module analysis. -Raymond]
  21. Anonymous says:

    This is one reason that it's a good idea to avoid TCHARs entirely.  You don't need your stuff to compile on Win9x anymore, so just always use the UTF-16 version.  Our applications internally use UTF-8 for everything, then do whatever necessary conversion before calling the OS in the OS abstraction library.  UTF-16 is one-to-one with UTF-8, so this works losslessly.

    I prefer explicitly specifying the "W" when calling a Windows function to avoid being dependent upon particular #defines.  The macro thing kind of sucks sometimes – sometimes we'll get a linker error because some class of ours has a function named SendMessage, and something calling it includes Windows.h.

  22. Anonymous says:

    Indeed, I sometimes wish there was a macro that disabled TCHAR and the function aliases entirely, so I could be sure all of my code only calls the A or W versions explicitly.

  23. @Gabe, @Aneurin Price: As Raymond likes to say “Everybody speaks English, right?”

    I faced a website which uses UTF-8 as char encoding but believes one char is one byte. And this assumption is, well, wrong… that's why messages in Russian (Cyrillic) get randomly truncated.

  24. Will this actually compile?  I thought that the linker would complain about RegisterWindowClass taking a const char * in one compilation unit and a const unsigned short * in the other.

  25. Anonymous says:

    @acq (don't use LPCTSTR and variants in your interface headers unless…): That's too harsh. You can easily solve this exclusively on the library side. Compile it so that you can see which is the Unicode (and which MBCS) version, e.g. by appending U to the *.lib name, and in the header(s) of the library do a small #ifndef UNICODE #error we want unicode here.

  26. "Indeed, I sometimes wish there was a macro that disabled TCHAR and the function aliases entirely, so I could be sure all of my code only calls the A or W versions explicitly."

    I second that.  I really like the sound of it – it would avoid accidental mistakes (and also avoid Myria's problem of name conflicts with the macros).

  27. DWalker59 says:

    @Gabe: "…Americans who rarely need any letters other than the standard 26 unaccented Latin characters"  

    Don't we get to use both uppercase and lowercase letters?  That's 52 right there!  

  28. Anonymous says:

    This reminds me of people who think they can convert files to different formats by renaming the extension.

  29. Anonymous says:

    @aylivex: "I faced a website which uses UTF-8 as char encoding and but believes one char is one byte. And this assumption is, well, wrong…"

    It's the definition of "char" in C and C++. ;)

  30. Anonymous says:

    @Frederik Slijkerman: "In retrospect, a way to switch the existing ANSI functions to UTF-8 would have been much easier."

    If Microsoft were motivated to do that, they probably could still do it by allowing CP_UTF8 to be used as a code page instead of just as an argument to MultiByteToWideChar/WideCharToMultiByte.

  31. Anonymous says:

    @DWalker: It is naïve to assume that English can entirely be written without diacritics.

  32. Medinoc says:

    @Those who want to disable TCHAR: What will you do the day Windows goes UTF-32? ;-)

    More seriously, I've noticed a problem with the FooA/FooW functions, or rather a problem with MFC: It simply uses TCHARs and the Foo function names in its declaration, so it doesn't allow the user to choose which to use, which can be a problem when one user needs to use explicitly one version.

  33. Medinoc says:

    PS: What's the problem with translating "Henri le cambrioleur" into "Henry the burglar"?

  34. Anonymous says:

    So, the decision to use UTF-16 BMP for Unicode in Windows instead of UTF-8 – was it malicious or can it be explained by stupidity?

    [Why didn't NASA use the Space Shuttle to rescue the Apollo 13 astronauts? -Raymond]
  35. Anonymous says:

    @Maxim: this has been answered many many times – by the time UTF-8 was developed, NT was already late, and replacing UCS-2 with UTF-8 was not something you'd want to do at that stage anyway.

  36. Anonymous says:

    So you consider the use of UTF-16 in Java also malicious and/or stupid? Having Windows, Java and .NET using the same approach is actually a very good thing.

    On the contrary, the use of UTF-8 in database interfaces can make life very complicated. How many characters can you store in/read from a VARCHAR(40) database column when the database uses UTF8 for the length semantics? What maximum number of characters do you allow for the input field in your business application which corresponds to this database field?

    How exactly is UTF-8 supposed to solve these problems more easily than UTF-16?

Comments are closed.