Before claiming that a function doesn’t work, you should check what you’re passing to it and what it returns

Before claiming that a function doesn't work, you should check what you're passing to it and what it returns, because it may be that the function is behaving just fine and the problem is elsewhere.

The GetCurrentDirectoryW function does not appear to support directories with Unicode characters in their names.

wchar_t currentDirectory[MAX_PATH];
GetCurrentDirectoryW(MAX_PATH, currentDirectory);
wcout << currentDirectory << endl;

The correct directory name is obtained if the name contains only ASCII characters, but the string is truncated at the first non-ASCII character.

If you step through the code in the debugger, you'll see that the GetCurrentDirectoryW function is working just fine. The buffer is filled with the current directory, including the non-ASCII characters. The problem is that the wcout stream stops printing the directory name at the first non-ASCII character. And that's because the default locale for wcout is the "C" locale, and the "C" locale is "the minimal environment for C translation." The "C" locale is useless for actual work involving, you know, locales. You will have to do some language-specific munging to get the characters to reach the screen in the format you want, the details of which are not the point of today's topic.
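
(For the curious, one approach that happens to work with the Microsoft C runtime, sketched here rather than offered as the official answer, is to switch stdout into UTF-16 text mode before touching wcout, so the wide characters are not squeezed through the narrow "C" locale on the way out.)

#include <stdio.h>   // _fileno
#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode
#include <iostream>
#include <windows.h>

int main()
{
    // Microsoft-specific: put stdout into UTF-16 text mode so that
    // wide output reaches the console without an 8-bit conversion.
    _setmode(_fileno(stdout), _O_U16TEXT);

    wchar_t currentDirectory[MAX_PATH];
    GetCurrentDirectoryW(MAX_PATH, currentDirectory);
    std::wcout << currentDirectory << std::endl;
    return 0;
}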

In other words, the bug was not in the GetCurrentDirectoryW function. It was in what you did with the result of the GetCurrentDirectoryW function.

Here's another example of thinking the problem is in a function when it isn't:

The SetWindowTextW function does not appear to support Unicode, despite its name.

wstring line;
wifstream file("test"); // this file is in Unicode
getline(file, line);
SetWindowTextW(hwnd, line.c_str());

If you look at the line variable before you even get around to calling SetWindowTextW, you'll see that it does not contain the text from your Unicode file. The problem is that the default wifstream reads the text as an 8-bit file, and then internally converts it (according to the lame "C" locale) to Unicode. If the original file is already Unicode, you're doing a double conversion and things don't go well. You then pass this incorrectly-converted string to SetWindowTextW, which naturally displays something different from what you intended.
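
(For reference, a rough sketch of one way to read such a file correctly, assuming it is UTF-16LE with a byte order mark and that you are using the Visual C++ library: imbue the stream with a UTF-16 conversion facet so the bytes are not run through the narrow "C" locale first. std::codecvt_utf16 is deprecated in C++17 but still available.)

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wifstream file("test", std::ios::binary);
    // Interpret the raw bytes as UTF-16 (little-endian by default, BOM
    // consumed if present) instead of converting 8-bit text through the
    // "C" locale.
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF,
            std::codecvt_mode(std::consume_header | std::little_endian)>));

    std::wstring line;
    std::getline(file, line);
    // line now holds the first line of the file; passing line.c_str()
    // to SetWindowTextW would display the text you intended.
    return 0;
}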

Again, the point is not to delve into the intricacies of wifstream. The point is that the problem occurred even before you called SetWindowTextW. The observed behavior, then, is simply a case of Garbage In, Garbage Out.

Here's another example from a few years ago.

Comments (53)
  1. Unicode says:

    Does Windows by now properly support UTF-8 as the ANSI code page, for console I/O and the like?

    Last I heard, iostreams on Windows was hopelessly broken…

  2. Joshua says:

    Yes, the iostreams are broken. When Unicode broke its original promise of fitting in wchar_t, it should have been abandoned. The standard libraries are littered with functions that can't work. The fix was made long ago and it is UTF-8. Yet MS went on and has forced a massive division.

    [Remember, Windows was using Unicode before UTF-8 was invented. So it's everybody else who created the division by going in a different direction. -Raymond]
  3. Charles Babbage says:

    On two occasions I have been asked, — "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" In one case a member of the Upper, and in the other a member of the Lower, House put this question. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

  4. Chris Crowther says:

    @Joshua: Whose definition of wchar_t?  Because they're not the same – it's compiler-dependent, afaik.  The C90 standard only said it has to be big enough to represent all the possible character codes for supported locales.  So in theory a 32-bit Unicode character will fit in wchar_t, so long as the compiler considers it a supported locale.

  5. Joshua says:

    @Chris Crowther: But sticking UTF-16 in wchar_t was always wrong. UCS-2 was fine.

  6. dave says:

    Nah…. if you can't have fixed-size characters (which for a brief while was the promise of Unicode, with 16 bits appearing to be all we'd ever need) then the next best choice is a byte-stream.  UTF-16 is the worst of both worlds.

    I used to be a firm believer in "16 bits good, 8 bits bad" (formed when 16 bits meant UCS-2) but several years of having to deal with UTF-16 has disabused me of that notion.  But here's a much more eloquent writeup:…/verity_stob_unicode

    As a side note, how come we don't often see the internets sneering at the notion "65535 characters ought to be enough for anybody" the way they sneer at the infamous 640KB statement?

  7. Joshua says:

    Methinks parkrrr has never had to deal with split surrogate pairs.

  8. Wear says:

    @Dave: because 65535 characters is enough for a lot of people. Hell, 128 characters is enough for a lot of people. Now you could rewrite Windows entirely to use UTF-8 and cause problems for everyone, or you could stick with the currently mostly-working system that only causes problems for people doing major internationalization work. Which is the better option?

    One could also make the argument that the people doing major internationalization ought to know what they are doing and so the difference between using UTF-8 and UTF-16 is minimal.

  9. Joker_vD says:

    I am really interested to know why wcout's locale has any effect on how much of a wstring gets printed.

  10. Crescens2k says:


    Wouldn't wchar_t only storing UCS-2 be against the standard? The definition of wchar_t from the standard is "Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1)."

    Considering that VC supports Unicode as its largest extended character set, UCS-2 is just plain wrong. But again, I think this is history at work again, since IIRC, when Windows first used Unicode there was no UTF-16, only UCS-2. As the Unicode standard started to add more, then it became obvious that sticking with UCS-2 would be bad, but at the same time changing the character type for wchar_t was also bad.

    Anyway, why would surrogate pairs be a bad thing when UTF-8 does similar things? Anyone who is used to dealing with character sets that can use more than one value per character would have no difficulty with UTF-8, UTF-16 or any of the multi-byte locale based character sets. Methinks the problem is always in the programmer who never accounts for these things.

  11. Crescens2k says:


    The problem is the STL. There have been proposals to update it to be Unicode aware, but right now it isn't Unicode friendly at all.

  12. alegr1 says:

    >Wouldn't wchar_t only storing UCS-2 be against the standard?

    Only if you have a working time machine. Ask Raymond for the pass to the secret lab.

  13. Unicode says:

    @Joshua: I explicitly spoke about UTF-8 there, and I do not know of any reason for why it is and stays broken. Anyway, standard Facets are broken for Unicode whatever width one takes, yea, but that's another higher-layer issue.

    Regarding fixed-size characters: If one takes that to mean graphemes (aka printing characters and the like) instead of codepoints, even UTF-32 is not fixed-size, and never was.

    @Wear: Even US-English uses characters outside ASCII.

    @Crescens2k: Surrogate pairs are worse than multi-byte codepoints in UTF-8, because they a) Make encoding details bleed into the character set definition, b) Too many people don't test them properly (laziness and ignorance are widespread there), and c) They gull people into thinking one codeunit is one character (and if they avoid that trap, one codepoint is one character). UTF-8 is far less susceptible to that, because multi-codeunit codepoints are far more common, thus people learn the error of their ways nearly immediately.

    @alegr1: Being against the standard (which imho is equivalent to being wrong, especially here) is independent of any reasons for why the standard is not followed.

    Yes, Windows has good reason, but that's beside that point.

  14. John Doe says:

    For efficiency, you'd rather stick with UTF-16 minus composition and surrogates.

    For interoperability, you'd rather stick with UTF-8 and deal with composition and surrogates.  This doesn't map efficiently on Windows, so you can stick with UTF-16 if that's your priority, sacrificing everyone else.

    Even with UTF-32 or whatever it is that can handle full code units, you still need to think about composition, at the cost of extra space you'll probably never need to use.

    Damn Unicode and its composed characters, versions and useless characters, it ought to be called Multicode-dings already.

    Oh, and I almost forgot about RTL, LTR…

  15. Nico says:

    @Charles Babbage:  Love it.  Big fan of your work.

  16. Crescens2k says:


    a) History and compatibility make things messy, you know. What would you have done? With the change between Unicode 1 and 2, would you have just redone all the sizes and broken existing Unicode-conformant code? Back then they were struggling to get people to change over, and just breaking existing code wouldn't have helped the cause any.

    b) Programmers are programmers.

    c) But one code point isn't always one character even without surrogates. The Unicode version 1 standard defined several combining characters. These have increased dramatically ever since. Also, who gulls people into thinking one code-point is one character? From the Unicode FAQ "No. The first version of Unicode was a 16-bit encoding, from 1991 to 1995, but starting with Unicode 2.0 (July, 1996), it has not been a 16-bit encoding. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit." Or from the standard itself "UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5." Unicode itself doesn't gull anyone into thinking it is a fixed size encoding.

  17. Joshua says:

    The consequence of using UTF-8 is you tend to break exactly once, the first time you move outside of 7 bit ASCII, rather than three times, the first time out of Windows-1252, the first time out of the local encoding, and the first time out of the Unicode basic plane.

    Furthermore, opendir(), readdir(), closedir() are standard. What do you think happens when processing a directory on Windows with readdir() and fopen() when encountering a file whose filename is not in the Unicode basic plane?

    Answer: it doesn't work.

  18. Unicode says:


    I know why a) was done (and why it is making the best of a bad deal), just listing it as a bad point.

    Point b) is not that programmers can be lazy or ignorant, but that it is quite rampant there. Many are proud of doing it the wrong way and defend it to the death.

    About point c), there's a definition of character in the official document, which makes it denote, depending on context, a codepoint as well as a grapheme (one or more codepoints). The point was that the obviously variable length of a UTF-8 "character" makes the error less prevalent, more naturally avoided, and more easily corrected there.

    Anyway, do you know why UTF-8 with iostreams on Windows was, is, and hopefully will not stay broken for many more decades?

  19. theultramage says:

    You can do UTF-8 output on Windows, just do SetConsoleOutputCP(CP_UTF8) or _setmode(_fileno(stdout), _O_U8TEXT). You do need to switch the font to something like Lucida Console, though. I assume that there's a toggle for iostreams somewhere as well.
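
    (A minimal sketch of the SetConsoleOutputCP approach described above; it assumes the console font actually contains the glyphs, and the string literal is hand-encoded as UTF-8 purely for illustration.)

        #include <windows.h>
        #include <cstdio>

        int main()
        {
            SetConsoleOutputCP(CP_UTF8);                     // console now decodes output bytes as UTF-8
            const char utf8[] = "Gr\xC3\xBC\xC3\x9F" "e\n";  // "Grüße" hand-encoded as UTF-8
            std::fputs(utf8, stdout);
            return 0;
        }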

  20. Joshua says:

    @theultramage: try C:> chcp 97001

    The console editor barfs pretty badly, but UTF-8 output to the console suddenly starts working. I suspect all bugs are completely fixable with no backwards compatibility problems. (And if there are any, the workaround is: don't run the broken applications under chcp 97001.) And since there is no other way to convince the system to use a UTF-anything console …

    [How do you fix IsDBCSLeadByte? -Raymond]
  21. Crescens2k says:


    Invalid code page

    Is what cmd gives me.


    Well, there are several reasons, but iostreams just isn't Unicode friendly. If you want to write UTF-8 text, right now it is better to just use the CRT functions or the Windows functions.

  22. DWalker says:

    It's funny that no one has yet mentioned anything about the actual point of Raymond's post.  :-)

  23. parkrrrr says:

    It's not true that UTF-8 is "the" fix. There are other fixes for the problem, too, including the far saner* UTF-16.

    * Relative sanity of UTF-16 not guaranteed and may vary with user's primary language.

  24. Azarien says:

    @Raymond: "How do you fix IsDBCSLeadByte?"

    You don't. There are issues that deserve to stay broken for the greater good.

  25. Joshua says:

    @Crescens2k: Oops I got it mixed up with the CPT Code for Initial Eval. The number is 65001.

    [How do you fix IsDBCSLeadByte? -Raymond]

    Trivial. Of all the assumptions about code page byte length, you just had to fix the one that isn't broken. In addition, this is one place where you can get away with all the breaking changes you want, for two reasons.

    1) chcp 65001 is only ever being attempted for this use, and right now people expect some breakage so the behavior is completely opt-in.

    2) It's the only hope for ever getting Unicode console as nobody expects null bytes on TEXT STDIN or STDOUT.

    Don't point me to Powershell's new console. It's a lost cause because to put back what is missing is to need to go right back down this same rabbit hole again after twice as much work.

    [Not sure what you mean by "isn't broken". Applications assume that if IsDBCSLeadByte is true, then the next character is the final byte of the character. (This is required by DBCS, because a lead byte in trail position is a trail byte.) Applications like, say, Explorer. -Raymond]
  26. Joshua says:

    Raymond, OEMCP not ANSICP. Explorer's not going to see it at all. I'm pretty sure that IsDBCSLeadByteEx returns false if not operating on the magic 5 OEMCP that actually are MBCS and can stay that way.

    This leaves only conhost.exe (the editor itself), cmd.exe, and find.exe. There really aren't that many console programs that process text (as opposed to simply passing it around) in Windows.

    [Oh, sorry, I missed that this was setting only the console code page. I don't think this changes OEMCP, though. I'll have to play with it. -Raymond]

    [chcp 65001 doesn't affect GetOEMCP(), so console apps that use WideCharToMultiByte(CP_OEMCP) [i.e., any console app that has localized text] will generate garbage output. (Note also that all these comments have nothing to do with the topic of the article. UTF-8 in the console is a topic for Michael Kaplan, not me, and I believe he has written about it rather extensively already.) -Raymond]
  27. Azarien says:

    Leaving UTF-8 aside, even UTF-16 support sucks.

    Why do I have to bother with locales when I'm already using *w*cout, *w*string and *w*char_t, which are supposed to use UTF-16 on Windows?

    Yes, I know there are some typographical issues between e.g. Hong Kong and Taiwan locales, but having a dumb "C" locale that does not work at all as a default is not a solution.

  28. Karellen says:

    Why would IsDBCSLeadByte() ever return true under UTF-8, given that UTF-8 is not a DBCS, and therefore no bytes in a UTF-8 stream are DBCS lead bytes? Given that UTF-8 is self-synchronising, it would seem that IsDBCSLeadByte() is unnecessary in such a locale anyway, and that MultiByteToWideChar() should "just work".

    [If IsDBCSLeadByte always returns false, then code that parses 8-bit strings will assume one character = one byte. This will result in weird things like code that tries to wrap long lines at 80 characters wrapping at only 20; or code that tries to remove the last character from a string chopping off a partial character, leaving invalid utf-8 behind. -Raymond]
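
    (To illustrate the failure mode Raymond describes, here is a hypothetical last-character trimmer, a sketch rather than anything from real code: under a DBCS code page, IsDBCSLeadByte lets it step over two-byte characters, but if IsDBCSLeadByte always returned FALSE for a UTF-8 "ANSI" code page, the final position would land on the last byte of a multi-byte sequence and the chop would leave invalid UTF-8 behind.)

        #include <windows.h>

        // Hypothetical helper, for illustration only.
        void RemoveLastCharacter(char* s)
        {
            if (*s == '\0') return;
            char* last = s;                    // start of the last full character seen so far
            for (char* p = s; *p != '\0'; ) {
                last = p;
                // Step over one character: two bytes if this is a DBCS lead byte
                // (and a trail byte follows), otherwise one byte.
                p = (IsDBCSLeadByte(static_cast<BYTE>(*p)) && p[1] != '\0') ? p + 2 : p + 1;
            }
            *last = '\0';                      // chop the string at the start of the last character
        }
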
  29. Dan says:

    @Karellen: I think the question is, why do iostreams convert wchar_t* strings from UTF-16 to ANSI and back to UTF-16, losing information, instead of just writing UTF-16 to the console directly?

  30. @Raymond says:

    Referring to Michael Kaplan's blog would be much more useful if it were still online.

    As is, most searches lead one to a not-found page at Microsoft.

    Only combining that with the internet archive helps:…/michkap

    Still, I had some small hope that the MS C++ implementation had gotten better with UTF-8 by now.

  31. Karellen says:

    @Azarien: "Why do I have to bother with locales when I'm already using *w*cout, *w*string and *w*char_t, which are supposed to use UTF-16 on Windows?"

    Because locales are about a lot more than your character set. Like, which language is used by the OS/apps, which is kind of important if you want things to be usable by people who speak a different language from the author. Or how the digits in large numbers are grouped, and which character is used as a decimal mark for non-integers. Or how to format/parse dates like 01/02/03.

  32. laonianren says:

    While we're on the subject of utf8 brokenness…

    The WriteFile function returns the number of bytes written.  This has been true since Windows NT 3.1.  Lots of code depends on it.

    However, if you write to a console with the codepage set to 65001, WriteFile returns the number of *characters* written.

    MS refuse to fix this, yet (years later) there's still no mention of this feature in the WriteFile documentation. …/unicode-issues-with-writefile-and-in-the-crt
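
    (For illustration only, and not code from the thread: this is the conventional write-everything loop that relies on WriteFile reporting bytes. If the count comes back in characters instead, total no longer tracks the byte position, so the loop resends data it has already written or miscounts what remains.)

        #include <windows.h>

        // Hypothetical helper, shown only to illustrate the assumption.
        BOOL WriteAll(HANDLE h, const char* data, DWORD size)
        {
            DWORD total = 0;
            while (total < size) {
                DWORD written = 0;
                if (!WriteFile(h, data + total, size - total, &written, nullptr))
                    return FALSE;
                total += written;   // assumes "written" counts bytes, not characters
            }
            return TRUE;
        }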

  33. Joshua says:

    Wow, that WriteFile bug is /nasty/, and given that the CRT expects it to work, I'm wondering how my testing got as far as it did.

  34. John Doe says:

    In fact, any program that may read or write to a console, and which is Obviously™ well written to loop until all bytes are written, will fail ridiculously due to a Win32 bug that won't be fixed.  And the fact that it depends on the font tells even more: it's the console client that fails, not the console server.

  35. 640k says:

    A company which develops and sells the OS and the compiler/IDE, and is part of the programming language committee, doesn't have any excuse for that code to fail. Stop blaming others and just make it work.

  36. cheong00 says:

    @Wear: The nice thing about UTF-8 is that when only the lower 127-character ASCII range is used, it's exactly the same as the ASCII table. That means for most programmers who don't work at Microsoft, there would only be a need to call one API version, and we could ensure old programs continue to work.

    Now, since the API chose to use UTF-16, which requires prepending null bytes to characters in the low ASCII range, we can't just ditch the …A() API functions, or some old programs (or newer programs not written with the Unicode directive enabled) will fail.

  37. cheong00 says:

    @@Raymond: FYI, Michael Kaplan has created a new blog and moved all the old posts from the Internet Archive over there. No need to bother the Internet Archive for this one.

  38. Joshua says:

    Well, your reply on WideCharToMultiByte makes it possible to bring this back on topic. Let's pose a hypothetical question:

    WideCharToMultiByte doesn't work when the console code page is changed:

       SetConsoleCP(866) // Russian

       SetConsoleOutputCP(866) // Russian

       WideCharToMultiByte(CP_OEMCP, …, lpszConsoleOutputBuf, …)

       WriteConsole(…, lpszConsoleOutputBuf, …)

    Answer: The call to WideCharToMultiByte is wrong. Use

       SetConsoleCP(866) // Russian

       SetConsoleOutputCP(866) // Russian

       WideCharToMultiByte(GetConsoleOutputCP(), …, lpszConsoleOutputBuf, …)

       WriteConsole(…, lpszConsoleOutputBuf, …)

    This behavior leads me to think that OEMCP is good for almost nothing. The only *possible* use I can find is interpreting filenames on MS-DOS FAT disks (no long file names).
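
    (Purely as a sketch of the corrected sequence above, with the elided arguments filled in by an assumed fixed buffer and a hypothetical helper name, not code from the thread:)

        #include <windows.h>

        // Hypothetical helper: convert a wide string using the console's
        // output code page (rather than CP_OEMCP) and write it to the console.
        void WriteWideToConsole(const wchar_t* text)
        {
            HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
            char buf[512];
            int len = WideCharToMultiByte(GetConsoleOutputCP(), 0, text, -1,
                                          buf, sizeof(buf), nullptr, nullptr);
            if (len > 1) {
                DWORD written = 0;
                WriteConsoleA(out, buf, len - 1, &written, nullptr);  // len includes the null terminator
            }
        }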

  39. wen-xibo says:

    Dear Raymond: I had read your document about RealGetWindowClass at …/10110524.aspx.

    But unfortunately, I can't get the base class name from a simple superclass of the "BUTTON" class.

    My new class is very simple: it superclasses "BUTTON" and forwards all messages to the original class's window procedure. But RealGetWindowClass returns "MyButton", the same as GetClassName. I tested the program on XP and Win7 and got the same result.

    I want to debug into RealGetWindowClass, but I am not very good at using windbg. I hope there is a method to find the base class of a superclass.

    I need your help. Thank you very much.

  40. parkrrrr says:

    The entire discussion of UTF-8 vs. UTF-16 is very English-centric. Half the people on Earth speak a language that requires, on average, three or more bytes per codepoint in UTF-8 vs approximately two bytes per codepoint in UTF-16. Thus the footnote in my original comment: UTF-8 looks sane if your language (human, not programming) only uses the Basic Latin charset. It looks a lot less sane if you use pretty much anything else.

    And any argument that UTF-8 is easier to deal with is going to have to somehow square that with the fact that pretty much no website on the planet manages to render the ubiquitous U+2019 properly.

  41. Joshua says:

    @parkrrr: Comparing byte density of letter symbols to word symbols is not fair. Three bytes per word is still much better than English.

  42. alegr1 says:


    Cyrillic, Armenian, Georgian, Hindu, Tagalog,…

  43. frenchguy says:

    Actually, any language that uses a script other than Latin, except Chinese. Non-English Latin-script languages probably average a little over 1 byte per character in UTF-8.

  44. @parkrr says:

    There's a small sliver where UTF-16 has denser storage than UTF-8 by 2:3, yes.

    But there's a much more important part where UTF-8 wins 1:2.

    Before you discount that, please take into account that even predominantly Chinese texts can and do often have snippets of ASCII too. But that is not the most important reason it is a false economy: textual computer protocols and data formats (HTML, HTTP, myriad others) use basic ASCII nearly exclusively. Thus even a pure Chinese HTML page will be no bigger (probably smaller) using UTF-8 than UTF-16. Anyway, it would not hurt if Chinese used double the number of bytes for each symbol English uses: equivalent text would still be shorter.

  45. j b says:


    I'm not getting it. What's the problem with rendering U+2019 (RIGHT SINGLE QUOTATION MARK)? I see it the way it is supposed to be all the time.

    And, what has the web site to do with the rendering? I leave the rendering of all characters, Unicode or not, to my local browser, not to the web site.

  46. Joshua says:

    @j b: Believe it or not, FTP is UTF-8 now, and this was done mostly without upgrading the old implementations. I'm pretty sure the Windows client broke but all the old UNIX clients and servers that don't know what Unicode is work just fine.

  47. John Doe says:

    @j b, they're plain 7-bit extended to 8-bit.  That is, the upper 0 bit is effectively useless communication overhead.  But not useless in the sense that current day machines all deal with octets at some point, so I guess low-level (wire/wireless level) communication should go like "Hey, the upper-level protocol is guaranteed to send 7-bit characters, let's ditch the extra bit for the next ?? , shall we?"

    Except that every protocol you can think of being purely 7-bit no longer is, due to server extensions, client extensions, charset content-type support, and what not.  Mainly because the wire protocols also guarantee octet transmission, in case that 7-bit protocol is actually only 7-bit for text content.

    And lower-level protocols are better off with fast (de)compression.

  48. j b says:


    Protocol elements like HTTP (HTML is not a protocol – not any more than, say, JPEG!), SMTP, FTP, … are *never* (at least as of today) UTF-8 encoded! They are plain 7-bit ASCII – not even 8-bit 8859-x! The only time you might run into a choice of how to represent these protocol elements would be in documentation. And then they are document contents, not protocol.

  49. Hm says:

    @Joshua "WideCharToMultiByte(GetConsoleOutputCP()"

    Combined with "MultiByteToWideChar(Windows.GetConsoleCP()" at the reading side, this seems to be the only correct way for console programs. When you type something like "cat file|find bla|find blub", the pipe for the redirected stdin/stdout channels has no agreed character semantics in itself. Because you already need to convert your input/output according to the actual console codepage, and probably don't want to detect the pipe case anyway, this would give both processes a common codepage. (If not this way, how else?)

    But is there any situation where GetConsoleCP() and GetConsoleOutputCP() may return different values for a new process? Given that there are no "character pipes" to connect console processes, the concept of different codepages for input and output seems very strange to me.

  50. j b says:


    I wasn't aware of that FTP extension (haven't been working with FTP at the protocol level for a while!), but I notice that the RFC 2640 extension applies to the file name parameter only, not to the protocol as a whole. Furthermore, RFC 2640 is still a "Draft Standard" – not that it means much, but it isn't mandatory at the same level as plain old FTP.

    Anyway, thanks for making me aware of this extension.

    (Sidetrack: Does anyone still use old 56 kbps lines on your side of the pond, letting the phone company steal the 8th bit for signalling? Or is that long-gone history today, even in the US?)

  51. j b says:

    @John Doe,

    In the old days of 56 kbps lines, the upper bit certainly wasn't "useless communication overhead" – it was what enabled the phone company to put you in contact with your communication peer. Every 6th sample, the signalling stole that bit for communication among the phone switches; that's why data communication couldn't make use of it, but the switches did. (In Europe, we never did that; for signalling between switches we had dedicated channels, so we could use the full 64 kbps capacity for user traffic, with full 8-bit data. Besides, we were much more in favor of dedicated digital networks, bit-oriented such as X.21 networks, rather than "reusing" the old phone networks.)

    Even though bit 8 today is always(?) zero, you cannot say that 7-bit protocols are "extended to 8-bit"! They were defined using 7-bit ASCII, and 7-bit ASCII implies that bit 8 is available for other uses, such as parity, or in the case of 56 kbps lines, signalling purposes. The protocol specification doesn't implicitly change just because currently, no one sees a use for bit 8. You can change the specification, such as with RFC 2640 for the file name parameter. But if e.g. someone suggests another extension using a command word containing characters outside 7-bit ASCII, it would for all practical purposes be unacceptable; old implementations would be unable to reliably parse the command and classify it as "not supported". (Note in RFC 2640 how you might have to insert NUL bytes in the file name to keep old implementations from breaking!)

  52. smf says:

    @j b

    "Even though bit 8 today is always(?) zero, you cannot say tha 7-bit protocols are "extended to 8-bit"! They were defined using 7-bit ASCII, and 7-bit ASCII implies that bit 8 is available for other uses"

    It doesn't make sense to treat data from a socket as anything other than 8 bit data. UTF-8 proponents have to deny that many .txt files have 8-bit data in them as well.

  53. Adam Rosenfield says:

    If you care about storage size, just compress it with your favorite compression algorithm and be done with it.  UTF-8 and UTF-16 compress pretty comparably.  The size argument is just not a valid argument these days on computers with GBs of memory and TBs of durable storage.  If you're truly processing GBs of text data, compression will do you far more wonders than changing from UTF-8 to UTF-16 or vice-versa.

    But all of the other arguments for UTF-8 over UTF-16 are still perfectly valid.

Comments are closed.