Keep your eye on the code page: Is this string CP_ACP or UTF-8?


A customer had a problem with strings and code pages.

The customer has a password like "Müllwagen" for a particular user. Note the umlaut over the u. That character is encoded as the two bytes C3 BC according to UTF-8. When the customer passes this password to the LogonUser function in order to authenticate the user, the call fails, claiming that the password is invalid.

If we encode the ü as the single byte FC, then the call to LogonUser succeeds.

Therefore, if the string is in UTF-8 form, it needs to be converted, and to do this we use the MultiByteToWideChar function, passing CP_UTF8 as the first parameter. Once converted, the logon is successful.

The problem is that we are not sure whether the password being given to the application will encode the ü as C3 BC or as FC. If it arrives as FC, and we try to convert it with the MultiByteToWideChar function (again passing CP_UTF8), the ü is converted to U+FFFD, the Unicode replacement character.

If I take the FC-encoded string and convert it with the MultiByteToWideChar function, passing CP_ACP as the first parameter, then it converts successfully (no U+FFFD), and the call to LogonUser is successful.

In the application, the customer does not want to have to distinguish between the two cases or implement any retry logic or anything like that. Can you help us understand the issue, what we are doing wrong, and how we can fix it?

As the problem is stated, you are screwed.

You have a bunch of bytes, and you don't know what encoding they are in. The byte sequence C3 BC might be a UTF-8 encoding of ü, or it could be a CP_ACP encoding of the two characters Ã¼ (taking Windows-1252 as the ANSI code page). You are stuck guessing. But for something as important as passwords, you shouldn't guess. You need to know for sure, because an incorrect guess will generate audit entries, and may cause the user to become locked out of the account due to too many incorrect passwords.

This means that you need to make sure that whoever is passing you the string also tells you what encoding it is using.
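
To make the ambiguity concrete, here is a minimal sketch (not part of the original answer) that runs those same two bytes through MultiByteToWideChar twice, once as UTF-8 and once as Windows-1252, and gets two different strings back:

    #include <windows.h>
    #include <stdio.h>
    #include <wchar.h>

    int main()
    {
        const char bytes[] = "\xC3\xBC";   // two bytes with no encoding label attached

        wchar_t asUtf8[4] = {};
        wchar_t asAnsi[4] = {};

        // Interpreted as UTF-8: one character, U+00FC (ü).
        MultiByteToWideChar(CP_UTF8, 0, bytes, -1, asUtf8, 4);

        // Interpreted as Windows-1252: two characters, U+00C3 U+00BC (Ã¼).
        MultiByteToWideChar(1252, 0, bytes, -1, asAnsi, 4);

        wprintf(L"As UTF-8:        %d character(s), first is U+%04X\n",
                (int)wcslen(asUtf8), (unsigned)asUtf8[0]);
        wprintf(L"As Windows-1252: %d character(s), first is U+%04X\n",
                (int)wcslen(asAnsi), (unsigned)asAnsi[0]);
        return 0;
    }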

The customer liaison replied,

Thanks. I went back and talked to the customer, and it turns out that the password is always in UTF-8 form, so the problem is solved. We will always pass CP_UTF8 when converting the string.
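
Since the password is now known to always arrive as UTF-8, the conversion can be done unconditionally with CP_UTF8. A rough sketch of what that might look like, assuming a NUL-terminated UTF-8 password (the helper name and logon parameters are illustrative, not taken from the customer's code):

    #include <windows.h>
    // Link with Advapi32.lib for LogonUserW.

    BOOL LogonWithUtf8Password(LPCWSTR user, LPCWSTR domain,
                               const char* passwordUtf8, HANDLE* token)
    {
        // Ask how many UTF-16 code units are needed (including the terminating NUL).
        int cch = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      passwordUtf8, -1, nullptr, 0);
        if (cch == 0) return FALSE;        // not well-formed UTF-8, or other failure

        wchar_t* passwordWide = new wchar_t[cch];
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            passwordUtf8, -1, passwordWide, cch);

        BOOL ok = LogonUserW(user, domain, passwordWide,
                             LOGON32_LOGON_INTERACTIVE,
                             LOGON32_PROVIDER_DEFAULT, token);

        SecureZeroMemory(passwordWide, cch * sizeof(wchar_t));  // scrub the plaintext copy
        delete[] passwordWide;
        return ok;
    }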

Comments (34)
  1. anonymouscommenter says:

    It amazes me how many developers still do not have a basic understanding of character encodings.  I imagine the customer was hoping there was a RandomGarbageToWideChar function that would just wave a magic wand over the bytes and convert them into the Right Thing for any given scenario.

  2. anonymouscommenter says:

    Similarly, it's amazing how developers don't understand that any bunch of bytes doesn't have any meaning unless there's some additional data describing the format it's in (such as the HTTP/MIME/etc. Content-Type header). This is just as true for various texty-type bunches of bytes as it is for graphical-type bunches of bytes. Reading a UTF-8 string assuming it's in Windows-1252 encoding makes exactly as much sense as reading a PNG image assuming it's in JPG encoding. And while in either case it's possible to make reasonable guesses some (or even most) of the time by looking at the bytes, it really makes much more sense to just carry the information about the type along with the bytes, since that way you know for sure.

    I sometimes wish that UTF-8 shifted the US-ASCII letters by one position or something, just so it'd be obvious all the time that it's being read in the wrong encoding, rather than developers assuming everything's working just because they'd never tried "one of the funny 'special' characters". All characters should be "special".

  3. Medinoc says:

    The huge boon for UTF-8 is that it can be validated, and that most UTF-8 multibyte sequences, when interpreted as Windows-1252, give sequences that are not meaningful in any human language.

    Which means UTF-8 multibyte sequences are highly unlikely to appear in most CP_ACP text, which means [i]if it validates as UTF-8, it can be relatively safely assumed to be so[/i].

    Of course it's probabilistic, but hey, so are random GUIDs and other practices like identifying a file by its hash.
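
    A minimal sketch of the heuristic described here, using the MB_ERR_INVALID_CHARS flag so that MultiByteToWideChar rejects byte sequences that are not well-formed UTF-8 (editorial illustration, not Medinoc's code; the helper name is made up):

        #include <windows.h>

        // Heuristic only: if the bytes survive strict UTF-8 validation, treat
        // them as UTF-8; otherwise fall back to the system ANSI code page.
        UINT GuessCodePage(const char* bytes)
        {
            int result = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                             bytes, -1, nullptr, 0);
            return result != 0 ? CP_UTF8 : CP_ACP;
        }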

  4. anonymouscommenter says:

    @pc And how do you specify/encode the encoding? At some point, some code is going to have to make an assumption.

  5. anonymouscommenter says:

    Is he really using an "ANSI" variant of that LogonUser function, instead of the "Wide" variant? The Windows 9x series died a long time ago. Back then, it made sense to use the "ANSI" functions since they worked in both 9x and NT, but nowadays it's much better to just use ...W functions everywhere. If you want an 8-bit encoding, use UTF-8 and convert it to UTF-16 when calling Windows APIs.

    (Of course, that doesn't prevent all problems; the same way the Unix API is actually "a sequence of 8-bit numbers", the Windows API is actually "a sequence of 16-bit numbers", so you might have to deal with unpaired surrogates, which can't be converted to valid UTF-8. That's why the WTF-8 encoding exists, it can round-trip unpaired surrogates just fine.)

  6. anonymouscommenter says:

    @M: It depends.  GUI elements should be handing you data in a specified format.  Stdio is less predictable, assuming you want to be as portable as possible.  Anything over HTTP should have a declared encoding in both the headers and (for HTML) the meta tags.  Other situations have their own rules.

    But opening a TXT file is a crapshoot no matter how you slice it.  You'll need out-of-band information to make that work (like HTTP headers).

  7. anonymouscommenter says:

    @M: That's largely my point, that any system which transfers bytes without also specifying the meaning of those bytes (part of which is a character encoding if those bytes represent something texty) is a buggy system. The bytes don't mean anything without that. Something like the classic Windows convention of "a file ending in '.txt' contains only human-readable text" is wholly insufficient, as it does not convey enough information about the contents to be able to read it; it's just a meaningless bunch of bytes without some other out-of-band information or agreement about the contents of the file.

    And yes, often the name of the encoding is itself encoded in US-ASCII (such as in HTTP's Content-Type header), which is fine because that's what the specification all parties are using say. A specification describing how a program is getting data saying "all text is in UTF-16" or "all text is in EBCDIC" is perfectly fine as well, since that contains the out-of-band information about how to interpret the data. A specification saying a program gets "plain text from STDIN" is meaningless.

  8. anonymouscommenter says:

    Mandatory reading on the subject: joelonsoftware.com/.../Unicode.html

  9. Muzer_ says:

    There is, of course, the important caveat that, depending on what you want to do with the text, there are some instances in which you don't need to know its character encoding. For example, as long as you know it's encoded in (an 8-bit encoding or UTF-8), there are a lot of things you can do with text without actually having to parse it.

  10. anonymouscommenter says:

    @Cesar Depends. If you're stuck using garbage programming languages to update garbage applications, you may be stuck with no Unicode support, or support that would take aeons to implement.

  11. anonymouscommenter says:

    @Cesar That's what they are doing, note the use of "MultiByteToWideChar".

  12. Dave Bacher says:

    Often what happens with this is a program starts off on a US English system, then heads to some other system.  The programmers never considered this issue when they were designing, and it never comes up during testing.  Then a customer from another country purchases the program, and it has issues -- and so the developer is then stuck with "what can we do here that doesn't involve writing a lot of code?"

    I think it's safe to assume -- from the description, etc. -- that this is likely what happened in this case, too.  They sent it out, said "WTF, why is this failing?", traced it to a character-set issue, and were looking for something they could do in just one spot to fix it, versus having to fix it in n spots, in already-deployed apps.

    That's the big issue here usually -- it's no big deal to go to the server, and make a change at the password function (* by no big deal, I mean "changing devices you control").  Changing a client, however, can cause problems depending on circumstance (e.g. if you have to log in for the app to self-patch, a somewhat common scenario in some apps, then not being able to login would complicate delivery of the patch).  You have users circumventing patches, or unpatched apps connecting up and attempting to authenticate -- and so if you didn't plan on this from the get go, it gets really ugly really fast.

    And it's something that at least our local colleges still do not require in the Computer Science classes.  However, they still require people who are going to graduate and write web back ends for line of business apps to have ridiculous amounts of math and engineering, because clearly that's much more important than basic data integrity or, you know, design/architecture.  It's not important at all for me to be able to communicate requirements to coworkers, as long as they can solve double integrals.

    Or, and I'm not asking for a ton here, but basic use of a debugger (which is also not taught, apparently).

  13. anonymouscommenter says:

    I have a few rules of thumb for encodings:

    1. When working with pre-existing file formats and protocols, carefully read what encodings they specify.

    2. If a pre-existing format or protocol has a means of specifying encoding: (a) consume any IANA-registered encoding; (b) produce UTF-8 unless explicitly overridden by the user.

    3. When designing new formats or protocols, use UTF-8 exclusively and note this in the specification.

  14. anonymouscommenter says:

    I'm always surprised at how web pages manage to mangle my last name. I can understand č being interpreted as č (or just Ä), but how do Microsoft's pages manage to encode it as è (which implies it's converted to CP1250 somewhere in their systems, then read as CP1252)?

  15. anonymouscommenter says:

    In the unlikely event that I'm communicating with an external system using a protocol or format that actually specifies an encoding (like XML), it's usually meaningless because the encoding is hard-coded as part of a wrapper that knows nothing of the data being sent.

    So a user will enter some text in their native encoding into a GUI app, which gets sent to a server that stores it as a byte string in a DB. Then I ask a web service for data from the DB and the web service tells me the data is UTF-8, when in fact it has no idea what encoding was used when the data was entered!

    I also have problems with systems that store times in the user's local time zone but tell me that it's in the server's current local time zone.

  16. anonymouscommenter says:

    @Dave Bacher

    MODERN encoding issues scream one of two things:

    1) The developer has no clue how to write software.

    2) The developer may actually be racist/nationalist, consciously or not.

    You'd be surprised, or maybe not, how often 2 is actually the case -- and not just for Western speakers/coders.

    Assuming that everyone in the entire world speaks your language, and that ONLY your language speakers will ever use anything in your country, is pretty racist/nationalist. Even if unintentionally so.

    Software that doesn't properly handle Unicode or foreign encodings is usually not Accessible either -- it uses Red/Green all over the place for Bad/Good, doesn't expose UIA properties for screen-readers, makes use of seizure-inducing flashes to get the user's attention, etc. etc.

    I can accept most of these problems in software from the Pre-2000s, but we've known that we NEED to handle this for 30 years, and have had EASY ways to handle all of this for at least a decade.

  17. anonymouscommenter says:

    "I sometimes wish that UTF-8 shifted the US-ASCII letters by one position or something, just so it'd be obvious all the time that it's being read in the wrong encoding, rather than developers assuming everything's working just because they'd never tried "one of the funny 'special' characters". All characters should be "special"."

    Well, that would kill one of the pillars of UTF-8's design: being an extension of ASCII, so that all ASCII text is valid UTF-8. Software that does not care which ASCII extension is in use (and there is a lot of it) would no longer silently work with UTF-8.

    That's actually also the reason UTF-8-BOM is an abomination...

  18. anonymouscommenter says:

    @pc:  Very good points.  We are able to mis-read UTF8 as ASCII quite often (I think I'm saying the right thing) without paying any penalties.  That's probably a bad thing for internationalization.

  19. anonymouscommenter says:

    If you are using VS.NET, this addon can help:

    visualstudiogallery.msdn.microsoft.com/540ac2d8-f881-4794-8b00-810d28257b70

  20. anonymouscommenter says:

    @Vahid

    Mangling the file by removing the BOM doesn't help.

    If an application doesn't recognise a BOM, the broken application should be fixed to recognise the BOM.

  21. Myria says:

    I wish that programs could set their code page to CP_UTF8, so that the *A versions of the Win32 API took UTF-8 strings instead of (typically, for this audience) Windows-1252.  This would make portability easier, since the other major OS's all use UTF-8.  In fact, the only native-code systems that use UTF-16 that I can think of are all from Redmond: Windows NT (+ XBone), 3DS, Wii U.

    [That would be using a global solution to a local problem. Suppose you set your app to CP_UTF8 (i.e., all *A functions use UTF-8 instead of CP_ACP). Then some code in a DLL executes, and it can't handle characters that take more than 2 bytes to encode and it blows up. Theoretically, there could be a *U8 version of every function that takes UTF-8, but that's also something that could be handled by the application, similar to unicows. -Raymond]
  22. anonymouscommenter says:

    @Yuri Khan: I'd like to append:

    4. When writing an application, create an operating system abstraction layer.  Everywhere within the application except two kinds of places--this OS abstraction layer, and any module that must work with files or data that may be of another encoding--use precomposed UTF-8.  Thus, your application "thinks" in precomposed UTF-8, and only ever uses anything else at the interprogram boundaries.

    Your OS abstraction layer then translates to and from the appropriate encoding when calling into the OS system calls.  For Windows, that means doing UTF-8 <-> UTF-16 translation.  For Mac OS X, that sometimes means doing precomposed <-> composite translation within UTF-8, but usually this is not needed.  For most others, including Linux, no translation is needed.
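
    For the Windows side of such a layer, the UTF-8 to UTF-16 translation is essentially a thin wrapper around MultiByteToWideChar. A rough sketch of one direction (editorial illustration; the helper name is made up):

        #include <windows.h>
        #include <string>

        // Boundary helper: the application keeps std::string in UTF-8 and
        // converts to UTF-16 only when calling into Win32.
        std::wstring Utf8ToUtf16(const std::string& utf8)
        {
            if (utf8.empty()) return std::wstring();

            int cch = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                          utf8.data(), (int)utf8.size(), nullptr, 0);
            if (cch == 0) return std::wstring();   // invalid UTF-8; real code would report an error

            std::wstring utf16(cch, L'\0');
            MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8.data(), (int)utf8.size(), &utf16[0], cch);
            return utf16;
        }

    A call site would then look like CreateFileW(Utf8ToUtf16(path).c_str(), ...), so only the boundary layer ever sees UTF-16.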

  23. anonymouscommenter says:

    @Myria: I actually disagree about precomposed Unicode (more accurately, the Normalization Form C).

    Firstly, precomposed forms only exist for a handful of characters that existed in pre-Unicode encodings, as a backward compatibility measure. E.g. there are Latin vowel + diacritic composites, because those were in CP437. But there are no Cyrillic vowel + accent composites. Similarly, there are almost no precomposed forms with more than one diacritic. The decomposed form is simply more generic.

    Secondly and more importantly, automatically normalizing the user’s data, in any direction, is disrespectful. They might depend on the subtle encoding difference (e.g. to demonstrate a font rendering difference between precomposed and decomposed forms). They might be using versioning tools that will report spurious diffs between original and normalized files. (Good luck getting git diff to ignore normalization-only changes.)

    Of course, the application is free to perform temporary internal normalization for the purposes of text processing.

  24. anonymouscommenter says:

    @Myria

    Every person in the world that reads and manipulates a language other than English would rather that everyone switch to UTF-16, at least.

  25. anonymouscommenter says:

    Windows for Workgroups did try all passwords as UPPERCASE and Uppercasefirstletter and lowercase when logging on. Why can't NT do this also? Please add the code back (Ctrl+C, Ctrl+V).

  26. anonymouscommenter says:

    That's where you are wrong. Since UTF-16 is no longer UCS-2, there's no advantage left to using it instead of UTF-8. Quite the opposite even.

    Now, normalizing without need, that can be bad...

  27. Alex Cohn says:

    The comments above criticize the way the client side has been implemented, and many are very true. But there is the other side to this specific situation, the server side. The password verification procedure.

    Why should it require "Latin letters and digits, at least 7 characters but no more than 9 characters"? How often do we see such restrictions?

    Also, again and again, we see that the system will not recognize the password if it was typed by mistake with wrong keyboard language layout, or Caps Lock. Why, for goodness sake? Does accepting mYcAMELtYPEsECRET or ЬгСфьудЕгзуЫускуе really compromise system security?

  28. anonymouscommenter says:

    @Alex Cohn:

    > Why should it require "Latin letters and digits, at least 7 characters but no more than 9 characters"?

    Having a restriction on maximum length and allowed characters is a sure sign that there's something wrong with the way passwords are stored (passwords should always be stored as a salted one-way hash, and thus not have any length requirements).

    > Also, again and again, we see that the system will not recognize the password if it was typed by mistake with wrong keyboard language layout, or Caps Lock. Why, for goodness sake? Does accepting mYcAMELtYPEsECRET or ЬгСфьудЕгзуЫускуе really compromise system security?

    When you store passwords properly (as a hash), there's no way to know that ABC and abc are the same password - their hashes are different. Even more so for different keyboard layouts.

  29. Medinoc says:

    "Theoretically, there could be a *U8 version of every function that takes UTF-8" I think there *should*, and haven't given up hope that there might be someday, and this is why it's indispensable *not to ditch the TCHAR*. The TCHAR is the key to lessening the pain of transitioning to UTF-8 versions of each function.

  30. anonymouscommenter says:

    I think @640k is on to something. We should really just start entering the user's password for them, that way they don't need to remember casing at all.

  31. Medinoc says:

    Alex Cohn might be on to something too: Instead of memorizing the password itself, why not memorize the coördinates of the keys used in typing it?

  32. Alex Cohn says:

    @ender: I have no problem with the fact that password length and composition are regulated; the problem is that too often these regulations are crazy. If a password will be typed by humans, make your software compensate for the most typical human confusions, not the other way around. With passwords the challenge is even greater, because the user cannot see what she types, and can only fix a mistake by retyping it all once again.

  33. Dan says:

    @Myria: I agree 100% with your wish to support UTF-8 as an ANSI code page.  If not for the Win32 API, then at least for the MSVCRT.  As it is, you can't even use Standard C functions like fopen or printf if your strings are in UTF-8.  That makes it *extremely* painful to write cross-platform code that needs to handle non-ASCII characters.

  34. Joshua says:

    [Theoretically, there could be a *U8 version of every function that takes UTF-8, but that's also something that could be handled by the application, similar to unicows. -Raymond.]

    That's not even the problem. The problem is the console borks itself on certain kinds of UTF-8 processing. I could solve the whole rest of it *today* if the console code in csrss.exe could really accept UTF-8. Trying to fix the rest of it w/o that ends up backing into recreating Cygwin.

Comments are closed.
