How do I convert a UTF-8 string to UTF-16 while rejecting illegal sequences?


By default, when you ask MultiByteToWideChar to convert a UTF-8 string that contains illegal sequences (such as overlong sequences) to UTF-16, it will try to muddle through as best it can. If you want it to treat illegal sequences as an error, pass the MB_ERR_INVALID_CHARS flag. The call then fails (returns zero) and GetLastError reports ERROR_NO_UNICODE_TRANSLATION.
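
To make concrete what counts as an illegal sequence, here is a minimal portable sketch in C of a strict converter. It is not how MultiByteToWideChar is implemented, and the helper name utf8_to_utf16_strict and its return convention are made up for illustration. It simply rejects the usual categories of ill-formed UTF-8 (stray or missing continuation bytes, overlong encodings, encoded surrogates, and values above U+10FFFF) instead of muddling through.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not a Windows API.
 * Converts len bytes of UTF-8 at src into UTF-16 code units in dst.
 * Returns the number of code units written, or -1 if the input contains
 * an illegal sequence or dst (capacity cap, in code units) is too small. */
long utf8_to_utf16_strict(const unsigned char *src, size_t len,
                          uint16_t *dst, size_t cap)
{
    size_t i = 0, out = 0;

    while (i < len) {
        unsigned char b = src[i];
        uint32_t cp;    /* code point being decoded */
        uint32_t min;   /* smallest code point allowed for this length */
        size_t extra;   /* number of continuation bytes expected */

        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else return -1;  /* stray continuation byte, or lead byte F8..FF */

        if (extra > len - i - 1) return -1;       /* truncated sequence */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = src[i + k];
            if ((c & 0xC0) != 0x80) return -1;    /* not a continuation byte */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }

        if (cp < min) return -1;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1;  /* encoded surrogate */
        if (cp > 0x10FFFF) return -1;                 /* beyond Unicode range */

        if (cp < 0x10000) {
            if (out + 1 > cap) return -1;
            dst[out++] = (uint16_t)cp;
        } else {
            /* Code points above the BMP become a surrogate pair. */
            if (out + 2 > cap) return -1;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
        i += extra + 1;
    }
    return (long)out;
}
```

With this sketch, the overlong two-byte encoding C0 80 is rejected with -1, whereas a well-formed sequence such as E2 82 AC converts to the single code unit 0x20AC. On Windows you would not write this yourself; passing MB_ERR_INVALID_CHARS asks the system to do the rejecting for you.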

The MSDN documentation on this subject is, to be honest, kind of hard to follow and even includes a double negative: "The function does not drop illegal code points if the application does not set this flag." Not only is this confusing, it doesn't even say what happens to illegal code points when you omit this flag; all it says is what it doesn't do, namely that it doesn't drop them. Does it set them on fire? (Presumably, if you omit the flag, then it retains illegal code points, but how do you retain an illegal UTF-8 code point in UTF-16 output? It's like saying of a function like atoi, "If the value cannot be represented as an integer, it is left unchanged." Huh? The function still has to return an integer. How do you return an unchanged string as an integer?)

Comments (11)
  1. C says:

    Illegal code points get the Old Yeller treatment.

  2. Adam Rosenfield says:

    This was starting to look like another edition of "Raymond explains something which MSDN documented poorly" post until I hit the "Does it set them on fire?" bit.  I love your dry humor, Raymond.

    In a test that I did, it looks like it converts illegal sequences into U+FFFD (REPLACEMENT CHARACTER), and it does it as soon as it detects an illegal sequence. So the 5 bytes F9 AA AA AA AA (which would code the invalid code point U+1AAAAAA) get converted into 5 instances of U+FFFD, since F9 isn't a valid initial byte. The 3 bytes ED A0 B4 (which would code the invalid code point U+D834, a high surrogate) get converted into 2 instances of U+FFFD, since ED is a valid initial byte, but no valid code sequence begins ED A0 (they all lead to high surrogates), and the remaining B4 isn't a valid initial byte.

  3. Joshua says:

    I could make an argument that the replacement logic is wrong because of the UTF resync logic, but I could make a stronger argument that not passing flag MB_ERR_INVALID_CHARS is always a bug.

  4. Joshua says:

    I could make an argument that the replacement logic is wrong because of the UTF resync logic, but I could make a stronger argument that not passing flag MB_ERR_INVALID_CHARS is always a bug.

    What gives about the comment dropping?

  5. Joshua Ganes says:

    @Joshua – I wouldn't say that calling it without MB_ERR_INVALID_CHARS is "always" a bug. There may be circumstances where you expect illegal code points (e.g. a third-party data source with less rigorous standards). In this case, you would want to do something sensible with the illegal code points while preserving as much valid data as possible. It would be nice if the documentation were clearer on this behavior.

  6. Calling it without the flag is always a bug, because it’s relying on undocumented behavior! :D

  7. R. Bemrose says:

    Clearly it sets them on fire because they're a Spy!

    And I play way too much Team Fortress 2.

  8. Worf says:

    The more fun way – strip off the high bit and treat as a char. Converting then is trivial.

    output = (uint16_t)(input & 0x7F);

  9. JM says:

    @worf: that's for people who never understood what this whole Unicode thing was about anyway, and don't see why you would need to pad ASCII characters with NUL bytes — but hey, if people want that we can sure provide it…

  10. And what do we burn apart from illegal code sequences? More illegal code sequences!!

  11. Tim says:

    Joshua: "[…] not passing flag MB_ERR_INVALID_CHARS is always a bug."

    If browser vendors followed your advice you would perceive the majority of the internet outside ASCII-only-land as a series of generic error messages. In general, content served by a web server is compiled from a number of different sources (HTML files, databases, plain text files, code, etc.) and quite often they do not agree on a common encoding. You can't just go ahead and play the "in a perfect world…" game with the user.

Comments are closed.
