Discussion of how to add UTF-16 support to a library that internally uses UTF-8


A customer was incorporating an external library that internally manages file names in UTF-8. They wanted to add UTF-16 support.¹ To avoid making a lot of changes to the library's internal data format, they figured it would be less risky to keep the internal format as UTF-8, convert the file names from UTF-16 to UTF-8 as they enter the library, and convert them from UTF-8 back to UTF-16 whenever the library needs to call out to Windows (for example, by passing the file name to the CreateFile function).

The customer wanted to know if there were any pitfalls to this approach. In particular, is it guaranteed that converting a UTF-16 string to UTF-8 and then converting back to UTF-16 will result in a string that is byte-for-byte identical to the original?

Shawn Steele replied that the conversion is reversible, provided that the original UTF-16 string is valid. He cautioned that sometimes people are under the false impression that a UTF-8-encoded string or a UTF-16-encoded string can contain arbitrary binary data. As a result, they end up passing things like unmatched high and low surrogates (for UTF-16) or improper continuation bytes (for UTF-8). There might also be incorrect substring or string concatenation algorithms which expect that a string can be chopped at any point and produce a meaningful result.
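
To make the pitfall concrete, here is a small sketch of my own (not from the original thread), assuming current Windows behavior, showing that a lone surrogate does not survive a default, non-strict round trip:

    #include <windows.h>
    #include <cassert>
    #include <cwchar>

    int main()
    {
        // A lone high surrogate: a legal array of WCHARs,
        // but not a well-formed UTF-16 string.
        const wchar_t original[] = { 0xD800, L'x', 0 };

        // Default (non-strict) UTF-16 -> UTF-8 conversion; on current
        // versions of Windows the bad code unit is replaced with U+FFFD.
        char utf8[16];
        int n = WideCharToMultiByte(CP_UTF8, 0, original, -1,
                                    utf8, static_cast<int>(sizeof(utf8)),
                                    nullptr, nullptr);
        assert(n != 0); // succeeds: the bad code unit is replaced, not rejected

        // Convert back to UTF-16.
        wchar_t roundTripped[16];
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, roundTripped, 16);

        // The round trip is not byte-for-byte identical to the original.
        assert(wcscmp(roundTripped, original) != 0);
        return 0;
    }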

He also pointed out that many characters have multiple encodings. For example, "Ä" can be encoded as the single code point U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS) or as the sequence of code points U+0041 (LATIN CAPITAL LETTER A) followed by U+0308 (COMBINING DIAERESIS).²
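
As a tiny illustration of my own: the two spellings really are different code unit sequences (and different UTF-8 byte sequences), even though they display identically:

    #include <cassert>
    #include <cwchar>

    int main()
    {
        // Precomposed form: U+00C4 (UTF-8: C3 84).
        const wchar_t precomposed[] = L"\u00C4";
        // Decomposed form: U+0041 U+0308 (UTF-8: 41 CC 88).
        const wchar_t decomposed[] = L"A\u0308";

        // Same visible character, different code unit sequences.
        assert(wcscmp(precomposed, decomposed) != 0);
        return 0;
    }

Converting either form to UTF-8 and back preserves it exactly; the conversion does not normalize one form into the other.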

The customer thanked Shawn for his advice. They had already encountered the second problem (known as normalization), but since the Windows file system does not perform normalization, they figured their program shouldn't do it either. They were a bit concerned about the issue with substrings and string concatenation and wondered whether this was a case where _mbscat_s should be used instead of strcat_s.

This led to an extended discussion about surrogate pairs, zero-width joiners, extended grapheme clusters, and why you should stop ascribing meaning to Unicode code points.

I stepped in and tried to return to the customer's question. All of these issues with substrings and concatenation and extended grapheme clusters are issues for the library itself, not for the UTF-16 wrapper the customer is building. If there are any problems in the library, they can raise them with the maintainers of the library.

The customer wanted a UTF-16 entry point to the library which forwards to the existing UTF-8 entry point. In that case, they can convert the incoming UTF-16 string to UTF-8 with WideCharToMultiByte and the WC_ERR_INVALID_CHARS flag, and return an appropriate failure if the string is not well-formed. If the string successfully converts from UTF-16 to UTF-8, then it will also successfully convert back from UTF-8 to UTF-16 with no loss of fidelity.³
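
Here is a sketch of what such a wrapper might look like (the Library_OpenFileUtf8 and Library_OpenFileUtf16 names are hypothetical, and error handling is abbreviated):

    #include <windows.h>
    #include <string>

    // Hypothetical existing UTF-8 entry point inside the library.
    bool Library_OpenFileUtf8(const char* fileNameUtf8);

    // Hypothetical UTF-16 wrapper: validate and convert at the boundary,
    // then forward to the existing UTF-8 implementation.
    bool Library_OpenFileUtf16(const wchar_t* fileName)
    {
        // WC_ERR_INVALID_CHARS makes the conversion fail on ill-formed
        // UTF-16 (for example, an unpaired surrogate) instead of silently
        // substituting U+FFFD.
        int len = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                      fileName, -1, nullptr, 0,
                                      nullptr, nullptr);
        if (len == 0) {
            // GetLastError() is ERROR_NO_UNICODE_TRANSLATION for bad input.
            return false;
        }

        std::string utf8(len, '\0'); // len includes the terminating null
        if (!WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                 fileName, -1, &utf8[0], len,
                                 nullptr, nullptr)) {
            return false;
        }

        return Library_OpenFileUtf8(utf8.c_str());
    }

Because the string was validated on the way in, the library can later convert its stored UTF-8 back to UTF-16 (with MultiByteToWideChar and MB_ERR_INVALID_CHARS) and recover the original string exactly when it needs to call, say, CreateFileW.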

¹ In the context of Windows, Unicode strings are encoded in UTF-16LE if not explicitly called out otherwise.

² I find it somewhat quaint that names of Unicode code points are written in all-caps.

³ The customer thanked both Shawn and me for our assistance, even though my contribution was basically to take the long discussion and focus the answer to the customer's actual problem. I confessed that I didn't add any information; I merely deleted the distractions. Shawn replied, "Yea, but you're better at saying things shorter than I can."

Comments (21)

  1. DWalker07 says:

    It's hard to keep technical discussions short. I have trouble keeping e-mails short, mainly because I anticipate possible counter-arguments and I try to address them in the middle of what I'm trying to say. I need to stop doing that.

  2. pc says:

    Deleting is often more useful than adding, especially if it accomplishes the objectives faster.

    I love this story of writing negative 2000 lines of code in a particularly productive week: http://www.folklore.org/StoryView.py?story=Negative_2000_Lines_Of_Code.txt

    1. CarlD says:

      @pc Nothing gives me greater pleasure in writing code than adding functionality while simultaneously removing code and increasing readability. That's always a highly productive day!

  3. dmitry_vk says:

    This reminds me of WTF-8 encoding (https://simonsapin.github.io/wtf-8/) which was invented because Windows allows ill-formed UTF-16 in file names and unix-originated software likes to use UTF-8 internally.

    1. Joshua says:

      Well, Unix software has to deal with filenames that are invalid UTF-8, so we're even.

      1. florian says:

        Not quite even, I think? Overlong UTF-8 sequences ("/" → 0x2F → 0xC0AF → 0xE080AF → ...) may cause havoc on *nix. There's no problem on Windows, where filenames with invalid UTF-16 sequences (such as unmatched surrogates) are harmless.

        1. Kakurady says:

          You can't actually code your *nix software to expect UTF-8 filenames, because, for example, ext2/3/4 allows any C-string as filename as long as it doesn't contain '/' (or so I heard), without regard to character encoding.

  4. Aur Saraf says:

    Next time, simply point them at http://utf8everywhere.org#windows

    1. Harry Johnston says:

      I'm dubious - that document ignores the fact that real-world software often has to be able to deal with malformed UTF-16 strings, e.g., Windows file names. (This might or might not be a problem for the software that was being discussed; it depends on the context.)

      Personally I'm of the opinion that filenames aren't text, and shouldn't be treated as text. You might want to convert them to text, e.g., to display them to the user, but internally they should remain 8-bit or 16-bit strings as appropriate to the OS.

    2. Ben says:

      What's so great about UTF-8? Sure it has some nice features, but *everywhere*?

      Java, JavaScript, C#, Visual Basic 4+, VBA, and VBScript all use UTF-16 as their internal representation. Modern C and C++ have UTF-16 string literals and UTF-16 strings. Why not UTF-16 everywhere?

      1. cheong00 says:

        That's because we live in a weird world where the most popular CPU type (x86-based) is little-endian, but pretty much everything else is big-endian. So if your data needs to travel across different system types, you need to cater for the UTF-16LE/UTF-16BE difference.

        On the other hand, UTF-8 is endianness-independent, so it'll save you some headache on that front.

        1. cheong00 says:

          http://unicode.org/faq/utf_bom.html#utf8-2
          Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian?
          A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing to do with byte order. [AF]

        2. Ben says:

          ARM is also little-endian and according to Wikipedia always was.

          PDP-11, IBM Power and Z series, and SUN Sparc were/are big-endian though, so as a consequence many network standards deriving from the unix and mainframe worlds specify big-endian format on the wire or on disk.

        3. GEO255 says:

          Oh, that's what the "LE" means. I thought it meant "Limited Edition"...

          1. D-Coder says:

            "Limited Edition" is for cars.
            "Little Endian" is for chars.

  5. Someone says:

    That page answers this question (which is really only about Unicode encoding, and not Windows-related) quite well:
    http://unicode.org/faq/utf_bom.html
    Normalization and such is irrelevant here. You just have to deal with 21 bits per character that can be encoded in two different ways.

  6. GWO says:

    I find it somewhat quaint that names of Unicode code points are written in all-caps.

    Hey, there's no point in assigning names to characters if you can't write those names using BCDIC :)

    1. ErikF says:

      Not having lower-case letters is part of a long and distinguished history: Latin and Ancient Greek didn't have them either (and Ancient Greek went one better and often didn't use vowels in written form!) Unicode wants to be distinguished too. :-)

  7. Someone says:

    "is it guaranteed that converting a UTF-16 string to UTF-8 and then converting back to UTF-16 will result in a string that is byte-for-byte identical to the original?"

    The misleading thing here is "byte-for-byte identical". Converting back and forth between UTF-8 and UTF-16 guarantees a transparent character-by-character transfer. But invalid Unicode strings may screw things up even if the library used UTF-16 internally.

  8. Alex Cohn says:

    If (or rather, because) some WCHAR strings are actually broken UTF-16, I would suggest some other encoding that does not rely on UTF compliance of the input, e.g., convert the file names to hexadecimal strings on input. This will solve the question of "\x00C4" vs. "\x0041\x0308": one file name will be represented as "00C4", the second as "00410308".

  9. French Guy says:

    What would it take for Windows to enforce UTF-16 compliance in its file names (no unpaired surrogates)?
