How do I convert an ANSI string directly to UTF-8?


A customer asked the following question:

Is there a way to convert an ANSI string directly to a UTF-8 string? I have an ANSI string which was converted from Unicode based on the current code page. I need to convert this string to UTF-8.

Currently I am converting the string from ANSI to Unicode (MultiByteToWideChar(CP_ACP)) and then converting the Unicode to UTF-8 (WideCharToMultiByte(CP_UTF8)). Is there a way to do the conversion without the redundant conversion back to Unicode?

There is no multibyte-to-multibyte conversion function built into Windows (as of this writing). To convert from one 8-bit encoding to another, you have to use Unicode as an intermediate step.
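
In code, the two-step conversion the customer describes looks something like this (a minimal sketch; error handling is reduced to returning an empty string, and AnsiToUtf8 is just an illustrative name):

    #include <windows.h>
    #include <string>

    // Illustrative helper: ANSI (current code page) -> UTF-16 -> UTF-8.
    // Returns an empty string on failure; real code would report the error.
    std::string AnsiToUtf8(const std::string& ansi)
    {
        if (ansi.empty()) return std::string();

        // Step 1: ANSI -> UTF-16 using the current ANSI code page.
        int wideLen = MultiByteToWideChar(CP_ACP, 0, ansi.data(),
                                          static_cast<int>(ansi.size()),
                                          nullptr, 0);
        if (wideLen == 0) return std::string();
        std::wstring wide(wideLen, L'\0');
        MultiByteToWideChar(CP_ACP, 0, ansi.data(),
                            static_cast<int>(ansi.size()),
                            &wide[0], wideLen);

        // Step 2: UTF-16 -> UTF-8.
        int utf8Len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wideLen,
                                          nullptr, 0, nullptr, nullptr);
        if (utf8Len == 0) return std::string();
        std::string utf8(utf8Len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wide.data(), wideLen,
                            &utf8[0], utf8Len, nullptr, nullptr);
        return utf8;
    }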

Fortunately, one of my colleagues chose not to answer the question but instead responded with another question:

Is the data loss created by the initial conversion to ANSI really acceptable? Convert from the original Unicode string to UTF-8, and you avoid the potential mess introduced by the Unicode-to-ANSI conversion step.

The customer was puzzled by this data loss remark:

I'm using the same code page when converting from Unicode to ANSI as I am when converting from ANSI to Unicode. Will there still be data loss?

None of the code pages which Windows supports as an ANSI code page can express the full repertoire of Unicode characters. It's simple mathematics: Since one of the requirements for being an ANSI code page is that no single character can be more than 2 bytes, there simply isn't enough expressive power to encode all of Unicode. Now, if you're lucky, all of the characters you're encoding will exist in the ANSI code page, and they will survive the round trip, but that's just if you're lucky.
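
You can even observe the loss at conversion time: WideCharToMultiByte reports whether it had to fall back to the code page's default character. A small sketch (the sample string and hard-coded code page 1252 are just for illustration):

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        // U+00E9 (e-acute) exists in code page 1252; U+4E2D (a CJK
        // ideograph) does not.
        const wchar_t unicode[] = L"caf\u00E9 \u4E2D";
        char ansi[64];
        BOOL usedDefaultChar = FALSE;
        WideCharToMultiByte(1252, 0, unicode, -1, ansi, sizeof(ansi),
                            nullptr, &usedDefaultChar);

        // usedDefaultChar is now TRUE: U+4E2D was replaced by the code
        // page's default character, and no amount of converting back
        // will recover it.
        printf("lossy: %s\n", usedDefaultChar ? "yes" : "no");
        return 0;
    }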

It's like converting an image from 32-bit color to 8-bit color via the halftone palette. The palette is the "code page" for the conversion. Remembering to use the same palette when converting back is an essential step, but the result of the round trip will be a degraded image because you can't encode all 32-bit colors in a single 256-color palette. If you're lucky, all the colors in the original image will exist in your palette and the conversion will not result in loss of information, but you shouldn't count on being lucky.

The customer went on to explain:

Unfortunately, my code does not have access to the original Unicode string. It is a bridge between two interfaces, one that accepts an ANSI string, and another that accepts a UTF-8 string. I would have to create a new Unicode interface, and modify all existing callers to switch to the new one.

If all the callers are generating Unicode strings and converting them to ANSI just to call the original ANSI-based interface, then creating a new Unicode-based interface might actually be a breath of fresh air. Keep the poorly-designed ANSI interface around for backward compatibility, so that callers can switch to the Unicode-based interface at their leisure.

Bonus chatter: Even the round trip from ANSI to Unicode and back to ANSI can be lossy, depending on the flags you pass regarding use of precomposed characters, for example.
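
For example, here is a sketch of one such lossy round trip, assuming the documented behavior of the MB_COMPOSITE flag (request decomposed characters) and code page 1252:

    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    int main()
    {
        const char ansi[] = "\xE9";  // e-acute as a single byte in code page 1252

        // ANSI -> UTF-16, asking for decomposed (composite) characters:
        // 0xE9 comes back as U+0065 U+0301 ('e' + combining acute accent).
        wchar_t wide[8];
        int n = MultiByteToWideChar(1252, MB_COMPOSITE, ansi, -1, wide, 8);

        // UTF-16 -> ANSI without WC_COMPOSITECHECK: the two code points are
        // converted separately, so the round trip does not reproduce the
        // original single 0xE9 byte.
        char back[8];
        WideCharToMultiByte(1252, 0, wide, n, back, sizeof(back),
                            nullptr, nullptr);
        printf("round trip %s\n", strcmp(back, ansi) == 0 ? "preserved"
                                                          : "changed the string");
        return 0;
    }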

Comments (19)
  1. Michael Mol says:

    Geh. Module interface barriers. Time to put deprecation notices in the header file, put the new calls in, and replace the old calls with wrappers.

    Wouldn't stop me from hearing complaints about build warnings for a while, but that problem tends to resolve itself.

  2. Gareth says:

    When you say "There is no multibyte-to-multibyte conversion function", don't you mean single-byte to single-byte conversion?

    Also when you say "Since one of the requirements for being an ANSI code page is that no single character can be more than 2 bytes", don't you mean more than 1 byte?

    I'm hoping these are mistakes because I thought I understood this subject; now I'm not so sure.

  3. laonianren says:

    @Gareth.  ANSI code pages can be either single-byte or multi-byte.  For example, code page 1252 (used in the USA) uses one byte for each character, but code page 950 (traditional Chinese) uses one or two bytes for each character.

    Some code pages (e.g. code page 54936 – GB18030) can use more than 2 bytes per character but they can't be selected as "the ANSI code page" (i.e. the system default).

  4. Tim says:

    Assuming the questioner was using a specific single-byte code page (for example 1252) it is trivial to write a routine which converts an ANSI string directly to UTF-8. You need a lookup table containing the UTF-8 representations of the high-bit characters, and it's done.
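
    A sketch of the direct approach Tim describes, with the 256-entry table built once via MultiByteToWideChar rather than hard-coded (the class name is illustrative, and this only works for single-byte code pages):

        #include <windows.h>
        #include <string>

        // Direct single-byte-ANSI -> UTF-8 conversion via a 256-entry table
        // of code points. Only valid for single-byte code pages such as 1252;
        // multi-byte code pages (e.g. 950) still need the real API.
        class SingleByteToUtf8
        {
            wchar_t m_table[256];
        public:
            explicit SingleByteToUtf8(UINT codePage)
            {
                for (int b = 0; b < 256; b++) {
                    char ch = static_cast<char>(b);
                    wchar_t wch = 0xFFFD;  // fallback: REPLACEMENT CHARACTER
                    MultiByteToWideChar(codePage, 0, &ch, 1, &wch, 1);
                    m_table[b] = wch;
                }
            }

            std::string Convert(const std::string& ansi) const
            {
                std::string utf8;
                for (unsigned char b : ansi) {
                    unsigned int cp = m_table[b];  // always a BMP code point here
                    if (cp < 0x80) {
                        utf8 += static_cast<char>(cp);
                    } else if (cp < 0x800) {
                        utf8 += static_cast<char>(0xC0 | (cp >> 6));
                        utf8 += static_cast<char>(0x80 | (cp & 0x3F));
                    } else {
                        utf8 += static_cast<char>(0xE0 | (cp >> 12));
                        utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                        utf8 += static_cast<char>(0x80 | (cp & 0x3F));
                    }
                }
                return utf8;
            }
        };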

  5. Ry Jones says:

    You can convert the weather into unicode easily:

    http://weather.mar.cx/

  6. For those who are reading this article and are confused by the idea of "converting Unicode to UTF-8", I must point out that when a Microsoft employee says "Unicode" without further qualification, they mean UTF-16 LE.

    e.g., "Notead | Save As | Encoding" contains these options:

    ANSI

    Unicode

    Unicode big-endian

    UTF-8

  7. David Walker says:

    The original questioner needs to understand what mathematicians call the "pigeonhole principle".

  8. Yuhong Bao says:

    "For those who are reading this article and are confused by the idea of "converting Unicode to UTF-8", I must point out that when a Microsoft employee says "Unicode" without further qualification, they mean UTF-16 LE."

    Yep, don't forget MS decided to adopt Unicode for NT before UTF-8 even existed!

    Exercise: If UTF-8 was invented in 1992, why didn't NT 3.1, released in 1993, have UTF-8 support?

  9. Marquess says:

    My answer to the original question would be something along the lines of “libiconv.”

  10. Ben Hutchings says:

    Marquess: While some iconv implementations support arbitrary conversions, there is generally no requirement that they do so – and so far as I'm aware, those that do support them convert via UTF-32 if it is neither the source nor destination encoding.
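
    For reference, a rough sketch of the libiconv route Marquess mentions, assuming an iconv build that accepts "CP1252" as an encoding name and uses the POSIX prototype (some builds declare the input buffer as const char **, which would need a cast):

        #include <iconv.h>
        #include <stdio.h>
        #include <string.h>

        int main()
        {
            char in[] = "caf\xE9";       // e-acute in code page 1252
            char out[64] = {0};
            char *inp = in;
            char *outp = out;
            size_t inleft = strlen(in);
            size_t outleft = sizeof(out) - 1;

            // Encoding names vary between iconv implementations; "CP1252"
            // and "WINDOWS-1252" are common spellings.
            iconv_t cd = iconv_open("UTF-8", "CP1252");
            if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

            if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
                perror("iconv");
            } else {
                printf("%s\n", out);     // the same text, now as UTF-8 bytes
            }
            iconv_close(cd);
            return 0;
        }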

  11. Cheong says:

    Having a "superset" code page standing in the middle of conversion can greatly reduce the size of conversion library. (by reducing the pairs of codepages conversion table needed)

    For Unicode conversions you have the additional benefit of cleaner handling of varies decomposition forms.

  12. David says:

    Why didn't they write it themselves? UTF-8 is not that hard.

  13. Anonymous Coward says:

    ‘I would have to create a new Unicode interface, and modify all existing callers to switch to the new one.’ – From personal experience, that is easier than either dealing with lossy conversion issues or escaping.

  14. Worf says:

    UTF-8 is a compatibility version of Unicode – designed for 8-bit clean mediums but avoiding the one byte that can cause problems – NUL. UTF-16/32 are much easier to parse and handle, and you can get nice speed boosts by converting UTF-8 to UTF-16/32 ASAP if you know your medium can handle embedded NUL bytes.

    Navigating UTF-8 is a pain also if you have to go backwards and forwards.

    But, UTF-8 is great because most legacy systems can handle it with zero modifications – they don't have to be Unicode aware.
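
    On the navigation point: stepping backwards through well-formed UTF-8 only requires skipping continuation bytes (those of the form 10xxxxxx), e.g. something like:

        #include <cstddef>

        // Move from the start of one UTF-8 sequence to the start of the
        // previous one by skipping continuation bytes (10xxxxxx).
        // Assumes well-formed UTF-8 and pos > 0.
        std::size_t PreviousCharStart(const char* s, std::size_t pos)
        {
            do {
                --pos;
            } while (pos > 0 &&
                     (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80);
            return pos;
        }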

  15. Cheesle says:

    I really enjoy your stories, but I find this to be a bit ignorant:

    "Keep the poorly-designed ANSI interface around for backward compatibility, so that callers could switch to the Unicode-based interface at their leisure."

    Why do you say poorly-designed? Has it occurred to you that the interface may have been designed 15 years ago, and at the time it was up to the state of the art provided by its host OS?

    Old interfaces may not be easily replaced/duplicated, and in any case it depends on the availability of the source.

    Just because something may be old, it is not necessarily poorly-designed.

    As for UTF-8 being great, as stated by Worf… For English, yes; once you go beyond 1252, UTF-8 is not great.

  16. mdw says:

    @ Worf "UTF-8 is a compatibility version of Unicode – designed for 8-bit clean mediums but avoiding the one byte that can cause problems – NUL. UTF-16/32 are much easier to parse and handle, and you can get nice speed boosts by converting UTF-8 to UTF-16/32 ASAP if you know your medium can handle embedded NUL bytes."

    UTF-8 doesn't avoid NUL bytes at all.  Are you thinking of Sun's demented UTF-8 variant?

    For a new application, UTF-16 has only one discernible advantage over UTF-8: it's more compact at representing characters from some Asian languages (two bytes per character rather than three).  UTF-16 has the same variable-length encoding problems that UTF-8 has, only they happen more rarely so your code gets tested less well; but it doesn't have UTF-8's compatibility with old 8-bit string functions.  This goes much further than you might think: UTF-8 strings order lexicographically exactly as the corresponding sequences of code-points do, for example.  This is /not/ true of UTF-16, because characters outside the BMP are encoded with surrogate pairs, and the surrogate space isn't at the top of the BMP.
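
    A concrete instance of the ordering point (C++11 literals, purely for illustration): U+FFFD precedes U+1F600 in code-point order, and UTF-8 byte order agrees, but UTF-16 code-unit order does not, because the surrogates sort below 0xFFFD.

        #include <cassert>
        #include <string>

        int main()
        {
            // U+FFFD (EF BF BD) vs U+1F600 (F0 9F 98 80): UTF-8 byte-wise
            // comparison matches code-point order.
            std::string u8a = "\xEF\xBF\xBD";      // U+FFFD
            std::string u8b = "\xF0\x9F\x98\x80";  // U+1F600
            assert(u8a < u8b);

            // In UTF-16, U+1F600 is the surrogate pair D83D DE00, and
            // D83D < FFFD, so the order of the two strings flips.
            std::u16string u16a = u"\uFFFD";
            std::u16string u16b = u"\U0001F600";
            assert(!(u16a < u16b));
            return 0;
        }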

    Not all text processing jobs are faster on UTF-32.  UTF-8 has a compactness advantage — and therefore a locality advantage — on things that can be done in a single left-to-right pass (or a few of them).  There are other advantages to UTF-8 too.  Perl is no slouch at text processing; it uses UTF-8 as its internal representation.  I'm willing to bet that a good reason for this is that a lot of what it does is searching for substrings.  It uses the Boyer–Moore algorithm, which scans the needle string and builds a table: if I'm comparing with this character in the needle and I find this other character instead, I can skip over so many haystack characters because there's no hope of a match there.  In UTF-32 the tables would be enormous.  In UTF-8, it's still just 256 bytes per character position — and the search works fine on full Unicode.  Other text-processing algorithms — e.g., lexical analysers — which work by building and running a DFA need fancy table compression techniques if you're going to use UTF-32 (and handling the intermediate forms is probably awful).  In UTF-8, your DFA ends up being a little more complicated at the beginning but everything else is fairly tractable — and you can avoid the extra indirections from DFA table decompression.

    That's not to say that having a fixed-size per code-point isn't good for other jobs.  Yes, if you want to pick out the substring between characters 5 and 17 then UTF-8 sucks.  But for fixed size per code-point, UTF-32 is the only way to fly.

    Windows is stuck with UTF-16 because Unicode expanded after Microsoft had already decided to use two bytes per code-point, and UTF-16 is the compatibility path from UCS-2.  If they were starting now, I'd bet they'd choose UTF-8 rather than UTF-32, and UTF-16 wouldn't even have a chance.

    I think UTF-8 was a better choice from the beginning.  But at the time it was a sketch on Ken Thompson's napkin, for use in Plan 9 from Bell Labs.  I don't blame MS for not inventing UTF-8 themselves.  Firstly, Ken Thompson is a genius, and secondly he was designing a research-toy OS at the time.  UCS-2 was already specified; Microsoft took the conservative route, and it looked like a plausible choice at the time.  After all, Unicode was probably seen as somewhat risky at the time anyway; inventing a proprietary encoding risked being stuck with unpleasant interoperability problems (and accusations about undermining industry standards and all that, probably).  Does that answer Yuhong Bao's exercise question?

    I /do/ blame MS for producing a shiny new runtime system that calls a 16-bit quantity a `char' in 2002; it's just a lie.  They missed the opportunity to leave behind what (in retrospect) turned out to be a mistake.  At least they chose the default I/O encoding right.

    Pre-emptive reply.  The internal representation inside a `String' object is, or should be, opaque anyway.  If keeping it in UTF-16, or inventing a UTF-16 copy on demand, speeds up the FFI or anything else, then it can do that and nobody needs to care (benefits of a high-level RTS).  I don't care about `String'; I care about `char'.  Personally, I think making `char' an integer type at all was a mistake: Common Lisp has adapted to Unicode with hardly a hitch, because it had an abstract `character' type.  I'm guessing they just followed Java's lead on that one.  But now I'm /seriously/ risking getting Raymond annoyed with me, so I'll shut up now.

  17. Yuhong Bao says:

    mdw: I was not suggesting that MS would have invented UTF-8 themselves when I asked this question. I agree that it would not have been a good idea.

  18. Dude says:

    Arrays of UTF8 chars (aka string) is a pita.

  19. Dude says:

    Arrays of UTF8 chars (aka string) is a pita.

Comments are closed.
