Why does misdetected Unicode text tend to show up as Chinese characters?


If you take an ASCII string and cast it to Unicode,¹ the results are usually nonsense Chinese. Why does ASCII→Unicode mojibake result in Chinese? Why not Hebrew or French?

The Latin alphabet in ASCII lives in the range 0x41 through 0x7A. If this gets misinterpreted as UTF-16LE, the resulting characters are of the form U+XXYY where XX and YY are in the range 0x41 through 0x7A. Generously speaking, this means that the results are in the range U+4141 through U+7A7A. This overlaps the following Unicode character ranges:

  • CJK Unified Ideographs Extension A (U+3400 through U+4DBF)
  • Yijing Hexagram Symbols (U+4DC0 through U+4DFF)
  • CJK Unified Ideographs (U+4E00 through U+9FFF)

But you never see the Yijing hexagram symbols, because that would require YY to be in the range 0xC0 through 0xFF, which is not valid ASCII. That leaves only CJK Unified Ideographs of one sort or another.

That's why ASCII misinterpreted as Unicode tends to result in nonsense Chinese.

The CJK Unified Ideographs are by far the largest single block of Unicode characters in the BMP, so by purely probabilistic arguments, a random character in the BMP is most likely to be Chinese. If you look at a graphic representation of which languages occupy which parts of the BMP, you'll see that it's a sea of pink (CJK) and red (East Asian), occasionally punctuated by other scripts.

It just so happens that the placement of the CJK ideographs in the BMP effectively guarantees it.

Now, ASCII text is not all just Latin letters. There are space and punctuation marks, too, so you may see an occasional character from another Unicode range. But most of the time, it's a Latin letter, which means that most of the time, your mojibake results in Chinese.
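
Here is a minimal sketch (not part of the original article) that demonstrates the effect: it reinterprets the bytes of an ASCII string as UTF-16LE code units, the way a mistaken "cast to Unicode" would.

    // A minimal sketch: reinterpret ASCII bytes as UTF-16LE code units.
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        const char ascii[] = "Hello, world";                // 12 ASCII bytes
        const std::size_t units = std::strlen(ascii) / 2;   // each UTF-16 code unit consumes two bytes

        for (std::size_t i = 0; i < units; ++i) {
            // UTF-16LE: the first byte is the low-order byte, the second is the high-order byte.
            const std::uint16_t unit =
                static_cast<std::uint8_t>(ascii[2 * i]) |
                (static_cast<std::uint8_t>(ascii[2 * i + 1]) << 8);
            std::printf("U+%04X\n", unit);
        }
        // Prints U+6548 U+6C6C U+2C6F U+7720 U+726F U+646C: all but U+2C6F (built from
        // the letter o and the comma) land in the CJK Unified Ideographs block.
    }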

¹ Remember, in the context of Windows, "Unicode" is generally taken to be shorthand for UTF-16LE.

Comments (34)
  1. Vitor Canova says:

    Too complex for me to understand, but good to see that there is a good explanation.

  2. Joel on Software's excellent treatment of character encodings and Unicode would probably help with understanding as well - http://www.joelonsoftware.com/.../Unicode.html

    Separately: I was under the impression that Windows "Unicode" was UCS-2, because UTF encoding didn't exist when NT was minted.  Is that inaccurate?

  3. Zan Lynx' says:

    It used to be UCS-2 back when Microsoft thought 65,536 characters was enough for everyone. It wasn't, though, so they had to change over to UTF-16LE. Which, in my opinion, was a horrible result. Now they have the worst of both worlds and none of the benefits.

    Well, maybe one benefit: programmers generally know right away if they forgot to convert ASCII to UTF-16, whereas with UTF-8 the program will go along just fine until it hits a land mine.

    UTF-16 still has programmers thinking that they can jump ahead 5 characters or trim a string at 80 characters just with array indexes though.

    And I only know one person (it isn't me!) who understands how to figure out how long a Unicode string is in display characters as opposed to code points (if I have those terms right) because there are weird things like combining diacriticals and non-displayed hyphens.
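
    A minimal sketch (not from the comment; it assumes a C++11 compiler) of why those counts differ - the same short string measured in UTF-16 code units and in code points, while counting user-perceived characters correctly still needs a grapheme-cluster library such as ICU:

        #include <cstdio>

        int main()
        {
            // "e" + U+0301 (combining acute accent), then U+1F4A9 (outside the BMP).
            const char16_t text[] = u"e\u0301\U0001F4A9";

            std::size_t units = 0, codepoints = 0;
            for (const char16_t* p = text; *p != 0; ++p) {
                ++units;
                // A low surrogate continues the previous code unit; don't count it again.
                if (*p < 0xDC00 || *p > 0xDFFF) ++codepoints;
            }

            std::printf("UTF-16 code units: %zu\n", units);       // 4
            std::printf("code points:       %zu\n", codepoints);  // 3
            std::printf("user-perceived characters: 2 (needs a grapheme-cluster library)\n");
        }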

  4. David Crowell says:

    Zan,

    Even the .NET framework doesn't properly handle indexing a string with anything other than 16-bit characters.

  5. dave says:

    >when Microsoft thought 65,536 characters was enough for everyone.

    "Microsoft" thought that because the Unicode consortium told them it was so.  For a short time in the early 90s, we lived in a world of bliss (as opposed to when I worked at DEC and lived in a world of BLISS) where there was one character encoding and all characters had the same size.

    I used to hold the opinion that 16-bit characters were appropriate, but UTF16 has convinced me that I am misguided. UTF8 is a better encoding.  But this author expresses it better than I can: http://www.theregister.co.uk/.../verity_stob_unicode

  6. Joshua says:

    I was not expecting Raymond to summon this one again.

  7. Nico says:

    @dave

    That Register article is fantastic.

  8. Cesar says:

    To make things worse, while on Windows "Unicode" is UTF16-LE (sizeof(wchar_t) == 2), on Unix it often is UTF32 (sizeof(wchar_t) == 4). It gets better with C11 and its separate char16_t and char32_t types, but given how long it took for Microsoft to adopt C99, I'm not hopeful.

    With UTF32, mojibake would be unlikely, since modern Unicode is (IIRC) 21-bit and most misdetected text won't contain NUL bytes. On the other hand, it's quite wasteful in its space use (while being simpler for not having to use surrogates for code points outside the BMP); it's no wonder Unix programmers tend to use UTF-8 instead (or, when using some portable frameworks like Qt, UTF-16).
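
    A minimal sketch (not from the comment; assumes a C++11 compiler) showing the width difference between the platform-specific wchar_t and the fixed-width C11/C++11 types:

        #include <cstdio>

        int main()
        {
            // wchar_t is 2 bytes on Windows (UTF-16) and typically 4 on Unix (UTF-32);
            // char16_t and char32_t are always one UTF-16/UTF-32 code unit wide.
            std::printf("wchar_t:  %zu bytes\n", sizeof(wchar_t));
            std::printf("char16_t: %zu bytes\n", sizeof(char16_t));
            std::printf("char32_t: %zu bytes\n", sizeof(char32_t));
        }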

  9. Raphael says:

    Screw it, I'm just gonna use UTF-32.

  10. Crescens2k says:

    I hold the opinion that the real issue with UTF16 is one of education more than anything else. Even before UTF8, even before they added surrogates and created UTF16, Unicode was not a fixed length encoding. The combining accents starting at U+0300 were in Unicode version 1. So from the start you should have seen even UCS2 and UCS4 as a variable length encoding.

    The real thing that changed between UCS2 and UTF16 was that instead of just being variable size in how many code points there were per display character, it also became variable in how many code units were used per code point. What was more problematic was that due to compatibility, they couldn't just throw the UCS2 encoding away but instead extend it somehow. This was why surrogates leaked in rather than a new encoding for 16 bit code units happened.

    The reason why I say, and always will believe, that this is more a matter of education than of the UTF16 encoding is that people somehow don't get properly taught the semantic changes between ANSI and Unicode, probably believing that even Unicode has a one-character-per-code-point mapping. Rather than that, new programmers get to Unicode because of a compiler error and get told "change char to TCHAR" and things like that. If they don't get taught that Unicode is different, then how are they to know that it is? If they were taught that Unicode is variable length, then they wouldn't get caught off guard by surrogates or by the concept of more than one code point per character.

  11. DispatcherLock says:

    @dave

    Which did you like better, Pillar or BLISS?

  12. Mark S says:

    Wow.  After today I have a clear picture of Unicode vs. UTF-8, UTF-16, and what's going on under the hood in Windows and .NET.  It isn't a pretty picture, but all those little fuzzy pieces have been resolved.  Lack of a time machine really bites MS again, doesn't it?  You'd think MS Research would have given it a higher priority.

  13. Azarien says:

    Treating UTF-16 as UCS-2 works well most of the time. Not to mention that most users don't really need ancient Egyptian hieroglyphs.

  14. Crescens2k says:

    @Azarien:

    Well, it isn't really that much of a choice unless you properly check for illegal UCS2 sequences, like the surrogate pairs, and show them as bad characters. But then you are suddenly not treating UTF16 as UCS2. And while most users don't need hieroglyphs, East Asian users could want to use plane 2 for the extra CJK characters that can appear in names. There are also other scripts from minor languages there, so it is very short-sighted to just ignore anything above the BMP if you plan to deal with international clients. There is also the fact that more and more symbols are creeping into plane 1, like the musical notation, and I am sure that even more dingbats will be going there. Also, let's face it, who wouldn't want to be able to properly handle the pile of poo character (U+1F4A9)?
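
    As a minimal sketch (not from the comment) of what properly handling that character takes in UTF-16: pair up the surrogates so that U+1F4A9 comes out as one code point rather than two bad characters (error handling for lone surrogates is omitted):

        #include <cstdint>
        #include <cstdio>

        int main()
        {
            const char16_t text[] = u"\U0001F4A9";   // stored as the surrogate pair 0xD83D 0xDCA9

            for (const char16_t* p = text; *p != 0; ++p) {
                std::uint32_t cp = *p;
                if (cp >= 0xD800 && cp <= 0xDBFF && p[1] >= 0xDC00 && p[1] <= 0xDFFF) {
                    // High surrogate followed by a low surrogate: combine them into one code point.
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (p[1] - 0xDC00);
                    ++p;
                }
                std::printf("U+%04X\n", static_cast<unsigned>(cp));   // prints U+1F4A9
            }
        }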

  15. Muzer_ says:

    @Cesar where does Unix use UTF-32? I have to say I've not come across it. Everything I've seen uses UTF-8 with a handful of programs in the same position as Windows, using UTF-16 for historical reasons (and usually failing at it - try to type a pile of poo in Konsole for example...)

  16. Cesar says:

    @Muzer: All the wcs* functions (standard wide char functions) in C and consequently std::wstring and friends in C++ use a 4-byte wchar_t, thus UTF-32. But yeah, as I mentioned Unix people tend to not use wide chars, they use UTF-8 most often, with Qt and Java programs (and a few others) mixing it with UTF-16 (as you said, for historical reasons; the consensus seems to be that UTF-8 makes more sense for new frameworks).

  17. Count Zero says:

    I think the "best" encoding for non-Asian users is UTF-8. (And it even works for Asian languages.) For mostly Latin text, UTF-16 is a waste of resources - not to mention UTF-32. Every one of us has seen those files in a hexdump with all those 0x00 bytes. That almost makes my eyes bleed every time.

    Their only benefit would be indexability, but combining characters (like U+0302, "add a circumflex to the preceding character") and code points above 0xFFFF render that property invalid. So you still can't get the 5ᵗʰ glyph of a string by reading its szSample[4] element. Those days are sadly over now, and our only solace is the increasing processing capacity that makes the more complex string operations affordable.

    By contrast, UTF-8 is more space efficient, which makes parsing faster too. The only sad thing is that UTF-8 came too late. The NT kernel was built with UTF-16 in mind, and so were lots of our developer tools. I can hope that some time in the near future Windows will have a chance to correct this.

  18. Crescens2k says:

    @Count Zero:

    I'm not going to say much about your opinion on what the best encoding is, since that is obviously subjective. The only thing I will mention is I feel you are wrong with one statement.

    "In contrary UTF-8 is more space efficient which makes parsing faster too."

    This would only really be true for code points in the ASCII range. Code points U+0080 and above need extra work in UTF-8, while with UTF-16 you don't have to worry about that until code points U+10000 and above. So while UTF-16 is less space efficient in the ASCII range, for code points outside the ASCII range it has less work to do per 16-bit code unit. Since it is then scanning two bytes at a time linearly through memory, apart from a somewhat higher probability of a cache miss, the processor will read the two bytes just as fast as it reads one.

  19. Count Zero says:

    @Crescens2k - I would happily agree if there weren't those pesky separate modifier characters that could (and will) extend the character width unpredictably. Since they are there and valid even in UTF-32 there is no guarantee that you can read an entire glyph with one operation. This makes code reading UTF-16 characters glyph-by-glyph (which IMHO is actually the way you need them) as complicated as code reading UTF-8 characters.

  20. Kevin says:

    @Crescens2k

    It depends.  UTF-8 can represent ASCII in one byte, while UTF-16 uses two.  OTOH, UTF-16 can represent all BMP characters in two bytes where UTF-8 uses three for U+0800 and up.  For non-BMP stuff, they're equivalent in terms of space usage.  So UTF-16 is only better in the U+0800 to U+FFFF range.  As it happens, this range appears to contain all of the CJK blocks as well as a number of other scripts.  Most European scripts are below U+0800, though.

    If you're doing HTML or anything else built out of ASCII markup characters, UTF-8 is usually more space efficient in practice.
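
    A minimal sketch (not from the comment) of that arithmetic, computing the encoded size of a single code point in each encoding; only the U+0800 through U+FFFF samples come out smaller in UTF-16:

        #include <cstdint>
        #include <cstdio>

        // Bytes needed to encode one code point.
        int utf8Bytes(std::uint32_t cp)
        {
            if (cp < 0x80)    return 1;
            if (cp < 0x800)   return 2;
            if (cp < 0x10000) return 3;
            return 4;
        }

        int utf16Bytes(std::uint32_t cp)
        {
            return cp < 0x10000 ? 2 : 4;   // surrogate pair above the BMP
        }

        int main()
        {
            const std::uint32_t samples[] = { 0x41, 0xE9, 0x950, 0x4E2D, 0x1F4A9 };
            for (std::uint32_t cp : samples) {
                std::printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes\n",
                            static_cast<unsigned>(cp), utf8Bytes(cp), utf16Bytes(cp));
            }
        }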

  21. kantos says:

    AFAIK, Windows' use of UTF-16 as a primary character encoding is purely historical. Were it not for the fact that, when the original NT was coded, Unicode specified the UCS (a 16-bit encoding) as the only serialization, MS would likely be using UTF-8. But NT predates UTF-8 by a few years, and switching now would break the golden rule of compatibility. (Although personally I would be in favor of having a set of UTF-8 APIs, I don't think it will ever happen - that, and the missing [non-existent and not planned to be implemented ever] CRT support for UTF-8.)

  22. Adam Rosenfield says:

    The whole argument about how many bytes the different encodings use for different types of scripts is pointless, because if you care at all about storage size, just compress it using your favorite compression method and be done with it.  Both UTF-8 and UTF-16 compress about equally well [1].

    [1]: utf8everywhere.org

  23. Crescens2k says:

    @Count Zero:

    The whole issue is that you are more likely to have to reconstruct code points from multiple code units with UTF-8. For example, say you come across an À. In both ways of encoding this, UTF-8 has to take more time: the precomposed code point is higher than 0x7F, so it takes two bytes, and in the combining form the combining grave accent is also above 0x7F, so it too takes two bytes. This means that UTF-8 parsers have to do more work.

    @Kevin:

    I'm not debating whether UTF-8 is more space efficient. I'm saying that everything outside of the ASCII range is more efficient to process in UTF-16 until you have to do something outside of the BMP.

  24. voo says:

    @Adam Rosenfield: That's a strange argument. Pretty obviously, if you're handling strings in your application you want to do something with them, from simple things like displaying them to searching or indexing them. None of those operations can be done on compressed text, which makes it bloody useless for anything but long-term storage - which is generally uninteresting for the vast majority of applications.

  25. Adam Rosenfield says:

    @voo: I'd wager that in the vast majority of applications, the difference in memory footprint between storing uncompressed text as UTF-8 vs. UTF-16 is rarely more than +/- a few MBs (or maybe a few tens of MBs) [citation needed], which shouldn't be a concern on modern hardware.  Text-heavy applications running on low-memory embedded or mobile devices might care, and certainly specialized applications like word processors and databases care, but business apps running on modern PCs really shouldn't care about those few MBs.

  26. Muzer_ says:

    The space argument is really not one that is sane to use nowadays except in specialist uses, either for or against UTF-8. The main argument for UTF-8 is that it can be used with exactly the same string functions people have been using for years, without having to mess about with multi-byte chars, and can also be passed to any old API call that expects a string of indeterminate encoding, as long as it doesn't do anything much more complicated with the text (like a "delete one character" type thing). UTF-8 really can be treated simply as an ordinary null-terminated string of bytes in a very large proportion of use cases, which is its main strength.

    Maybe this is less important for Windows users because of the long history of UTF-16 support, I don't know - I'm not a Windows programmer. But it's certainly a very good argument for its use on networks, the web, etc. - existing stuff won't break horribly when they see UTF-8.
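
    A minimal sketch (not from the comment) of that "just a string of bytes" property: UTF-8 data passing unchanged through the ordinary byte-oriented C string functions. strlen counts bytes rather than characters, but copying and searching for ASCII delimiters work as they always have:

        #include <cstdio>
        #include <cstring>

        int main()
        {
            // "José.txt" spelled out as UTF-8 bytes in a plain char array.
            const char name[] = "Jos\xC3\xA9.txt";

            char copy[32];
            std::strcpy(copy, name);                                 // byte-for-byte copy is fine
            std::printf("bytes: %zu\n", std::strlen(copy));          // 9 bytes for 8 characters
            std::printf("extension: %s\n", std::strrchr(copy, '.')); // searching for an ASCII byte is safe
        }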

  27. Count Zero says:

    @Adam Rosenfield - You say "the difference in memory footprint between storing uncompressed text as UTF-8 vs. UTF-16 is rarely more than +/- a few MBs (or maybe a few tens of MBs) [citation needed], which shouldn't be a concern on modern hardware." I say we are not really thinking in terms of memory or storage footprint nowadays; we are thinking in terms of bandwidth. And if you have ever tried to debug a server application and download a 4 GB log file - over a 3G mobile connection while you are on vacation (wasting your entire roaming data plan) - a file which would easily fit in 2 GB if it hadn't been encoded with a space-wasting encoding, you would know the struggle.

  28. Kevin says:

    @Adam Rosenfield: What about cache misses?  Even saving a few MB may be worth it when we're talking about an actively running program.

  29. Adam Rosenfield says:

    @Count Zero: Yes, bandwidth is an important consideration.  If you're transmitting a 4 GB log file, then you should absolutely be using compression in any case, either at the application level (compress before encoding) or at the protocol level (e.g. use a compressed Transfer-Encoding with HTTP such as by passing --compressed to curl, or use -C with scp, etc.).

    @Kevin: That's a good point.  You have all convinced me that the encoded size argument is not totally moot.  I'm a strong proponent of UTF-8 everywhere (in case that wasn't clear from the link I posted earlier), which in most cases wins the size comparison with UTF-16.  Only in the rarer case of dense Asian text could I be convinced that UTF-16 is a superior encoding.

  30. Joshua says:

    From the day that Unicode broke the 16 bit barrier and their promise, it was inevitable that an encoding of Unicode in char * would predominate. Indeed UTF-8 had not yet arisen when MS wrote NT; however UTF-1 was already published. Don't get me wrong, UTF-1 was a poor encoding; nevertheless it showed what should have been. Once having seen UTF-1, the existence (although not the form) of a correct design was obvious.

    All who use UTF-16 as the primary format are in the state of being legacy code.

  31. Crescens2k says:

    @Adam Rosenfield:

    The thing is, as one smart person said, the actual performance of a program isn't truly known until you run it. I wrote a little program which created arrays of one-byte and two-byte elements and then did something per array element, to test how long the access would take.

    For array sizes below the ones listed, the times came back as 0 ms.

    Array of size 134217728 of element size 1: access took 15 milliseconds
    Array of size 268435456 of element size 1: access took 31 milliseconds
    Array of size 536870912 of element size 1: access took 47 milliseconds
    Array of size 1073741824 of element size 1: access took 94 milliseconds
    Array of size 67108864 of element size 2: access took 15 milliseconds
    Array of size 134217728 of element size 2: access took 16 milliseconds
    Array of size 268435456 of element size 2: access took 47 milliseconds
    Array of size 536870912 of element size 2: access took 109 milliseconds
    Array of size 1073741824 of element size 2: access took 172 milliseconds

    Running multiple times didn't change things too much. So the access time is bounded by the size of the data you are processing, not by cache misses. What's more, it doesn't really become noticeable until you get above a certain threshold.

    So the question is, how often do you have to process documents that have more than 67 million code points? Also, while in this case the only thing I did was touch the memory, how fast or slow is the actual processing in comparison with the memory access?

    With the linear time nature of the access, the encoded size argument is not, IMO, that strong either. It would take a lot to convince me that the extra ~90ms at the largest test size is really such a bank breaker.

    The more I read on the issue, the more I am convinced that the entire UTF-8 vs. UTF-16 debate is actually moot in most cases.
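
    A rough sketch of the kind of timing test described above (not the original program; it assumes a 64-bit build with enough free RAM for the largest sizes):

        #include <chrono>
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        // Allocate an array of `count` elements of type T and time one pass that
        // touches every element.
        template <typename T>
        void timePass(std::size_t count)
        {
            std::vector<T> data(count, T(1));

            const auto start = std::chrono::steady_clock::now();
            std::uint64_t sum = 0;
            for (T v : data) sum += v;        // "do something" per element
            const auto stop = std::chrono::steady_clock::now();

            const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
            std::printf("Array of size %zu of element size %zu: access took %lld milliseconds (sum %llu)\n",
                        count, sizeof(T),
                        static_cast<long long>(ms), static_cast<unsigned long long>(sum));
        }

        int main()
        {
            for (std::size_t count = std::size_t(1) << 26; count <= (std::size_t(1) << 30); count <<= 1) {
                timePass<std::uint8_t>(count);
                timePass<std::uint16_t>(count);
            }
        }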

  32. remis says:

    Guys,

    you can't imagine how "creative people" can use this "feature".

    3 years ago I found a suspicious index.html on some web site. It looked like valid HTML, but right after </html> there were many strange Chinese characters. Even though they were valid Chinese characters, the text was totally meaningless.

    Then I noticed that if I changed the encoding to some one-byte encoding, that HTML turned into Chinese characters, but those Chinese characters below turned into <script>a popular js virus at that time</script>.

    The antivirus program did not detect the virus in index.html.

    I submitted this to: http://www.microsoft.com/.../submissionhistory.aspx

    and they said "no infection detected".

    But imagine the possibilities, considering there is complex logic in the browser to choose the encoding for an HTML page when the HTML is not strictly formed: current locale, BOM header...

    On the other side, the antivirus also has some logic for detecting the locale when scanning that html file before the browser can access it.

    There is a chance that the antivirus and browser logic differ, and wow! Here is a way to bypass the antivirus and execute a malicious script inside the browser.

  33. remis (again) says:

    I didn't want to make a Day 0. I just wanted to make it Day -7.

    Dear Moderator, did you check it's safe with Malware Protection Center?

  34. remis (still alive) says:

    =Correction:

    On the other side, the antivirus also has some logic for detecting the locale when scanning that html file before the browser can access it.

    =Should be changed to:

    On the other side, the antivirus also has some logic for detecting the encoding when scanning that html file before the browser can access it.

Comments are closed.
