How do you deal with an input stream that may or may not contain Unicode data?


Dewi Morgan reinterpreted a question from a Suggestion Box of times past as "How do you deal with an input stream that may or may not contain Unicode data?" A related question from Dave wondered how applications that use CP_ACP to store data could ensure that the data is interpreted in the same code page by the recipient. "If I send a .txt file to a person in China, do they just go through code pages until it seems to display correctly?"

These questions are additional manifestations of Keep your eye on the code page.

When you store data, you need to have some sort of agreement (either explicit or implicit) with the code that reads the data as to how the data should be interpreted. Are they four-byte sign-magnitude integers stored in big-endian format? Are they two-byte ones-complement signed integers stored in little-endian format? Or maybe they are IEEE floating-point data stored in 80-bit format. If there is no agreement between the two parties, then confusion will ensue.
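The byte-order half of that agreement is easy to sketch in Python with the standard `struct` module: the same four bytes yield different integers depending on which endianness the reader assumes. (The byte values are illustrative.)

```python
import struct

raw = b"\x00\x00\x01\x00"  # four bytes on the wire

# The reader's assumption determines the value:
as_big_endian = struct.unpack(">i", raw)[0]     # 0x00000100
as_little_endian = struct.unpack("<i", raw)[0]  # 0x00010000

assert as_big_endian == 256
assert as_little_endian == 65536
```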

That your data consists of text does not exempt you from this requirement. Is the text encoded in UTF-16LE? Or maybe it's UTF-8. Or perhaps it's in some other 8-bit character set. If the two sides don't agree, then there will be confusion.

In the case of files encoded in CP_ACP, you have a problem if the source and destination have different values for CP_ACP. That text file you generate on a US-English system (where CP_ACP is 1252) may not make sense when decoded on a Chinese-Simplified system (where CP_ACP is 936). It so happens that all Windows 8-bit code pages agree on code points 0 through 127, so if you restrict yourself to that set, you are safe. The Windows shell team was not so careful, and they slipped some characters into a header file which are illegal when decoded in code page 932 (the CP_ACP used in Japan). The systems in Japan do not cycle through all the code pages looking for one that decodes without errors; they just use their local value of CP_ACP, and if the file makes no sense, then I guess it makes no sense.
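A quick sketch of the CP_ACP mismatch, using Python's codec names (`cp1252` and `cp936` stand in for the US-English and Chinese-Simplified systems; the sample text is illustrative):

```python
data = "été".encode("cp1252")  # b'\xe9t\xe9' as written on the US system

# Decoded with the recipient's CP_ACP, the same bytes mean something else:
us_view = data.decode("cp1252")
cn_view = data.decode("cp936", errors="replace")
assert us_view == "été"
assert us_view != cn_view

# But code points 0 through 127 are safe: the code pages agree on ASCII.
assert b"plain ASCII".decode("cp1252") == b"plain ASCII".decode("cp936")
```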

If you are in the unfortunate situation of having to consume data where the encoding is unspecified, you will find yourself forced to guess. And if you guess wrong, the result can be embarrassing.
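One common guessing heuristic (a sketch, not a recommendation, and the fallback code page is an assumption) is to try strict UTF-8 first, since UTF-8 is largely self-validating, and fall back to a legacy code page only if that fails:

```python
def guess_decode(data: bytes) -> str:
    # Hypothetical heuristic: UTF-8 rejects most non-UTF-8 byte
    # sequences, so a strict decode doubles as a validity check.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback for this sketch; a real application would
        # pick the fallback based on its own context.
        return data.decode("cp1252")

print(guess_decode("café".encode("utf-8")))   # café
print(guess_decode("café".encode("cp1252")))  # café (via the fallback)
```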

Bonus chatter: I remember one case where a customer asked, "We need to convert a string of chars into a string of wchars. What code page should we pass to the MultiByteToWideChar function?"

I replied, "What code page is your char string in?"

There was no response. I guess they realized that once they answered that question, they had their answer.

Comments (27)
  1. Joshua says:

    [ The Windows shell team was not so careful, and they slipped some characters into a header file which are illegal when decoded in code page 932 ]

    Ah yes, the one I considered a compiler bug. Invalid characters in comments should not bomb the compiler. They sure as heck don't bomb gcc.

  2. Is there any reason to use CP_ACP anymore? I can see using UTF-8 (or its ASCII subset) and I can see using UTF-16, but CP_ACP on any data that might leave the machine is just asking for trouble.

  3. Henning Makholm says:

    @Joshua: What would you have the compiler do? If it cannot decode the incoming byte stream into characters, it has no way of knowing where the comment ENDS, so exiting with a polite error message would seem to be the Right Thing to do.

    GCC's behavior is appropriate for a compiler that decides to assume that the source file encoding is always "some unspecified 8-bit charset with ASCII at the bottom and something at the top", where the meaning of the somethings is not relevant except insofar as their byte values can be dumped raw into the source file for string and character constants. However, a compiler that tries to support input encodings outside this pattern cannot afford to be so cavalier. With stateful encodings such as ISO/IEC 2022, ignoring decoding errors and scanning forward to find the next pair of bytes that look like "*/" may risk finding something that was not intended as a comment delimiter at all, or may conversely miss something that WAS intended as a comment delimiter because a shift state was lost.
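The point about stateful encodings is visible with Python's `iso2022_jp` codec: after the shift sequence, perfectly ordinary ASCII byte values are halves of Japanese characters, so a byte-level scan for something like `*/` is meaningless. (This is an editor's sketch of the comment's argument, not from the original post.)

```python
encoded = "こんにちは".encode("iso2022_jp")

# After the ESC $ B shift, bytes in the ASCII range are *not* ASCII:
assert b"$3" in encoded        # 0x24 0x33 is the JIS code for こ
assert "$" not in "こんにちは"  # yet the text contains no dollar sign
```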

  4. A. Skrobov says:

    [ It so happens that all Windows 8-bit code pages agree on code points 0 through 127 ]

    How about backslash, ¥ (Yen sign), and ₩ (Won sign)?

    blogs.msdn.com/…/469941.aspx

  5. dave says:

    @Maurits

    The only reason I can see for not using Unicode is that you have to interface with some old cruft.

    Otherwise, why not make life easy for yourself and use what has been the standard encoding for the life of the Windows NT family of systems?

    Sure, you still have a residual issue or two: UTF-8/UTF-16, endianness.  But that's nothing compared to what you used to have to care about.

  6. pcooper says:

    I find that most tools work with Unicode just fine, and as long as you set your encodings all along your chain of applications, it all works great. But they don't use it by default, for backwards compatibility. One day they may be able to change the defaults, but sadly that day might not have arrived yet.

    Also, in case anyone hadn't seen it yet, Joel Spolsky's article on this is great for explaining the basics: http://www.joelonsoftware.com/…/Unicode.html

  7. Anonymous Coward says:

    @Pcooper: Yep. All together now: ‘There Ain't No Such Thing As Plain Text.’

  8. @A. Skrobov

    In this situation they are just font-face replacements for the backslash/reverse solidus. Windows itself still treats the character as a backslash.
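A quick check bears this out (`cp932` is Python's name for Windows code page 932): the byte 0x5C is still U+005C at the character level; only the font draws it as ¥.

```python
# Byte 0x5C in code page 932 decodes to U+005C REVERSE SOLIDUS;
# the yen glyph is purely a font substitution on Japanese systems.
assert b"\x5c".decode("cp932") == "\\"
assert "\\".encode("cp932") == b"\x5c"
```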

  9. No says:

    There's a much simpler approach: all byte streams are UTF-8 until proven otherwise. This rule works the vast majority of the time. In a UTF-8 world, we do have "plain text". The world outside Building 26 lives in this UTF-8 world. We can go a long, long way toward making Windows more sane by just making UTF-8 a valid multibyte encoding and switching all systems to it by default.

  10. Joshua says:

    @Henning Makholm: The C language is defined in 7-bit characters. If dropping chars until resync doesn't work, the file is in a completely inappropriate encoding. Obviously you can't drop chars unless the lexer state is inside a comment, though.

    Also, if you get decode errors in header files, there's an excellent chance the reason you're getting them is you guessed the wrong encoding. Slam it back to 7 bit in that case.

    [The C language does not define its character set in terms of bits or even ASCII. It says that (5.1.1.2) phase 1 of translation is "Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set." This is done before comments are stripped in phase 3. Per 5.2.1, the source character set consists of the basic character set and the extended character set. There is no requirement that the basic character set be 7-bit ASCII. (EBCDIC encoding is legal, and most of the interesting characters in EBCDIC are greater than 128.) And the language permits extended characters in comments and string literals. -Raymond]
  11. Joshua says:

    @No: Well do I know it, but Michael Kaplan says it cannot be done. I say screw it, fix the core APIs and most programs will behave. If anybody needs to use a program that broke, that's what program specific encoding declarations are for (note you could decide to give either the old or the new program the specific encoding rather than the session encoding).

  12. Goran says:

    Cue UTF-8 Napoleons in 3…2…1…

  13. I like some of those dual-purpose objects, in computing and elsewhere – like the Sysinternals tools where a single EXE would deliver the appropriate tool on the DOS-based Windows family and both 32- and 64-bit Windows NT, rather than make the user figure out whether they need procexp64.exe, procexp32.exe or procexp95.exe – or carefully crafted sentences which have a meaning in two different languages, or the entire German conversation consisting largely of the word "morgen", used both as a greeting and to mean 'tomorrow'.

    In a way, all PE executables do this, with a 'DOS stub' if run from the DOS command prompt – usually just showing an error message that it requires Windows, but occasionally having other functionality, like loading up the HX DOS extender to allow (some) Win32 code to run under DOS too. I think I once saw an EXE file which worked as a DOS diagnostic program, a bootable ISO image running the same diagnostic program standalone, and also as a Windows program which burned itself to CD to boot from – quite a clever trick.

  14. AndyCadley says:

    @No, LOL. Try telling that to 90% of the *nix world and pretty much every F/OSS application out there. Along with huge swathes of other applications written in C/C++ that absolutely and resolutely believe that one byte = one character. It's fine to say we live in a UTF-8 world if what you're mostly dealing with is really just ASCII text, but the minute you hit the multi-byte world (which happens even more with UTF-8 than with ANSI) things start to break very, very quickly.

  15. chentiangemalc says:

    I think guess work can be greatly assisted if you know the target language of data. For example if you don't know the encoding, but you know it's supposed to be Chinese, you can work out the encoding used, as evidenced by various "encoding fixers" that turn gibberish back into Chinese text.
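A hypothetical "encoding fixer" along those lines might just walk a candidate list (the function name and candidate order are illustrative; real fixers also score character frequencies against the expected language):

```python
def fix_mojibake(data: bytes, candidates=("utf-8", "gb18030", "big5")):
    # Return the decoded text plus the first candidate encoding that
    # decodes the bytes without error.
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1"), "latin-1"  # last resort: never fails

print(fix_mojibake("中文".encode("gb18030")))  # ('中文', 'gb18030')
```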

  16. Joshua says:

    @AndyCadley: It's funny how little broke when UNIX shifted from 8-bit clean ASCII to UTF-8. Most of what broke turned out to be not 8-bit clean in corner cases and a few programs that fell down on chopped ends of UTF-8 strings (when interfacing with the old programs).

    […Per 5.2.1, the source character set consists of the basic character set and the extended character set. There is no requirement that the base character set be 7-bit ASCII…..]

    Which turns out to be technically true but irrelevant, as there is no CP_ other than UTF-7 that isn't 7-bit clean in the ASCII range, so we can exclude those. As for the strings containing embedded characters, that's why I said drop bytes only if inside comments.
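The "chopped ends of UTF-8 strings" failure mentioned above is easy to reproduce: cutting a UTF-8 byte string in the middle of a multi-byte sequence leaves invalid data that a strict decoder rejects.

```python
data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
chopped = data[:2]               # cut inside the two-byte é

try:
    chopped.decode("utf-8")
    survived = True
except UnicodeDecodeError:
    survived = False
assert not survived              # the truncated tail is invalid UTF-8
```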

  17. @Joshua

    Actually not that irrelevant since it means that the compiler is codepage agnostic. This means that the compiler itself can use any codepage as long as it can handle all of the basic C characters, and UTF-7 can, so you can't even exclude that. So basically, the compiler can use UTF-8, UTF-16 or even UTF-32 as its basic character set and not have any standard issues.

    This is also why it isn't easy to drop unknown bytes inside comments. What happens if the malformed bytes include one or both of the */ of a multi line comment? You will end up with a malformed program. So if a compiler does just drop comments even if it can't parse all of it under the assumption "it is all a comment" then you will end up with the possibility of the compiler incorrectly translating the program. While it is frustrating, I agree with a compiler error in the cases where it can't successfully read everything.

  18. cheong00 says:

    @Malcolm: I don't think so… Take Google Translate, for example. Sometimes its autodetect feature detects Simplified Chinese as Japanese… :P (Although I think the codepage was changed automatically in the C&P process, so what Google Translate sees is the bytes after the change to UTF-8.)

  19. caf says:

    “ I replied, "What code page is your char string in?" ”

    Raymond demonstrates the Socratic method ;)

  20. Skyborne says:

    After using {popular open-source database that silently mangles input by default}, and Python, I have to say I'm in favor of halting noisily on input encoding errors.  Programming is hard enough when you think you know what you're doing; going full garbage-in/garbage-out makes tracing the source of the error especially painful.

    @AndyCadley, I haven't had UTF-8 issues in FOSS in ages.  Other than cleaning up messes other people made by throwing utf-8 byte streams into "latin1" columns (which are, in fact, windows-1252) of {popular database}.

  21. joshua says:

    Wow, that's one way of doing header file switches I've never seen.

    /me wonders why.

    /me is sure it has nothing to do with encoding.

  22. When you're dealing with a really widely deployed product (like Windows), "vanishingly small" risks happen all the time.

    Suppose there is a bug that only hits 0.1% of users.

    If you have ten million users, that hits 10,000 people.

  23. Joshua says:

    @Crescens2k: If that happens in the example encoding, the end multibyte sequence will cause a compiler error. The probability of a malformed stream being fixed to compile by dropping bytes and then causing a runtime error is vanishingly small.

    [I disagree that the risk is vanishingly small. Consider:
    /* ❦ ENABLE_EXTRA_BOOST /* change symbol to close-comment for extra boost */
    This is a plausible pattern for configuration header files. If the ❦ gets mis-parsed as a close-comment, extra boost will be enabled when it shouldn't. -Raymond
    ]
  24. Myria says:

    One that I ran into in the past is that the ASCII code point for backslash can be a valid trailing byte in the Shift-JIS code page.  If the C compiler is not aware that your source file is Shift-JIS, and you have one of those Japanese characters at the end of a single-line comment, the compiler will consider the following line to be a comment as well.  That was a fun one to figure out.
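This is reproducible in Python (`cp932` is the Shift-JIS variant Windows uses); katakana ソ is the classic offender:

```python
data = "ソ".encode("cp932")
assert data == b"\x83\x5c"  # the trail byte 0x5C is ASCII backslash
assert b"\\" in data
# A compiler unaware of Shift-JIS sees a line such as
#   // comment ソ
# as ending in a backslash, i.e. a line continuation that glues the
# next source line onto the comment.
```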

    Personally, I use precomposed UTF-8 for application internals, then convert between that and whatever the OS expected when calling into the OS-abstraction libraries.  In Linux, it passes through.  In Windows, it converts to and from UTF-16.  In Mac OS X, it normally passes through, but sometimes needs to do precomposed<->composite conversion.
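The precomposed/decomposed distinction Myria mentions (NFC vs. NFD, the conversion needed on Mac OS X) can be sketched with the standard `unicodedata` module:

```python
import unicodedata

nfd = unicodedata.normalize("NFD", "é")  # 'e' + U+0301 combining acute
nfc = unicodedata.normalize("NFC", nfd)  # single precomposed U+00E9

assert len(nfd) == 2 and len(nfc) == 1
assert nfd != nfc  # different code point sequences for the same text
assert unicodedata.normalize("NFC", nfd) == nfc  # normalizing reconciles them
```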

  25. Joshua says:

    @Myria: Ah yes, the // bozo where the culprit is invisible. I always considered that a bug in the language specification itself. Really, I can't come up with a good excuse for a language to have a syntactical element that changes code behavior based on trailing whitespace.

    Oh, and this one's worse: copy your file in binary mode to a *nix machine and the compilation result changes (because ^M and ^J are now two characters rather than one).

  26. Alex Cohn says:

    For a long while, Windows SDK had an extravagant apostrophe in mmsystem.h which caused a warning in Hebrew code page. The problem was that Visual Studio 6 insisted on using this code page and resources language (in the dialogue editor) based on the computer regional settings.

    But this was only a warning.

  27. @Alex Cohn, can you give me more information on the mmsystem.h apostrophe?  I'm looking at some versions going back as far as 1999 and I'm not seeing it (though I might not be looking correctly.) You can email me at (mateer at microsoft dot com) in case I forget to check this thread.

Comments are closed.
