Some files come up strange in Notepad


David Cumps discovered that certain text files come up strange in Notepad.

The reason is that Notepad has to edit files in a variety of encodings, and when its back is against the wall, it is sometimes forced to guess.

Here's the file "Hello" in various encodings:

48 65 6C 6C 6F

This is the traditional ANSI encoding.

48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with no BOM.

FF FE 48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with BOM. The BOM (FF FE) serves two purposes: First, it tags the file as a Unicode document, and second, the order in which the two bytes appear indicates that the file is little-endian.

00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with no BOM. Notepad does not support this encoding.

FE FF 00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with BOM. Notice that this BOM is in the opposite order from the little-endian BOM.

EF BB BF 48 65 6C 6C 6F

This is UTF-8 encoding. The first three bytes are the UTF-8 encoding of the BOM.

2B 2F 76 38 2D 48 65 6C 6C 6F

This is UTF-7 encoding. The first five bytes are the UTF-7 encoding of the BOM. Notepad doesn't support this encoding.

Notice that the UTF-7 BOM encoding is just the ASCII string "+/v8-", which is difficult to distinguish from a regular file that happens to begin with those five characters (as odd as they may be).
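As a sanity check, the byte sequences above are easy to reproduce. Here is a sketch in Python (the labels are mine, chosen for illustration; Notepad itself, of course, works in terms of Windows code pages and APIs):

```python
import codecs

text = "Hello"

encodings = {
    # Traditional ANSI ("plain ASCII" for this string): 48 65 6C 6C 6F
    "ANSI":         text.encode("cp1252"),
    # Unicode (little-endian), no BOM
    "UTF-16LE":     text.encode("utf-16-le"),
    # Unicode (little-endian) with BOM: FF FE ...
    "UTF-16LE+BOM": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    # Unicode (big-endian) with BOM: FE FF ...
    "UTF-16BE+BOM": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    # UTF-8 with BOM: EF BB BF ...
    "UTF-8+BOM":    text.encode("utf-8-sig"),
    # UTF-7 with BOM: the ASCII string "+/v8-" followed by the text
    "UTF-7+BOM":    b"+/v8-" + text.encode("ascii"),
}

for name, data in encodings.items():
    print(f"{name:13} {data.hex(' ').upper()}")
```

Each printed line matches the corresponding hex dump shown above.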

The encodings that do not have special prefixes and which are still supported by Notepad are the traditional ANSI encoding (i.e., "plain ASCII") and the Unicode (little-endian) encoding with no BOM. When faced with a file that lacks a special prefix, Notepad is forced to guess which of those two encodings the file actually uses. The function that does this work is IsTextUnicode, which studies a chunk of bytes and does some statistical analysis to come up with a guess.

And as the documentation notes, "Absolute certainty is not guaranteed." Short strings are most likely to be misdetected.
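The failure mode is easy to simulate without calling IsTextUnicode itself: any even-length run of ASCII bytes is also structurally valid UTF-16-LE, so a guesser relying on statistics has two plausible readings. A rough illustration in Python, using the string from the well-known misdetection report:

```python
data = b"Bush hid the facts"        # 18 ASCII bytes, a famously misdetected string

as_ansi = data.decode("cp1252")     # the intended 8-bit reading
as_utf16 = data.decode("utf-16-le") # equally well-formed: nine CJK code points

print(as_ansi)
print(as_utf16)
```

Both decodes succeed without error; only statistics can pick between them, and on short inputs the statistics are thin.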

[Raymond is currently on vacation; this message was pre-recorded.]

Comments (29)
  1. Anonymous says:

    Notepad does a good job of detecting these variations, in my experience.

    However, it’s confusing when Microsoft documentation refers to UCS-2 (or is it UTF-16 now?) as "Unicode". I’ve seen a lot of people who think that Unicode means "two bytes per character", which isn’t even true of UTF-16. UTF-8 and UTF-7 are no less Unicode than UCS-2/UTF-16.

  2. Anonymous says:

    Hey, thanks for the nice explanation!

  3. Anonymous says:

    As I understand it, UCS-2 != UTF-16

    UCS-2 can only encode U+0000 to U+FFFF (2 bytes per wide char, no more)

    UTF-16 can encode all Unicode code points.
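    The distinction is easy to see in code. A small Python sketch (Python's `utf-16-le` codec stands in here; a strict UCS-2 codec would simply have no representation for the second character):

```python
# UCS-2 covers only the Basic Multilingual Plane (U+0000..U+FFFF).
# UTF-16 reaches the rest of Unicode via surrogate pairs.
bmp = "\u20AC"          # EURO SIGN, inside the BMP: one 16-bit code unit
astral = "\U0001F600"   # an emoji outside the BMP: two code units

print(bmp.encode("utf-16-le").hex(" "))     # ac 20
print(astral.encode("utf-16-le").hex(" "))  # 3d d8 00 de  (surrogates D83D DE00)
```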

  4. Anonymous says:

    that makes a lot of sense…

    but…

    If Notepad is unsure, shouldn't it ask the user whether the text is displayed correctly and, if not, try another encoding?

    (sorry about the spelling)

  5. Anonymous says:

    Nate: Just wondering… did UCS-2 / UTF-16 even exist back when Unicode was in v1.0? IIRC, when Unicode first started out (and Microsoft implemented it from v1), there was only UCS-2. Which explains why they call that "Unicode" – because when they started implementing, that WAS Unicode.

  6. Anonymous says:

    Talking about Notepad, why does it reformat itself when I save, yet forget to repaint itself? It reformats itself using its line-wrapping algorithm, which behaves differently prior to a save.

    Notepad has been like this since NT4 at least.

  7. Anonymous says:

    I think you’re right, Simon. At the time the Unicode people were thinking it would be a 2-byte encoding. Still, that was a long time ago. Even newer systems like C# are still using UCS-2 and calling it Unicode.

  8. Anonymous says:

    Unicode is nothing but a big table. Therefore nothing is really "Unicode" except the Unicode table itself. In the computer world we deal with encodings of values in the Unicode table, and in that sense UCS-2 is a Unicode encoding just as UCS-4 is a Unicode encoding. That is, UCS-4 isn't "Unicode" in the same way that UCS-2 isn't "Unicode." But UCS-2 is a perfectly valid Unicode encoding. Microsoft chose UCS-2 as its internal Unicode encoding; it could have chosen UTF-8 and still called it "Unicode".

    This is a very handy site dealing with UTF-8, but also addresses a lot of stuff surrounding Unicode: http://www.cl.cam.ac.uk/~mgk25/unicode.html

  9. Anonymous says:

    Search for "EM_GETHANDLE wrap" (without the quotes) on Google Groups for an explanation of how Notepad (and other apps) implement word wrapping with the edit control. Basically it flashes because it destroys and recreates the edit control.

  10. Anonymous says:

    Exactly Joe. In Raymond’s post above, you can see some confusion. He lists the available encodings as ANSI, Unicode, UTF-8, UTF-7, and so on.

    But "Unicode" is not an encoding. UTF-8, UTF-7 etc. are encodings. The encoding which he refers to as Unicode is properly called UCS-2.

    This is a small but nagging problem in Microsoft documentation.

  11. Anonymous says:

    Actually the encodings that do not have special prefixes and which are still supported by Notepad are the traditional ANSI encoding (i.e., Shift-JIS, ANSI code page 932) and the Unicode (little-endian) encoding with no BOM.

    But in the example posted by David Cumps, Notepad did not choose between those two encodings. Notepad chose a wildly different encoding. Notepad used its usual Japanese font for display, in which it chose a total of eight items: five double-byte full-width Kanji characters, and three single-byte non-displayable characters, for which it displayed three half-width black rectangles. But the encoding is not a Japanese encoding, so Notepad's choice of characters was nonsense.

    Here’s another example, related to a recently discussed IE security bug. I think you’ll need a tool other than Notepad to create the file, a single byte with value 0x01, i.e. a Ctrl-A. Open the file in Notepad and it displays a British pound sign. If you have a Japanese font, it displays a full-width British pound sign, as if it were a double-byte character. But move the text cursor to the right and it only moves half-way through the character, because there’s only a single byte. Add some single-byte half-width characters after that, and you can see Notepad get really confused about which characters are which. Add some full-width characters and you can watch Notepad move the text cursor through midpoints of characters instead of between characters.

  12. Anonymous says:

    Yeah Joe and Nate,

    The confusing thing with Unicode is that everyone confuses its various encodings with being Unicode itself.

    When in reality Unicode (ISO/IEC 10646) defines an ordered set of code values assigned to character names and properties, plus rules for mapping those code values and properties to glyphs.

    The collective code values and properties can be used with other Unicode rules to manipulate the character entities for various writing systems and natural languages.

    Even more confusion comes from the need to translate from the various UTF-X encodings or UCS-2 code values into the full-spectrum UCS-4 code values.

    In the end, all the encoding stuff just results in a precise lookup for finding the code value and character properties from the byte encodings, whether they be UTF-7, UTF-8, UTF-16, or UTF-32.

    I think the confusion is more of a Unicode naming/branding problem than a Microsoft documentation issue. Unicode really is the brand for compliance with the standard, not the encoding of the standard.

    Most coders do not care about Unicode; all they need to know is how do I decode/encode it, how do I detect it, and what APIs work with it.

    After all, isn't that the purpose of the Uniscribe APIs? Not sure what their equivalent is in .NET yet.

  13. Anonymous says:

    > Most coders do not care about Unicode; all
    > they need to know is how do I decode/encode
    > it, how do I detect it, and what APIs work
    > with it.
    >
    > After all, isn't that the purpose of the
    > Uniscribe APIs? Not sure what their
    > equivalent is in .NET yet.

    No equivalent in .NET, alas. And Uniscribe is one scary, badly documented, weak-on-samples API.

    It’s fine if you’re doing single-style text, but the moment you go for something with more to it, or start worrying about resolution independent layout, and you start biting your fingernails…

  14. Anonymous says:

    That was insightful!

    But how come other editors (WordPad, emacs) open the file correctly? I mean, how do they know what encoding it is?

    And if they can figure out, then why not Notepad?

  15. Anonymous says:

    Notepad would get it right too if it didn’t employ IsTextUnicode’s documented-to-be-unreliable "statistical analysis" (IS_TEXT_UNICODE_STATISTICS).

  16. Anonymous says:

    Bertg: On why Notepad doesn't prompt if it's not sure: Because it's almost never 100% sure. The file "Hi" sure looks like 8-bit ASCII, but maybe it's the single Unicode character U+6948. The UTF-8 file "Hi" might actually be an 8-bit file that happens to begin with EF BB BF ("ï»¿").

    If Notepad prompted if there was ambiguity, it would be prompting an awful lot.

    You can try to override the autodetector from the File.Open dialog if you find that it detected incorrectly.
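    The "Hi" ambiguity is easy to verify. A minimal Python sketch:

```python
data = b"Hi"                      # two bytes: 48 69

print(data.decode("cp1252"))      # "Hi" -- the 8-bit reading
print(data.decode("utf-16-le"))   # U+6948 -- one UTF-16-LE character
```

Both decodes are valid; nothing in the bytes themselves says which one was intended.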

  17. Anonymous says:

    With the explosion of international text resources brought by the Internet, the standards for determining file encodings have become more important. This is my attempt at making the text file encoding issues digestible by leaving out some of the unimportant…

  18. Anonymous says:

    Let’s take another look.

Comments are closed.