What’s the difference between Text Document, Text Document – MS-DOS Format, and Unicode Text Document?


Alasdair King asks why WordPad offers three text formats: Text Document, Text Document - MS-DOS Format, and Unicode Text Document. "Isn't at least one redundant?"

Recall that in Windows, three code pages have special status.

  1. Unicode (more specifically, UTF-16LE)
  2. CP_ACP, commonly known as the ANSI code page, although that is a misnomer
  3. CP_OEM, commonly known as the OEM code page, although that too is a misnomer.

Three text file formats. Three encodings. Hm... I wonder...

As you might have guessed by now, the three text file formats correspond to the three special code pages. Now it's just a matter of deciding which one matches with which. The easiest one is the Unicode one; it seems clear that Unicode Text Document matches with Unicode. Okay, we now have to figure out how Text Document and Text Document - MS-DOS Format map to CP_ACP and CP_OEM. But another piece of the puzzle is pretty clear, because MS-DOS used the so-called OEM code page. Therefore, by process of elimination, Text Document corresponds to CP_ACP.

Now that we have puzzled out what the three text formats correspond to, we can address the question "Isn't at least one redundant?"

Michael Kaplan explained that ACP and OEM are (usually) different. And neither is the same as Unicode. So in fact all three are (usually) different.
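A quick way to see that the ANSI and OEM code pages really do disagree is to encode the same text with each and then read the bytes back with the wrong one. The sketch below assumes the US English defaults (code page 1252 for ANSI, code page 437 for OEM) and uses the sample word "café", which is my own choice for illustration. The variable names are likewise hypothetical.

```python
# Assumption: US English system defaults, i.e. ANSI = cp1252 and OEM = cp437.
text = "caf\u00e9"  # "café", written with an escape so the source encoding doesn't matter

ansi_bytes = text.encode("cp1252")  # how a "Text Document" would store it
oem_bytes = text.encode("cp437")    # how an "MS-DOS Format" document would store it

# Reading each file with the *other* code page produces the wrong character.
ansi_read_as_oem = ansi_bytes.decode("cp437")
oem_read_as_ansi = oem_bytes.decode("cp1252")

print(ansi_bytes.hex(" "), "->", ansi_read_as_oem)
print(oem_bytes.hex(" "), "->", oem_read_as_ansi)
```

The plain ASCII letters survive either way; only the accented character, which the two code pages place at different byte values, gets mangled.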

In the United States, the so-called ANSI code page is code page 1252, the so-called OEM code page is code page 437, and Unicode is code page 1200. Here's the string résumé expressed in each of the three encodings.

| Description | Encoding | Code page (en-us) | Bytes |
| --- | --- | --- | --- |
| Text Document | CP_ACP | 1252 | 72 E9 73 75 6D E9 |
| Text Document - MS-DOS Format | CP_OEM | 437 | 72 82 73 75 6D 82 |
| Unicode Text Document | UTF-16LE | 1200 | FF FE 72 00 E9 00 73 00 75 00 6D 00 E9 00 |

Three encodings, three different files. No redundancy.
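If you want to reproduce those byte sequences yourself, here is a minimal sketch. It assumes the US English code pages (1252 and 437) and uses Python's codec names for them. The `hex_dump` helper and the `encodings` dictionary are names made up for this illustration and have nothing to do with how WordPad itself is written.

```python
import codecs

TEXT = "r\u00e9sum\u00e9"  # "résumé", with precomposed U+00E9


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return data.hex(" ").upper()


# Assumption: US English defaults, so CP_ACP is 1252 and the OEM code page is 437.
encodings = {
    "Text Document": TEXT.encode("cp1252"),
    "Text Document - MS-DOS Format": TEXT.encode("cp437"),
    # The Unicode file starts with a byte order mark (FF FE), followed by UTF-16LE.
    "Unicode Text Document": codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le"),
}

for name, data in encodings.items():
    print(f"{name}: {hex_dump(data)}")
```

The explicit `utf-16-le` codec is used together with a hand-added BOM because Python's plain `utf-16` codec picks the byte order of the machine it runs on.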

Comments (16)
  1. Karellen says:

    72 C3 A9 73 75 6D C3 A9. :-(

  2. Kyle S. says:

    If "Unicode" means "UTF16-LE", why does it include a BOM?

    [To avoid this problem. -Raymond]
  3. Rodrigo says:

    Very clarifying.

    Many thanks.

  4. chentiangemalc says:

    Excel also has ".csv" format and ".csv (MS-DOS)" format

  5. Simon Buchan says:

    I'm surprised you didn't take the opportunity to call "Unicode" a misnomer too, then you would have been 3 for 3!

    I'm surprised Wordpad doesn't support UTF-8, when Notepad does: especially since nothing on Windows will correctly detect your example OEM text file.

    [It's less surprising if you recall that Wordpad was written before Notepad had UTF-8 support. -Raymond]
  6. Joshua says:

    @chentiangemalc: Yeah, and both are broken. MAR01 is not necessarily a date but when you save it back you get 03/01.

  7. Will says:

    chentiangemalc, That is because you have a csv format that puts double quotes around strings and treats commas in the double quotes as part of the string not a separator.

  8. JM says:

    The CSV family are bad neighbours. You've got grampa Quoteless, mommy Semicolon, junior Commachallenged and cousin Singleline, and they all misbehave in various ways. I blame daddy Standard for never being around.

    I'd rather have encoding issues. Sure, those are annoying too, but at least the damage is usually fixable.

  9. And the creepy uncle pipe, who no one likes to talk about.

  10. And there is no way my grandmother would have a clue what the difference between any of these options is.  And think of the number of people who use Windows 7 and don't even know what "MS-DOS" or "Unicode" is.

  11. kme says:

    It seems like it would have been better to just explicitly name the options after the codepages, since the people who care end up decoding the euphemisms back into the codepages, and the people who don't care don't understand the euphemisms either.

  12. Wilczek says:

    Hi,

    Since Notepad was also mentioned I'd like to ask if you could quickly summarize why there was a file size limit while opening a text file in Notepad on the Win9x series? If I remember correctly Notepad could not open text files larger than 32KB (or 64KB?) on 9x, but there was no such limit on NT4 (on Win2000 for sure not).

    Thanks

  13. lmgtfy says:

    Googled for "windows 95 notepad limits" and the first result, Wikipedia, says 64k for Windows before NT, which matches my memory.

  14. ender says:

    @Wilczek: most likely because the text control is limited to 64kB (and Notepad is pretty much just a wrapper around it).

  15. Neil says:

    Don't get me started on CSV files; SQL Server Express 2005 doesn't really know what they are, so I had to trawl my dataset looking for the one ASCII character that wasn't present in my data so I could use it as the delimiter. (Non-ASCII characters might have worked or they might have introduced their own headaches.)

    @Wilczek I'm sure Raymond himself covered this by pointing out that Notepad is simply a glorified Edit control.

  16. Horst Kiehl says:

    @kme: That might be true now, but at the time when those distinctions were introduced, it was possibly much more common for a user to know "my file came from an MS-DOS program" rather than "my file is in codepage 437".

    Also, a description like "Text Document – MS-DOS Format (Codepage 437 Text Document)" would get rather long, and then, to be able to generate these strings, Notepad also would have to determine what codepages are the current ones for the CP_ACP and CP_OEM encodings, and then it would have to put them together by the rules for the current display language. Even if this could be done with a reasonable amount of work, suddenly the number of test cases for this part of Notepad is of O(n²) instead of O(1).
