Keep your eye on the code page


Remember that there are typically two 8-bit code pages active, the so-called "ANSI" code page and the so-called "OEM" code page. GUI programs usually use the ANSI code page for 8-bit files (though UTF-8 is becoming more popular lately), whereas console programs usually use the OEM code page.

This means, for example, when you open an 8-bit text file in Notepad, it assumes the ANSI code page. But if you use the TYPE command from the command prompt, it will use the OEM code page.

This has interesting consequences if you switch between the GUI and the command line frequently.

The two code pages typically agree on the first 128 characters, but they nearly always disagree on the characters from 128 to 255 (so-called "extended characters"). For example, on a US-English machine, character 0x80 in the OEM code page is Ç, whereas in the ANSI code page it is €.
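
To see the disagreement for yourself, here is a minimal Python sketch. It assumes the US-English pair, code page 437 for OEM and code page 1252 for ANSI, and uses Python's built-in codecs of those names; the helper decode_both is made up for the illustration.

```python
# Assumption: a US-English system, where OEM = code page 437 and
# ANSI = code page 1252. Other locales use other pairs.
OEM = "cp437"
ANSI = "cp1252"

def decode_both(byte_value):
    """Decode one byte under both code pages (hypothetical helper)."""
    raw = bytes([byte_value])
    # errors="replace" because a few bytes are unassigned in cp1252.
    return raw.decode(OEM, errors="replace"), raw.decode(ANSI, errors="replace")

# Do the first 128 byte values decode identically under both code pages?
low_half_agrees = all(oem == ansi for oem, ansi in map(decode_both, range(128)))

# Which extended bytes (128..255) happen to decode to the same character?
high_half_matches = [b for b in range(128, 256) if len(set(decode_both(b))) == 1]

print(low_half_agrees)
print(decode_both(0x80))  # ('Ç', '€')
print(len(high_half_matches), "of 128 extended bytes decode identically")
```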

Consider a directory which contains a file named Ç. If you type "dir" at a command prompt, you see a happy Ç on the screen. On the other hand, if you do "dir >files.txt" and open files.txt in a GUI editor like Notepad, you will find that the Ç has changed to a €, because the 0x80 in the file is being interpreted in the ANSI character set instead of the OEM character set.

Stranger yet, if you mark/select the file name from the console window and paste it into Notepad, you get a Ç. That's because the console window's mark/select code saves text on the clipboard as Unicode; the character saved into the clipboard is not 0x80 but rather U+00C7, the Unicode code point for "Latin Capital Letter C With Cedilla". When this is pasted into Notepad, it gets converted from Unicode to the ANSI code page, which on a US-English system encodes the Ç character as 0xC7.
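
The two paths the file name can take can be sketched in Python, again assuming code pages 437 and 1252 for a US-English system. This only models the byte conversions; it does not touch the real console or clipboard.

```python
# The byte the console holds for the file name, in the OEM code page.
oem_byte = b"\x80"

# Path 1: "dir >files.txt", then open the file in an ANSI editor.
# The byte is unchanged; only the interpretation differs.
misread = oem_byte.decode("cp1252")

# Path 2: mark/select in the console, paste into an ANSI editor.
# The console puts Unicode text on the clipboard, and the editor
# converts that text to the ANSI code page when it saves.
clipboard_text = oem_byte.decode("cp437")
ansi_bytes = clipboard_text.encode("cp1252")

print(misread)                    # €
print(hex(ord(clipboard_text)))   # 0xc7
print(ansi_bytes)                 # b'\xc7'
```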

But wait, there's more. The command processor has an option (/U) to generate all piped and redirected output in Unicode rather than the OEM code page.

(Note that the built-in documentation for the command processor says that the /A switch produces ANSI output; this is incorrect. /A produces OEM output. This is one of those bugs that you recognize instantly if you are familiar with what is going on. It's so obviously OEM that when I see the documentation say "ANSI", my mind just reads it as "OEM". In the same way native English speakers often fail to notice misspellings or doubled words.)

If you run the command

cmd /U /C dir ^>files.txt

then the output will be in Unicode and therefore will record the Ç character as U+00C7, which Notepad will then be able to read back.
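
If you want to read such a file from your own code, something like the following Python sketch should work. It assumes that "Unicode" here means UTF-16 little-endian with no byte-order mark, which is the usual meaning on Windows but is an assumption on my part; the function name read_cmd_unicode_output is hypothetical, and the sample bytes are built by hand rather than produced by cmd.

```python
import os
import tempfile

def read_cmd_unicode_output(path):
    """Read a file assumed to hold UTF-16LE text without a byte-order mark."""
    # newline="" keeps the CR LF line endings exactly as written.
    with open(path, encoding="utf-16-le", newline="") as f:
        return f.read()

# Hand-built stand-in for one line of redirected output.
sample = "Ç\r\n".encode("utf-16-le")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "files.txt")
    with open(path, "wb") as f:
        f.write(sample)
    listing = read_cmd_unicode_output(path)

print(sample)          # b'\xc7\x00\r\x00\n\x00'
print(repr(listing))   # 'Ç\r\n'
```

If the file does turn out to start with a byte-order mark, opening it with encoding="utf-16" instead lets Python consume the mark and pick the byte order from it.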

This has serious consequences for batch files.

Batch files are 8-bit files and are interpreted according to the OEM character set. This means that if you write a batch file with Notepad or some other program that uses the ANSI character set for 8-bit files, and your batch file contains extended characters, the results you get will not match what you see in your editor.
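
A rough Python model of the mismatch, assuming the editor saves in code page 1252 and the command processor reads in code page 437 (the US-English pair). The function as_seen_by_cmd is a made-up helper for the illustration, not anything the command processor exposes.

```python
def as_seen_by_cmd(text, editor_cp="cp1252", console_cp="cp437"):
    """Model a batch file line saved by an ANSI editor and read as OEM."""
    return text.encode(editor_cp).decode(console_cp)

line = "echo Ça va"

# What the command processor would effectively see.
seen = as_seen_by_cmd(line)

# One way around it: save the batch file in the OEM code page instead.
oem_bytes = line.encode("cp437")
round_trip = oem_bytes.decode("cp437")

print(seen == line)        # False
print(round_trip == line)  # True
```

The ASCII part of the line survives either way; only the extended character changes.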

Why the discrepancy between GUI programs and console programs over how 8-bit characters should be interpreted?

The reason is, of course, historical.

Back in the days of MS-DOS, the code page was what today is called the OEM code page. For US-English systems, this is the code page with the box-drawing characters and the fragments of the integral signs. It contained accented letters, but not a very big set of them, just enough to cover the German, French, Spanish, and Italian languages. And Swedish. (Why Swedish yet not Danish and Norwegian I don't know.)

When Windows came along, it decided that those box-drawing characters were wasting valuable space that could be used for adding still more accented characters, so out went the box-drawing characters and in went characters for Danish, Norwegian, Icelandic, and Canadian French. (Yes, Canadian French uses characters that European French does not.)

Thus began the schism between console programs (MS-DOS) and GUI programs (Windows) over how 8-bit character data should be interpreted.

Comments (47)
  1. Dave says:

    I presume the "^" character in the command-line example above is a typo…

  2. ^ says:

    cmd /U /C dir ^>files.txt

    Why do you use a ^ to make the > literal?

    The result is the same (in this case).

  3. Ben Hutchings says:

    Would I be right in guessing that "OEM" refers to the fact that the character encodings used by DOS were effectively chosen by the PC manufacturer (OEM) and burnt into the video card ROM?

    "The command processor has an option (/U) to generate all piped and redirected output in Unicode rather than the OEM code page."

    How does this work? Does the command interpreter pipe the output through itself?

  4. scritch says:

    What are the characters that are in Canadian French and not in European French ?

  5. Obviator says:

    So then why didn’t we get two consoles? One for running MS-DOS programs, one for doing real work (compiling etc). To avoid confusion, the former could be restricted to running 16-bit DOS executables, and the latter to 32-bit Win command line programs. (But that isn’t a necessity.)

    Having said this, I would bet that a decent implementation of console for Windows is somewhere to be found on the ‘net, using the ANSI code page and everything.

  6. Mike R says:

    There are 2: The 32-bit cmd.exe, and the 16-bit command.com.. :)

  7. Kim Gräsman says:

    As a Swede with a dotted-character name, this really messes up my console apps :-)

    Shouldn’t I be able to use the CP command, or similar, to set the ANSI codepage on my command prompt to get rid of the discrepancy?

    Can anyone provide details?

    Thanks,

    – Kim

  8. The funniest thing about the OEM code pages is who the original OEM was — it was IBM. Ours were mostly cooler and definitely covered more languages, though….

    More info on the ACP/OEMCP split at http://blogs.msdn.com/michkap/archive/2005/02/08/369197.aspx and an anecdote from Helen Custer about where the ANSI term came from here at http://blogs.msdn.com/michkap/archive/2005/03/01/382289.aspx . :-)

  9. AC says:

    Canadian French puts accents on capital letters. French French doesn’t, which leads to ambiguity in all-caps newspaper headlines (peche vs pêche vs péché) but looks more beautiful on the printed page.

    I think IBM just forgot about the Ø character. Their people were certainly embarrassed enough about the omission at the time!

  10. Chris Lundie says:

    Don’t know if this is the right answer, but Canadian French uses upper-case vowels with accents, which is kind of unique.

    Also I was surprised to learn that in French, you are supposed to put a non-break space in front of a colon. I always wondered why that character existed.

  11. Ben Hutchings says:

    Michael: I heard that MS had quite a bit of input into the design of cp437 though. IBM and MS cooperated very closely until MS decided to focus on Windows instead of OS/2.

    Obviator, Kim: Use the chcp command to change the code page used by the command interpreter and its console. You probably want code page 1252.

  12. Jonathan Perret says:

    That French capitals should not be accented, in France or elsewhere, is an unfortunately very common misconception, relayed by many a clueless schoolteacher.

    I can only point interested people over there (French only, though I doubt this is a problem for the aforementioned) :

    http://www.langue-fr.net/d/maj_accent/maj_accent.htm

    Re the console CP : UTF-8 is the future. Use CHCP 65001 !

  13. Raymond Chen says:

    The ^ emphasizes that the redirection is processed by the inner command processor.

  14. " As a Swede with a dotted-character name, this really messes up my console apps :-) "

    My last name in the original German also has an A-umlaut. Are your spelling rules the same as German ones for eliminating it?

  15. Mihai says:

    "Canadian French uses upper-case vowels with accents, which is kind of unique."

    This seems to be related to the fact that OEM code pages did not have accented uppercase vowels. So the French user lived with what he had. It was incorrect, but it was all that was available. I have seen this with other languages too.

    "Also I was surprised to learn that in French, you are supposed to put a non-break space in front of a colon."

    And not only colon. Empirical rule is "before any punctuation with two elements and inside quotes". This means before : ; ? !

    And the French use chevrons for quotes, like this « quoted » text.

  16. Mihai says:

    You can change the code page of the console using chcp and change the font to a non-raster font.

    For Latin 1 use chcp 1252 and Lucida Console.

    Then you can "type" a file and it looks OK.

    You can do the same from code (SetConsoleOutputCP for output and SetConsoleCP for input).

    And small correction to the previous post "before any punctuation with two elements, and inside quotes" (comma added :-)

  17. I’ve always wondered why Microsoft has a fetish for calling its text encodings "ANSI" and "OEM", instead of using the IANA standardized names.

    http://www.iana.org/assignments/character-sets

    Doesn’t OEM stand for "Original Equipments Manufacturer" ?

    Also, isn’t the ANSI family of text encodings based on an early version of what later became standardized as "iso-8859-*" ?

  18. Wesha says:

    I just wonder who was that… ummm… insightful person in Windows development team who decided to invent YET ANOTHER codepage for Cyrillic when developing windows? We ALREADY had three — national standard KOI8-R, MS-DOS standard CP-866 and Apple’s MacCyrillic. And of course M$ couldn’t just take the national standard — or at least its own DOS standard; that’d be way too rational. You had to invent a totally new one, like we didn’t have enough already. How characteristic.

  19. Raymond: I might add that you have a UTF-8 problem in the "name" form field.

    My name is pre-entered as ‘Christoffer "Kreiger" Hammarström’ when the URL ends with "?Pending=true", and as normal when the URL doesn’t.

  20. Raymond Chen says:

    I don’t run the web site. If you have feedback about the server software, you can send it to Scott W.

  21. Re: ANSI vs. OEM:

    Don’t try to read too much into them. OEM is the DOS code page that the BIOS uses for FAT filesystems. Thus it’s set by the hardware manufacturer.

    The use of the term "ANSI" for the other codepage is probably a misnomer; it’s just a variable which can be set to any of a number of code pages.

    Re: Unicode vs. UCS-2 vs. Utf-16:

    We still have this problem a lot. Most people don’t know anything more about Unicode than it’s more complex than ASCII was and maybe less complex than MBCS was, and it takes two bytes per character. People that I try to explain the truth to usually get a glazed over look in their eyes and ask something like, "yeah, well, that’s interesting. What do I really have to worry about?"

    The official story is that as of Windows 2000, we consider the two-bytes-per-cell encoding to now be Utf-16 rather than UCS-2 because the core rendering and UI pieces changed to deal with surrogate pairs then. On the other hand, very little software will avoid splitting surrogate pairs or combining diacritics so YMMV.

  22. Matt says:

    One time, I was surprised to find out that the Microsoft C compiler converts Unicode strings from ANSI to Unicode (or is it OEM to Unicode :-) So, if you create an array like TCHAR array[] = { 0x80, 0x81, 0x82, 0x83 }; the result won’t be what you expect!

  23. Jon Potter says:

    That’s because the definition of TCHAR changes depending on whether UNICODE is defined or not. Nothing surprising there…

  24. Norman Diamond says:

    As far as I can tell from old experiments booting Windows 9x in real mode and watching Windows 2000 etc. during cold boot and during wakeup from hibernation, there are more than one OEM code page. At least one is a US OEM code page (is that 437?) and one is a Japanese OEM code page (932). The Japanese OEM code page is identical to the Japanese ANSI code page so we don’t have problems copying between Notepad and MS-DOS command prompts in any Windows 9x or NT-based system. But this also means we don’t really know what the difference is until Mr. Chen teaches us ^_^

    However,

    > when you open an 8-bit text file in Notepad,

    > it assumes the ANSI code page

    But which ANSI code page does it assume? As you pointed out in a previous blog posting, sometimes it even assumes an ANSI code page which the user has never used and the user might not even have installed fonts for it.

    3/8/2005 2:55 PM Michael Grier [MSFT]

    > OEM is the DOS code page that the BIOS uses

    > for FAT filesystems.

    Huh????? Since when does the BIOS examine filesystem structures and look for filenames? And since when is the assumed ANSI code page for filenames chosen by a hardware manufacturer instead of the chosen by whichever language version of Windows is installed?

  25. foxyshadis says:

    The comment probably needs to be clarified; OEM is the original DOS code page that was at the time burned into video memory around $C000 (I believe?). I don’t think it was in the bios, unless the video read it out of the bios, the bios just sent raw binary to the video processing chip, which used its burned in code page (actually bitmaps) to render characters, save some reserved ones <32.

    Eventually it was definitely moved into the OS, copied to ensure minimal change from the huge number of DOS apps then in existence.

    Thankfully, with unicode we can once more play text mode cards with proper suit pictures in any code page. ^_~

  26. Isaac Chen says:

    Christoffer: "It seems Microsoft most often says Unicode to mean UCS-2"

    IIRC, it should be UTF-16, at least in Windows 2000+ systems.

  27. Also, a (non-Microsoft-specific) pet peeve of mine is when someone refers to "Unicode", without qualifying *what* Unicode-encoding is used. UTF-8? UCS-2? UCS-4?

    It seems Microsoft most often says "Unicode" to mean "UCS-2".

  28. Mihai says:

    For Norman Diamond:

    Windows terminology:

    ANSI Code Page=System Active Code Page (ACP)

    OEM Code Page=default console code page

    Both of them can be double byte or non-Latin 1.

    See Michael Kaplan’s blog:

    http://blogs.msdn.com/michkap/archive/2005/02/08/369197.aspx

  29. The BIOS comment was (mostly) wrong; however the OEM code page is set by the OEM. FAT (12, 16 and 32) store non-extended directory entries in the OEM code page. I don’t believe it’s possible to change the OEM code page easily; specifically it will mess up the character sets used on your FAT drives.

    LFN directory entries are stored in two-byte-per-character encoded form. If the filename does not require a short vs. long name it is only stored as a SFN, encoded in the OEM code page. (This is from reading the FAT32 source code on NT; someone with more history should jump in here and correct my errors and fill in holes.)

    The OEM code page is associated with the hardware so that when booting between various OSes, they can all agree on the code page used for the FAT volumes.

    There is assumed to be a single system-wide OEM code page so anything reasonable like recording the code page in the FAT metadata is not done.

    You can probably have some real fun with FAT removable media in this way going to machines which have different code pages.

  30. Maxime LABELLE says:

    Jonathan Perret: That French capitals should not be accented, in France or elsewhere, is an unfortunately very common misconception

    It is worth mentioning that this is all the more true. I invite french readers to lookup the official position of the Académie Française on this topic:

    http://www.academie-francaise.fr/langue/questions.html#accentuation

  31. Purplet says:

    Is this the reason why Alt+128 gives Ç and Alt+0128 gives € ?

    Is the leading 0 a way to select the codepage when inserting a character using the keypad ?

  32. Mats Gefvert says:

    I always put AutoRun="chcp 1252" in HKEY_CURRENT_USER\Software\Microsoft\Command Processor.

    I don’t see why I should bother with any other code page anyway… :)

  33. John Elliott says:

    foxyshadis: The original OEM codepage was in a ROM on the MDA/CGA cards and didn’t show up in the PC’s address space at all. I believe GRAFTABL was the first time it appeared in DOS.

  34. Timo Frenay says:

    Purplet: The leading 0 is to ensure backwards compatibility. Especially in the days of DOS when tools like Character Map weren’t easily available, people would memorize Alt+xxx codes that they used frequently. To prevent chaos when they would try to enter these codes in Windows, the leading 0 was introduced for the ANSI codepage.

    (Although I am a big fan of Character Map, I have memorized the Alt+0xxx codes for most of the European accented vowels.)

  35. Neil says:

    Now if only I could set the active code page to UTF-8 in a GUI application, then I could call the A APIs avoiding all the UTF-8 to UTF16 conversions necessary to call all the W APIs…

  36. Ben Hutchings says:

    Timo Frenay: You could make life easier for yourself by using MSKLC to create a keyboard layout with AltGr combinations for the accented vowels.

  37. Jonathan says:

    You Europeans have it easy…

    in the PC originally, the character set was indeed saved in the display card’s ROM (not 0xC000, somewhere in the 0xF000 segment (remember segments?)). Cards sold in Israel had to have their ROM specially programmed to have Hebrew letters (in the encoding now known as OEM codepage 862) – Cirrus-Logic-based cards were popular partly for this reason. I remember making my own TSR to set this.

    Same goes for text-mode printers (some ads specifically said "with burned Hebrew" – HP did this a lot).

    Later, DOS added support to codepages, and you could (through some config.sys lines) re-program the display to any codepage you want. You could also upload something to the printer, though that was always third-party.

    I never figured out how to display Hebrew in a modern console app, though.

  38. Sidoine says:

    It would be a good thing if all the console program designed for Windows would switch the code page when they start, and use the system code page (1252 for example). The problem is that by default the console is configured to use a font that is not unicode, so you can’t display Hebrew. But you can change the console font to lucida console and it works (change HKEY_CURRENT_USER/Console/).

    There is still a problem: I don’t know how to write in unicode. If I try to use the unicode code page (1200), wprintf doesn’t work.

  39. Michael J. says:

    Re: ANSI vs. OEM:

    > Don’t try to read too much into them. OEM is the DOS code page

    > that the BIOS uses for FAT filesystems. Thus it’s set by

    > the hardware manufacturer.

    This is also the page used by DOS-mode printers and by countless printer drivers for printers which do not support national character set. This is also the default encoding for plain text file.

    > there are more than one OEM code page. At least one is a US OEM

    > code page (is that 437?) and one is a Japanese OEM code page (932).

    Of course. Almost any non-latin alphabet country has its own codepage. Open any printer manual, and see how many codepages they support.

    > It would be a good thing if all the console program designed

    > for Windows would switch the code page when they start,

    > and use the system code page (1252 for example).

    Win16 I/O is built on top of DOS I/O and uses DOS conventions. Win32 simply had to support this, because there is only one filename entry per file, not one for DOS, one for Win16 and one for Win32. Now, wait: it is one for DOS and another for Windows in FAT32. Then DOS/Win3.x apps can have the good old codepage, and Win16-LFN/Win32 apps can have Unicode.

  40. Ben Hutchings says:

    Sidoine: The standard C library is intended to work with wide characters only internally. They are always converted to multibyte characters on output. If you want to write Unicode text to the console, you have two options that I can see:

    (1) Set the console to display UTF-8 (chcp 65001) and set the C library to convert to that on output (I don’t know how).

    (2) Write UTF-16 to the console with the Win32 function WriteConsoleW. Of course you’ll need to use WriteFile instead if output has been redirected.

    "chcp 1200" results in the error message "Invalid code page".

  41. Norman Diamond says:

    3/8/2005 8:53 PM Michael Grier [MSFT]

    > FAT (12, 16 and 32) store non-extended

    > directory entries in the OEM code page.

    If you mean that software, somewhere in a filesystem layer for FAT filesystems, uses the currently set OEM code page, then I think I agree. But you say "store", so I think you mean that the recorded structures on the disk drive say what code page was used in writing the short filenames in their FATs, and that is surely wrong.

    > I don’t believe it’s possible to change the

    > OEM code page easily;

    To change it where? If you’re talking about a copy burnt into video ROM then you’re obviously right, but…

    If you’re talking about in software: When Windows 9x is booting, while in real mode before loading the GUI and protected mode drivers, it does change the code page that is in use. If you don’t have a graphical logo being displayed then you can watch the text messages and watch the character set change. After Windows 9x is already running, and also in Windows NT-based systems, you can open a DOS-style command prompt. The "US" command and "JP" command change the code page for that window. There’s also a "MODE" command that seems able to change the code page for that window, and I think I once tried to select a European code page for it but didn’t get very far.

    If you’re talking about on a hard drive: I still do not believe that the information stored in a partition says what code page was used in writing short filenames.

    > specifically it will mess up the character

    > sets used on your FAT drives.

    It messes up the interpretation of filenames that are read from FAT drives, because the FAT does not say which code page was used, and the driver has to assume that the current Windows code page should be used.

    [Omitting one paragraph that I agree with.]

    > The OEM code page is associated with the

    > hardware so that when booting between

    > various OSes, they can all agree on the code

    > page used for the FAT volumes.

    This 100% does not happen. If you don’t tell various OSes what code page you’ve been using on a FAT volume, they sure do not agree. And in OSes where you cannot tell the OS what code page you’ve been using on a FAT volume, the OS will use the code page that it’s been assuming and you end up with a partition which cannot be fully accessed by any language version of such an OS.

    > There is assumed to be a single system-wide

    > OEM code page

    Some OSes do that … as I just mentioned …

    > so anything reasonable like recording the

    > code page in the FAT metadata is not done.

    Um, I’m glad to see that you understand that (really), but then … how is it possible that you wrote the nonsense that you did two paragraphs earlier?

    > You can probably have some real fun with FAT

    > removable media in this way going to

    > machines which have different code pages.

    Same in a single machine, and same in an internal drive. For example Microsoft used to support the possibility of installing different language versions of NT4 into different partitions in the same machine, and they could all view all partitions, but they could not access all files. Microsoft did not support the same with Windows 9x but did provide some downloads to assist users who wanted to do that in Windows 95 (it was tougher in Windows 98 because the installer hunted down and destroyed the registries of existing different language versions of Windows 98). They could view all FAT partitions but could not access all files. Scandisk could corrupt some filenames and could delete others but sometimes could not finish the job of deleting a filename that it corrupted.

  42. Kim Gräsman says:

    Joshua:

    "My last name in the original German also has an A-umlaut. Are your spelling rules the same as German ones for eliminating it?"

    I’m not sure we have any steadfast rules for substitution, but the following is what you generally see used:

    ä = ae

    ö = oe

    å = aa

    Judging by your last name, that seems to match German rules, at least partially.

    Thanks everyone for the heads-up on CHCP!

    – Kim

  43. Norman Diamond says:

    3/8/2005 11:12 AM Christoffer "Kreiger" Hammarström

    > I’ve always wondered why Microsoft has a

    > fetish for calling its text encodings "ANSI"

    > and "OEM", instead of using the IANA

    > standardized names.

    > http://www.iana.org/assignments/character-sets

    One reason might be because that page gives names for individual encodings of character sets but does not give names for two or three overall categories, at least not that I could see.

    A more generic reason for people possibly ignoring that page might be that IANA seems to ignore bug reports. (I didn’t even write cynically, it was my first correspondence with them and I had no reason to disrespect them, I hadn’t even bought any broken products from them and am not subject to product tying when wanting to buy hardware, so I didn’t even think of writing disrespectfully to them. Well, that was at the time of writing the bug report.)

  44. Lars says:

    Re the console CP : UTF-8 is the future. Use CHCP 65001 !

    CHCP 65001 indeed works quite well, but whenever I tried to run a batch file (pure ASCII!) from such a console, it never worked. There is no output, no error message, and the commands are not executed.

  45. It’s pointless today.

  46. Because it once was, though no longer is.

  47. Occasionally, somebody fails to pass.

Comments are closed.