Why is the default 8-bit codepage called “ANSI”?


Reader Ben Hutchings wanted to know why the 8-bit codepage is called "ANSI" when it isn't actually ANSI.

But instead of saying, "Oh well, some things mortals were never meant to know," he went and dug up the answer himself.

A quick Google for Windows ANSI misnomer found me exactly what I was looking for [pdf]:

"The term "ANSI" as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see "ANSI" or "A" used to signify Windows code page support."
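The divergence is concrete: ISO 8859-1 reserves 0x80–0x9F for C1 control codes, while Windows-1252 assigns printable characters there. A quick check (my illustration, using Python's built-in codec names):

```python
# Byte 0x93 is a C1 control code in ISO 8859-1,
# but a left double quotation mark in Windows-1252.
b = b"\x93"
print(repr(b.decode("iso-8859-1")))  # the control character U+0093
print(repr(b.decode("cp1252")))      # the left double quote, U+201C
```

The same byte, two different characters — which is exactly how documents labeled with the wrong one of these two names end up garbled.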

Comments (14)
  1. J. Edward Sanchez says:

    I remember that thread! In it, I explained in greater detail the difference between ISO-8859-1 and Windows Latin 1 (a.k.a. Windows-1252, or CP1252) — although I neglected to mention why the latter is commonly called "ANSI":

    http://blogs.msdn.com/oldnewthing/archive/2004/03/19/92648.aspx

  2. asdf says:

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/gdi/devcons_1t10.asp

    GetStockObject is missing NULL_PEN in the table for some reason.

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/shellcc/platform/shell/reference/functions/dragqueryfile.asp

    DragQueryFile: In the remarks, "Note that the index variable itself returns unchanged, and will therefore remain 0xFFFFFFFF". Duh, it’s passed by value.

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/shellcc/platform/commctls/updown/updown.asp

    It says "the full 32-bit range" on the bottom of the page but then it lists -0x7FFFFFFF to +0x7FFFFFFF. The full 32-bit range is actually -0x80000000 to +0x7FFFFFFF.

  3. Raymond Chen says:

    I think it’s time to make an "Unrelated comments" entry so people won’t have to hijack other entries…

    NULL_PEN: Odd indeed.

    DragQF: And yet people complain when the documentation doesn’t state the obvious.

    UpDown: I’ll have to check what the true range is.

  4. Mike Dimmick says:

    The ‘true range’ will depend on whether your processor does one’s- or two’s-complement arithmetic (although everything Windows currently runs on is two’s-complement). One’s complement has the odd property that you can actually represent -0 (it has the bit pattern 0xFFFFFFFF for a 32-bit number).
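On today's two's-complement hardware the asymmetry is easy to see; a short Python sketch (illustrative, using struct to force a 32-bit width):

```python
import struct

# -0x80000000 fits in a signed 32-bit int...
print(struct.pack("<i", -0x80000000).hex())  # 00000080 (little-endian)

# ...but +0x80000000 does not: the two's-complement range is asymmetric.
try:
    struct.pack("<i", 0x80000000)
except struct.error:
    print("0x80000000 overflows a signed 32-bit int")

# And in two's complement the all-ones pattern is -1, not "minus zero".
print(~0)  # -1
```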

    For a bonus point – why is Windows’ use of the term Unicode also a misnomer?

    My answer: Unicode refers to an abstract, logical coding of characters and character components. The physical two-byte-code-unit encoding used by Windows 2000 and earlier is UCS-2 (Universal Character Set encoded in units of 2 bytes) while that used by Windows XP and later is UTF-16 (Unicode Transformation Format, 16-bit). The difference is that UTF-16 introduces surrogates for characters whose abstract code is greater than U+FFFF – these surrogates use two encoding units of 16 bits each to represent a single Unicode code point.

    When Windows/MSDN documentation refers to Unicode, UCS-2 or UTF-16 is almost always the meaning intended. Windows doesn’t appear to support UTF-32/UCS-4 as a possible encoding.
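The surrogate mechanism described above is easy to observe by encoding a character above U+FFFF; a Python sketch (illustrative, using the built-in utf-16-be codec):

```python
# U+1F600 is above U+FFFF, so UTF-16 must use a surrogate pair:
# two 16-bit units standing in for one code point.
ch = "\U0001F600"
units = ch.encode("utf-16-be")          # four bytes = two 16-bit units
lead = int.from_bytes(units[:2], "big")
trail = int.from_bytes(units[2:], "big")
print(hex(lead), hex(trail))            # 0xd83d 0xde00
assert 0xD800 <= lead <= 0xDBFF         # leading surrogate range
assert 0xDC00 <= trail <= 0xDFFF        # trailing surrogate range
```

Under UCS-2 this character simply could not be represented at all, which is the practical difference between the two encodings.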

    To simplify (!) things, I refer to the traditional encodings as byte-oriented character sets – because there are characters encoded which only require one byte in the encoding. UTF-16 is a WORD-oriented encoding because each character requires a multiple of 2 bytes to encode (either a single 2-byte code encoding a single character, or two 2-byte codes making up a surrogate pair).

    In Windows documentation you’ll also see the terms SBCS, DBCS and MBCS (single-byte character set, double-byte character set, multi-byte character set). DBCS is really a misnomer because most DBCS sets have some characters encoded with a single byte. MBCS is a covering term for SBCS and DBCS.

  5. josh says:

Oh boy. At least with UCS-2 you knew that one base unit = one code point. I suppose it doesn’t make that much of a difference; nowhere does Unicode guarantee that one code point is one glyph or one basic lingual concept. Sure, now you can represent every language at once, but you still have to worry about slicing things. String handling sucks.
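The point that one code point is not one "basic lingual concept" shows up immediately with combining characters; a Python sketch (illustrative, stdlib only):

```python
import unicodedata

# "é" can be one code point, or a base letter plus a combining accent.
composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False: same text, different slicing
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

So even a "fixed-width" code-point view doesn't save you from worrying about where a user-visible character begins and ends.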

    and fwiw, C++ doesn’t guarantee that -0x80000000 will be in the range of a 32-bit integer either.

  6. Perry Lorier says:

and irritatingly Windows is quite happy to label its Windows character sets as "ascii" or "iso-8859-1" in things like email messages, or in web pages (both served by a web server and submitted by a web browser).

This makes non-Microsoft OSes show ?’s or square boxes all over the place as they encounter invalid characters.

  7. Ben Hutchings says:

    Perry: Microsoft’s applications used to do that but they now seem to be quite consistent in using the correct names like "windows-1252". You can also choose whether the standard or proprietary encoding is used: "Western European (ISO)" is ISO 8859-1 whereas "Western European (Windows)" is code page 1252.

  8. Mike Dimmick says:

At least with the UTF series, you can walk a string backwards. With UTF-8 (encoded using the canonical representation – it is possible to encode characters illegally) you can tell whether a unit represents a single code point, or a trailing byte, or the lead byte of a two-, three- or four-unit encoding. Single-byte code points are always under 0x80, trail bytes are between 0x80 and 0xBF, lead bytes of a two-byte encoding are between 0xC2 and 0xDF, three-byte between 0xE0 and 0xEF, and four-byte between 0xF0 and 0xF4 (higher lead bytes would encode past U+10FFFF). The pattern is basically (binary):

    0xxxxxxx = single byte

    10xxxxxx = trail byte

    110xxxxx = two-byte lead byte

    1110xxxx = three-byte lead byte

    11110xxx = four-byte lead byte

UTF-16 uses the values 0xD800 – 0xDBFF for the leading surrogate unit and 0xDC00 – 0xDFFF for the trailing unit. These values are reserved in the logical encoding.
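The byte-pattern table above is exactly what makes the backwards walk work: skip trail bytes until you hit something that isn't 10xxxxxx. A sketch in Python (illustrative; the function name is mine):

```python
def prev_codepoint_start(data: bytes, i: int) -> int:
    """Given an index i just past a code point in valid UTF-8,
    return the index where that code point starts."""
    i -= 1
    while data[i] & 0xC0 == 0x80:  # 10xxxxxx: trail byte, keep skipping
        i -= 1
    return i

# One character each of a 1-, 2-, 3- and 4-byte sequence.
s = "a\u00e9\u6f22\U0001D11E".encode("utf-8")
starts, i = [], len(s)
while i > 0:
    i = prev_codepoint_start(s, i)
    starts.append(i)
print(starts)  # [6, 3, 1, 0]
```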

  9. josh says:

    Meh. Most MBCS that I know of use one or two bytes for each character and you can at least identify lead bytes. It’s only marginally more difficult to walk a string backwards in that case. And you need to scan all your strings ahead of time to make sure they’re valid UTF before you can really take advantage of it.

  10. josh says:

    Wait, no, I guess you usually can’t… You can distinguish a lead byte from ascii, but not necessarily from a trail byte. :/
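That correction is right for Shift-JIS, the best-known DBCS: the lead-byte and trail-byte ranges overlap, so a byte seen in isolation can be either. A Python sketch of the ranges (illustrative; ranges per the standard Shift-JIS layout):

```python
# Shift-JIS lead bytes occupy 0x81-0x9F and 0xE0-0xEF;
# trail bytes span 0x40-0x7E and 0x80-0xFC. The ranges overlap,
# so when scanning backwards a byte like 0x83 could be either.
lead = set(range(0x81, 0xA0)) | set(range(0xE0, 0xF0))
trail = set(range(0x40, 0x7F)) | set(range(0x80, 0xFD))
overlap = lead & trail
print(hex(min(overlap)), hex(max(overlap)))  # 0x81 0xef
print(overlap == lead)  # True: every lead byte is also a legal trail byte
```

That is the structural property UTF-8 was designed to avoid: its trail-byte range (0x80–0xBF) is disjoint from every lead-byte range.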

  11. Because it once was, though no longer is.

Comments are closed.