Disclaimer: This is mostly my conjecture, so I could be completely wrong about some of this, but it seems plausible to me. I’m aiming for the general concepts here, not to start a discussion about the specific details of the history of code pages.
Taking a snapshot of the current Windows code pages (or any other code pages), one can wonder how some of them ended up in their current state. We might also wonder about other things, such as the peculiarities of a function call and other related behavior.
It is important to remember that modern computer systems evolved from earlier systems and “we”, as in the entire computer science community on the planet, have learned a lot since the beginnings of computer science. Rarely do we get the chance to start with “a clean slate” and redesign APIs or systems. Even when we do, we only have our best intentions and previous lessons to learn from, and sometimes those new designs prove to have weaknesses that weren’t originally seen.
In the DOS days of PC history, “code pages” were the byte values used to print directly to the console. Apple, Commodore, IBM, and probably many others mapped each byte to a character on the console. (Before that there were the values that showed up on Teletypes or punch cards, but I’m focusing on Windows history.) The US and “western” cultures seem to have had a great influence on the development of early PCs, and the ASCII standard was very common. Many future behaviors were based on ASCII or similar work.
ASCII only specified 7 bits of information, but since PCs had 8-bit bytes, most manufacturers extended the code pages to provide additional glyphs, such as diacritics or additional scripts (besides Latin). This provided the ability to represent many languages, but at a hidden cost to data portability. Since most data was confined to single companies and global exchange of data wasn’t a primary concern, this wasn’t a big problem at first.
Additionally, since these bytes were used to render glyphs on the screen, it seemed wasteful to ignore the non-printable control codes from 0x01 to 0x1F, so smiley faces, hearts, spades and the like were added.
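The portability cost is easy to see by decoding one byte under several code pages. This is a minimal sketch using the codecs that ship with Python's standard library; the byte 0xE9 and the three code pages are my own picks for illustration, not anything special in the history above.

```python
# One byte value, three different characters depending on the code page.
BYTE = b"\xe9"
CODE_PAGES = ["cp437", "cp850", "cp1252"]


def decode_everywhere(data: bytes, code_pages: list[str]) -> dict[str, str]:
    """Decode the same bytes under each code page and collect the results."""
    return {cp: data.decode(cp) for cp in code_pages}


results = decode_everywhere(BYTE, CODE_PAGES)
for cp, text in results.items():
    print(f"{cp}: U+{ord(text):04X} {text}")

# Python's cp437 codec maps the low control range to control characters,
# not to the smiley/heart glyphs that the original display hardware drew.
low_byte = b"\x01".decode("cp437")
print(repr(low_byte))
```

Note that the smiley faces were a property of how the display drew the byte, so a conversion table alone (like the one Python uses here) does not necessarily reproduce them.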
Users Want Their Glyphs:
As computing evolved users wanted more glyphs and several techniques evolved to solve that problem.
- Font changing was used by some systems (and continues to be used in some cases). Early DOS PCs effectively changed the font used for the display when they changed the “OEM code page”. Once multiple font use became common, this technique evolved to allow multiple glyph sets to be displayed in a single application merely by changing the font. In modern systems Unicode provides a Private Use Area (PUA) where users can put their custom glyphs, but font hacks continue to be used. The PUA solution doesn’t work on the console or for ANSI applications, so some groups have created font hacks that render the desired glyphs, even though their system uses a code page with different characters than those the font displays. The adoption of this technique ranges from users “playing” with the invented Klingon script to national “standards” attempting to make computers work for them where OEMs have been slow to create fonts or other solutions.
- Switching fonts is effectively built into the ISCII standard. The idea is that escape sequences select which font is to be used for the 8th-bit character ranges. Originally this included the idea of simple transliteration (by merely changing the rendered font), but this doesn’t seem to be used much in practice. This sort of standardizes the font changing technique. It is obviously an evolution beyond the early PCs that could only display a single font.
For CJK (Chinese, Japanese & Korean) scripts, 8 bit fonts aren’t enough. CJK code pages are usually still ASCII compatible, but they’ve evolved other techniques for rendering additional characters.
- Double byte code pages have the idea that a specific range of bytes are “lead bytes” and are to be followed by a “trail byte”. Combined the lead & trail bytes provide many additional characters. CP 932, etc. are examples of this technique.
- This idea was extended by GB18030 to provide additional lead bytes that indicate 4 byte sequences, allowing even more characters to be encoded.
- Shifting code pages are similar to ISCII in that they select additional modes. They typically “start” in an ASCII-like code page, but particular escape sequences cause the following bytes to be interpreted according to other rules. Generally these provide additional two byte or single byte sequences. Note that the shift sequences are typically single byte, even when currently in a double byte mode.
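The three CJK techniques above can be observed directly with Python's built-in codecs. This is only a sketch: the sample characters are my own choices, and the shifting example uses ISO-2022-JP, where the mode switch happens to be a three-byte escape sequence rather than a single shift byte.

```python
def byte_lengths(text: str, encoding: str) -> list[int]:
    """Return how many bytes each character of `text` takes in `encoding`."""
    return [len(ch.encode(encoding)) for ch in text]


# Double byte (CP 932): ASCII stays one byte, kana is lead byte + trail byte.
sjis = "Aあ".encode("cp932")
lead, trail = sjis[1], sjis[2]
print(sjis, hex(lead), hex(trail))

# GB18030: one-, two-, and four-byte sequences can share one stream.
gb_lengths = byte_lengths("A中😀", "gb18030")
print(gb_lengths)

# Shifting (ISO-2022-JP): the stream starts in ASCII, an escape sequence
# switches to a two-byte mode, and another escape switches back.
shifted = "Aあ".encode("iso2022_jp")
print(shifted)
```

In the shifted output every byte is below 0x80, so the same byte values mean different things depending on which escape sequence was seen last.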
Evolution of the Character Repertoire:
In addition to the evolution of techniques, the repertoire of supported characters has been evolving. Unfortunately the drivers of this process are rarely coordinated across the industry. As a need for a new character becomes apparent, organizations add it to the standards that they influence or control, but this doesn’t guarantee adoption across the industry, particularly if they don’t coordinate with other standards or organizations.
This repertoire evolution can cause the behavior of code pages to evolve as well. For example, the Euro was invented well after the creation of ASCII and most of the other code pages. Obviously it was needed, so it was added to most code pages, squeezing into unused spaces where possible. For single byte code pages that could mean replacing a rarely used code point. Of course, if a vendor had used that rarely used code point for something special in their application, then this caused behavioral changes.
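To see where the Euro was squeezed in, one can scan a few single byte code pages for it. A rough sketch with Python's bundled codecs; the helper name `find_euro` and the list of code pages are mine, chosen to pair older code pages with their Euro-enabled successors.

```python
def find_euro(code_page: str) -> list[int]:
    """Return every single-byte value that decodes to the Euro sign."""
    hits = []
    for value in range(256):
        try:
            if bytes([value]).decode(code_page) == "\u20ac":
                hits.append(value)
        except UnicodeDecodeError:
            pass  # this byte is undefined in the code page
    return hits


CODE_PAGES = ["latin-1", "iso8859_15", "cp850", "cp858", "cp1252"]
euro_slots = {cp: find_euro(cp) for cp in CODE_PAGES}
for cp, slots in euro_slots.items():
    print(cp, [hex(v) for v in slots])

# The characters that previously occupied the slots the Euro took over.
replaced_8859 = b"\xa4".decode("latin-1")  # generic currency sign
replaced_850 = b"\xd5".decode("cp850")     # dotless i
print(replaced_8859, replaced_850)
```

Any data that relied on the old meaning of those two byte values reads differently once it is interpreted with the newer code page.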
For other standards the repertoire evolution has meant evolving iterations of the standard. Several organizations add characters to their standards, but it can take a while for those to make it to the font vendor or other level necessary for complete support. Shifting standards can also change existing user data or private use behavior, so supporting new standards isn’t always a trivial undertaking.
Some character sets have been complicated by standards dependencies. For example, if a desirable standard assigns a bunch of characters and users want Windows support, then Windows has to find space in Unicode, since Windows is Unicode based. In the best case the desired characters are already assigned in Unicode, so Windows can “just” add font support (not necessarily trivial) and is good to go. Historically, however, characters are usually created by some other authority and may take a while to get official Unicode support. In those cases, the characters can remain unsupported, or someone can add PUA characters to support them until Unicode does.
If PUA characters are used to temporarily support additional characters, then there are additional problems when they are added to Unicode since existing data will need to be migrated from the PUA to the actual Unicode code point. Migration may also be complicated by the fact that all users may not be able to upgrade at the same time.
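A migration like that might look roughly like the sketch below. The mapping table is entirely hypothetical: I am pretending that some vendor parked two currency signs at U+E000 and U+E001 before they had real code points. A real migration would need the vendor's actual table.

```python
import unicodedata

# HYPOTHETICAL mapping from temporary PUA code points to the code points
# later assigned by Unicode. These pairs are invented for illustration.
PUA_TO_STANDARD = {0xE000: 0x20AC, 0xE001: 0x20B9}


def migrate_pua(text: str, mapping: dict[int, int] = PUA_TO_STANDARD) -> str:
    """Replace known PUA code points with their standard equivalents."""
    return text.translate(mapping)


def remaining_pua(text: str) -> list[str]:
    """List private-use characters (general category 'Co') still present."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]


old_data = "Price: \ue000 5 (\ue002)"
new_data = migrate_pua(old_data)
print(new_data)
print([f"U+{ord(ch):04X}" for ch in remaining_pua(new_data)])
```

The leftover U+E002 shows the harder part: characters with no agreed mapping stay in the PUA, and any machine that has not upgraded will still produce the old values.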
Another problem impacting the way code pages behave is how (and when) they’ve been implemented. Occasionally standards have had errors that were corrected in later versions. Other times a platform vendor may have interpreted the behavior in an unexpected way. Sometimes a font vendor for a common font could make an error with a code point. Additionally users may commonly confuse a glyph with a similar glyph and abuse the existing standard.
All of these contribute to variations in the way code page data is handled. Once data is coded in a particular way, correcting the data may be complicated. It can be easy to identify an implementation bug and find the “correct” solution, but making the fix can break existing behavior or data portability.
For historical reasons there are also some oddities in encoded data. Remember that code points were often merely glyphs on the computer screen? And those glyphs depended on the rendering of that machine? Well DOS used the \ character to delimit folders on the file system. CJK users however wanted to be able to type their currency symbol on their machines. Since people don’t use \ very often, it got replaced with the appropriate currency symbol on Asian machines. Internally it was always 0x5C however, and the machine always used that byte value to delimit folders. The end result is a mess where 0x5C doesn’t convert to Unicode very well, where users have different file system delimiter characters, and where fonts end up hacked to render ¥ instead of \ if you have a certain system code page. This is obviously really undesirable, yet it is pretty obvious how this happened and pretty difficult to “fix” at this time.
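The split between the byte value and the glyph can be modeled in a few lines. This is a toy sketch, not how any real font or file system is implemented: the two glyph tables are hypothetical stand-ins for locale-specific fonts, and `render` and `split_path` are names I made up.

```python
# HYPOTHETICAL glyph tables standing in for locale-specific fonts.
GLYPHS_US = {0x5C: "\\"}
GLYPHS_JP = {0x5C: "\u00a5"}  # this font draws a yen sign for byte 0x5C


def render(raw: bytes, glyphs: dict[int, str]) -> str:
    """Draw each byte as the font would: override table first, else ASCII."""
    return "".join(glyphs.get(b, chr(b)) for b in raw)


def split_path(raw: bytes) -> list[bytes]:
    """The file system only ever looks at the byte value 0x5C."""
    return raw.split(b"\x5c")


path = b"C:\\DOS\\EDIT.COM"
print(render(path, GLYPHS_US))
print(render(path, GLYPHS_JP))
print(split_path(path))

# Python's cp932 codec converts 0x5C to U+005C, so the yen sign a user saw
# on screen is not what arrives in Unicode.
print(path.decode("cp932"))
```

Both users have the same bytes on disk and the same folder structure; only the picture on the screen differs, which is exactly why no single conversion to Unicode satisfies everyone.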
I find it helpful to remember this stuff when confronted with another code page oddity. One of my goals is to reduce any further complexity in this evolutionary tree of code pages. It is often “clear” what the desired or proper behavior should be when you consider only current standards or when you know the lessons we’ve already learned. With a living system of data it isn’t always possible to get from the current state to the perfect state in a simple manner without causing pain to some users. In that case I try to limit the long term pain, and confine the problems to as few users as possible.
Similar examples exist for nearly all API sets, programming languages, OSes, and techniques. The global “we” of computer science has learned a lot and continues to learn a lot, but sometimes it’s helpful to remember how an API may have evolved when it doesn’t seem to be doing the most appropriate thing.
For code pages this is a good reason to use Unicode. Windows is natively Unicode and most other systems understand it. It is also reasonably unambiguous, although it does have its own evolutionary quirks. By focusing on a single encoding (Unicode), we can reduce the complexities caused by natural variations introduced as encodings evolve.
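As a small illustration of that point, here is a sketch that checks which encodings can hold a short mixed-script string. The sample string and the list of encodings are my own choices; `can_encode` is a helper I made up for the example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "é€あ"  # Latin with diacritic, Euro sign, Japanese kana
ENCODINGS = ["cp437", "latin-1", "cp1252", "cp932", "utf-8", "utf-16"]
support = {enc: can_encode(sample, enc) for enc in ENCODINGS}
for enc, ok in support.items():
    print(f"{enc:8} {'ok' if ok else 'cannot represent the sample'}")
```

Each legacy code page covers part of the string, but only the Unicode encodings cover all of it without switching code pages mid-stream.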
Reminder: This is mostly my conjecture and seems reasonable to me, although it might be wrong or lack specifics.