How Come My "ț" (or Another Character) Doesn't Work in Code Page XXX?

06/04/2008

First of all, as I always suggest, use Unicode when practical :) Then you don't run into these kinds of problems.

The "thing" to remember about code pages in general is that they were an early way to get characters to display in a readable way on CRT MS-DOS displays, or, before that, on teletypes and such. ASCII is a common representation, but it only needs 7 bits, so software developers realized that the 8th bit of each byte wasn't being used and extended ASCII in several standard, and not so standard, ways. Usually those extensions were for characters that someone thought were useful, but then other users discovered that some characters didn't "work" for their language and invented a variation of a code page. After all, all you had to do was change a bitmap font. Oftentimes subtle distinctions between characters were lost, or users "made do" with the closest code page.
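To make the "8th bit" point concrete, here is a small sketch that decodes the same byte under three different code pages. It assumes Python's built-in `cp1250`, `cp1251` and `cp1252` codecs are a fair stand-in for the Windows code pages of the same numbers, and `describe` is just a helper name made up for this illustration.

```python
import unicodedata

def describe(byte_value, codec):
    """Decode one byte under the given code page and describe the result."""
    ch = bytes([byte_value]).decode(codec)
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# The byte 0xFE has the 8th bit set, so its meaning depends on the code page.
for codec in ("cp1250", "cp1251", "cp1252"):
    print(codec, describe(0xFE, codec))
# cp1250 U+0163 LATIN SMALL LETTER T WITH CEDILLA
# cp1251 U+044E CYRILLIC SMALL LETTER YU
# cp1252 U+00FE LATIN SMALL LETTER THORN

# A byte in the 7-bit ASCII range means the same thing in all three.
print(describe(0x41, "cp1250"))
# U+0041 LATIN CAPITAL LETTER A
```

Same byte, three different characters, which is exactly why data that moves between code pages without a label tends to turn into garbage.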

Sometimes the behaviors were pretty much a "hack". Some code pages represented right-to-left text, like Arabic, in a left-to-right fashion since the computer systems of the time didn't really understand the concept of right-to-left text. On the CRT, in addition to using the 8th bit, MS-DOS reused code points 1-31 for "symbols", since the ASCII values there were invisible concepts like "bell" and "return". That hack allowed for card suits and console card games. Commodore did something similar with their PET fonts.

So what's this got to do with a Romanian ț (U+021B "Latin small letter t with comma below")? Well, code page 1250, "Eastern Europe", has a code point "0xfe" for ţ (U+0163 "Latin small letter t with cedilla"). I'm not sure of all of the history behind these characters, but they are different in appearance. I don't know if the 1250 ţ was originally intended for use with Romanian, but a cedilla isn't a comma below, and this distinction caused a separate Unicode code point to be created. Code page 1250, however, still has its original meaning, and U+021B isn't in it.
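You can see the difference between the two characters with a few lines of Python. This is only a sketch: it uses Python's `cp1250` codec table rather than the Windows conversion APIs, whose fallback behavior may differ, and `fits` is a hypothetical helper name.

```python
T_COMMA = "\u021b"    # Latin small letter t with comma below
T_CEDILLA = "\u0163"  # Latin small letter t with cedilla

def fits(ch, codec="cp1250"):
    """Return True if the character has a code point in the given code page."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

print(T_CEDILLA.encode("cp1250"))  # b'\xfe'
print(fits(T_CEDILLA))             # True
print(fits(T_COMMA))               # False
print(fits(T_COMMA, "utf-8"))      # True
```

The cedilla form lands on 0xFE as described, while the comma-below form has nowhere to go in 1250.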

So, for Romanian, if you really want to use the "correct" U+021B character, then 1250 won't work (nor will any other non-Unicode code page). Those kinds of subtle (to non-Romanians) issues are why it's best to have Unicode applications and data stores. We can't really change the behavior of code page 1250, or else someone else's usage (even a Romanian application making do with the cedilla) will break.
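As a rough illustration of why the data store matters, the sketch below round-trips a Romanian word through a Unicode encoding and through code page 1250. It assumes Python's codecs again, and it picks `errors="replace"` purely for the demo; a real application or API may substitute something else, or fail outright. `round_trip` is a made-up helper name.

```python
# "țară" (Romanian for "country"), spelled with U+021B and U+0103.
word = "\u021bar\u0103"

def round_trip(text, codec):
    """Store text in the given encoding and read it back.

    Characters the encoding can't represent are replaced with '?'.
    """
    return text.encode(codec, errors="replace").decode(codec)

print(round_trip(word, "utf-8") == word)   # True
print(round_trip(word, "cp1250") == word)  # False
print(round_trip(word, "cp1250")[0])       # ?
```

The ă survives the trip through 1250 because that character is in the code page, but the ț comes back as a question mark.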

