What is this magic setting that synthesizes Unicode from non-Unicode?


Commenter dan g. wonders how Windows can treat non-Unicode applications as Unicode via the Regional and Language Options control panel, specifically the part that lets you choose the Language for non-Unicode programs. "Having always believed that the only way to display, say, Chinese characters correctly was to compile with _UNICODE, this facility seems all the more remarkable."

This setting is really not as magical as it appears. (After all, we had Chinese versions of 16-bit Windows that displayed Chinese characters just fine, and they certainly didn't use Unicode since Unicode hadn't been invented yet.) Michael Kaplan went through this and many other settings in the Regional and Language Options control panel, and from the chart at the top of the page, you can see that what Windows XP calls the Language for Non-Unicode Programs used to go by the name Default System Locale. The old name does a better job of describing what the setting actually does but a worse job of describing what it's used for.

In Win32, three character encodings have special status. Unicode (more precisely, UTF-16) of course is what Windows uses internally. There are also two 8-bit code pages: CP_ACP, the so-called ANSI code page (even though it isn't actually ANSI), and CP_OEMCP, the so-called OEM code page (even though it isn't provided by the OEM).

When a non-Unicode program calls a function like TextOutA to display a string represented in the ANSI code page, the string is converted to Unicode via the CP_ACP code page. The Language for non-Unicode programs setting controls what code page CP_ACP corresponds to. On U.S. systems, it's typically code page 1252, but you can change it via that control panel. And that's where it becomes possible to display Chinese characters without using Unicode.
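Here is a rough model of that conversion, sketched in Python rather than Win32. The helper name ansi_to_utf16 is hypothetical, and the code page is passed in explicitly because Python's codecs don't consult the control panel setting. It is an illustration of the idea, not a description of how Windows implements it.

```python
def ansi_to_utf16(data: bytes, ansi_code_page: str) -> bytes:
    """Model the A-to-W step: interpret the bytes in the given
    code page, then re-encode the result as UTF-16LE."""
    text = data.decode(ansi_code_page)  # bytes -> Unicode text
    return text.encode("utf-16-le")     # Unicode text -> UTF-16 code units


# "café" as a non-Unicode program would store it under code page 1252.
raw = b"caf\xe9"
print(ansi_to_utf16(raw, "cp1252").hex(" "))
```

The program only ever handles the four bytes in raw. Which Unicode characters they turn into is decided entirely by the code page chosen for the decode step.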

For example, code page 950 is a double-byte code page commonly seen in countries that use traditional Chinese characters. It can represent the English alphabet of A-Z, and through the use of double-byte characters can also represent a wide array of traditional Chinese characters, such as this block of characters which are represented by byte sequences of the form B3 40 through B3 FE. If the ANSI code page is code page 950 and you pass data formatted for that code page to, say, the TextOutA function, the corresponding Chinese characters will display, even though the program itself doesn't use Unicode explicitly.
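To see the effect of the setting on one of those byte sequences, here is a small Python sketch. It assumes Python's "cp950" and "cp1252" codecs are faithful stand-ins for the corresponding Windows code pages; the variable names are made up for illustration.

```python
# The first byte pair of the block mentioned above.
data = bytes([0xB3, 0x40])

# With the ANSI code page set to 950, the pair is one double-byte character.
as_cp950 = data.decode("cp950")

# With the ANSI code page set to 1252, the same bytes are two
# unrelated single-byte characters.
as_cp1252 = data.decode("cp1252")

print(len(as_cp950), f"U+{ord(as_cp950):04X}")
print(len(as_cp1252), [f"U+{ord(ch):04X}" for ch in as_cp1252])
```

The bytes never change. Only the code page used to interpret them does, and that is the one thing the control panel setting selects.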

That's why it's called the Language for non-Unicode programs. It specifies which code page non-Unicode data should be interpreted in.

Comments (15)
  1. Dave says:

    When I’ve dealt with this in the past, I couldn’t understand how the apps that used CP_ACP could ensure that the character stream they were interpreting was in the code page that they expected. If I send a .txt file to a person in China, do they just go through code pages until it seems to display correctly?

    At least for Unicode and UTF-8 there are marks at the front of the file that disambiguate. Assuming the app put them there, and the app reading them knows to interpret rather than display them. Aw crap, now you’ve brought back the voices in my head.

  2. laonianren says:

    @Dave: Assuming you’re writing in English, if you send a text file to China it will likely be readable.  As far as I know, every Windows code page that you can select as the default code page contains ASCII as a subset.

    But if you receive a text file containing Chinese text you’ll have to guess what code page they used.

  3. Anonymous says:

    I’m not sure why UTF8 isn’t as popular for Windows apps as it is on Unix type OSes and many internet protocols.  Space efficiency for some languages is worse, but at least you can have an 8-bit encoding that can do the full range of unicode characters.  And you don’t have to recompile old programs (in most cases), keep track of if you want sizeof(foo) or sizeof(foo)/sizeof(*foo), make Foo a macro for FooA or FooW, etc.

  4. eff Five says:

    Dave,

    If this post brings “back the voices in my head” you really shouldn’t read Michael Kaplan’s aforementioned blog, especially the 1300+ posts he tagged as international programming, all of which are likely to do this for those with your affliction.

    Additionally you may want to view his warning (http://blogs.msdn.com/michkap/pages/7934999.aspx ) which regrettably doesn’t include "may cause voices in your head."

  5. Jonathan says:

    Anonymous:

    Windows NT uses UTF-16 almost everywhere – kernel, file system, Win32 API, etc. Presumably, that’s because it was developed before UTF-8 was created. As a result, it is quite natural to keep your own strings in UTF-16 as well – no need to convert when calling into Win32 (or higher-level APIs like COM or .NET). That is, unless you aim your code to be portable to non-Windows systems.

  6. Kaenneth says:

    "and the app reading them knows to interpret rather than display them"

    That’s why FEFF is used as the Unicode byte-order mark: it’s a zero-width non-breaking space, meaning it occupies no space, does not affect formatting, and is invisible. So it doesn’t matter if the application ‘shows’ it, as there is nothing to show.

    It screws up older text-file processors, though, if they depend on the first character in the file being something specific (like *NIX shell scripts).

  7. Cheong says:

    Miral: Agreed. At least I use AppLocale more often than the built-in Application Compatibility tab. (It does the compatibility job so well that there is almost no need to tweak these settings to run old programs.)

    I need to use old applications developed in an old version of Delphi that are native to Simplified Chinese (GB2312), plus a few Japanese (Shift JIS) games that run on a certain script engine, but as a Traditional Chinese user, I need to set the default locale to Big5 in order to read or run my old files (like old homework projects, fiction, and so on).

    Having an AppLocale option for these 3 languages in the context menu is handy for me. And I also wish for a switch to disable that dialog box (or they could add a checkbox for "Don't show this in the future").

  8. Miral says:

    This reminds me: I wish AppLocale were more integrated into the system (ie. the locale setting were just a standard property of all shortcuts).  And that you could disable that annoying dialog box.  (And yes, I know about the hacked version that doesn’t display the dialog box.)

    It’s admittedly not something that I need to do *often*, but every once in a while I get documents saved in ANSI format using a different codepage, or I need to run a non-Unicode program that’s expecting a different codepage (sometimes frequently).  Neither the global you-must-reboot setting nor the AppLocale danger-danger-this-is-just-temporary shortcuts quite cut it.

  9. Karellen says:

    Anonymous:

    UTF-8 in Windows is not widely used because it’s not available as a code page.

    Unfortunately, it can’t be made available as a code page as under Windows MB_LEN_MAX is 2. (Under most Unices, I think it’s around 6).

    Further, you can’t change MB_LEN_MAX as that’s a breaking ABI change. Existing apps compiled with char arrays of size (MB_LEN_MAX + 1) to encode a single multibyte character into will end up scribbling on other parts of the stack, with disastrous consequences.

  10. DoesNotMatter says:

    Mandatory dumb question: Why can’t Windows handle the language display on a case-by-case basis without a reboot?

    I.e., like the compatibility mode. Drop-down list, choose language, done.

  11. Anonymous says:

    In Win32, three character encodings have special status. […] There are also two 8-bit code pages: CP_ACP, […], and the CP_OEM code page […]

    Wine adds another one: CP_UNIXCP, which corresponds to the native code page (usually UTF-8).

  12. Anonymous Coward says:

    @Miral: agreed. I think it should be a standard part of Windows, as a dropdown list in the shortcut properties tab.

    It can’t be that hard to do. Does anyone here know how AppLocale works, internally? Then we could perhaps add a property page to shortcuts, is that possible? Or would a new shortcut format be a necessity?

  13. Alexandre Grigoriev says:

    @Karellen:

    UTF-8 is available as code page CP_UTF8 (starting from Win98 with MSLU), but it’s not associated with a locale.

  14. giant puppy says:

    "When a non-Unicode program calls a function like TextOutA to display a string represented in the ANSI code page, the string is converted to Unicode via the CP_ACP code page."

    Unless you specify something different from DEFAULT_CHARSET in CreateFont. See also http://blogs.msdn.com/michkap/archive/2005/05/13/417060.aspx

  15. Yuhong Bao says:

    "UTF-8 is available as code page CP_UTF8 (starting from Win98 with MSLU)"

    Actually starting with NT 4.

