About Unicode Enabling Applications and Code Pages

A few days ago I received an email from a friend and could not recognize any of the characters in it. It turned out that the web mail client lets the user set the encoding for each message, and its auto-select option had not chosen the correct encoding for this one. As a user, I had to pick the right encoding from the “More Actions” menu. It is a little surprising that this new web mail feature does not detect the message’s encoding correctly on its own.

This reminded me how important it is to Unicode-enable applications. Today’s software can have users who speak many different languages, so applications should be written with globalization in mind, using Unicode wherever possible. This GlobalDev article explains what Unicode is, how to Unicode-enable a Win32 application, and what the best practices are when writing Unicode applications. For .NET applications, MSDN has detailed documentation to help developers write world-ready .NET applications. Another article gives some nice code examples for Win32, the .NET Framework, and web pages.
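The garbled email is easy to reproduce. The following is only a minimal sketch in Python, not what the web mail client actually does: the message text and the choice of GB18030 and Windows-1252 as the "right" and "wrong" encodings are assumptions made for illustration.

```python
# Hypothetical message text; any non-ASCII text would show the same effect.
message = "你好, Harriet"

# The sender's mail client encodes the text with a legacy Chinese encoding.
raw_bytes = message.encode("gb18030")

# A reader that guesses a Western European code page produces unreadable text.
# errors="replace" substitutes a marker for bytes that cp1252 does not define.
garbled = raw_bytes.decode("cp1252", errors="replace")

# Choosing the right encoding by hand recovers the original text.
recovered = raw_bytes.decode("gb18030")

# With Unicode (UTF-8 here) on both ends there is nothing to guess.
round_trip = message.encode("utf-8").decode("utf-8")

print(ascii(garbled))        # escaped form of the mis-decoded text
print(recovered == message)  # the correct encoding restores the message
print(round_trip == message)
```

The bytes never change; only the interpretation does, which is why picking the encoding manually from a menu was enough to fix the message.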

Speaking of Unicode, one might wonder where the UTF-8 code pages are. As we know, code pages are installed under the %windir%\system32 folder and registered at HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage. Most code pages are table based and use the .NLS file extension. Their identifiers are typically less than 50000. Code pages with identifiers greater than 50000 are usually DLL based. For example, C_G18030.DLL, for code page 54936, provides Chinese character support (GB18030), and C_ISCII.DLL provides the code pages that support Indian scripts based on the Indian Standard Code for Information Interchange. You can find the code page identifiers used in Windows on MSDN. An interesting question is why we cannot find an entry for c_65000.nls or c_65001.nls in the registry. It turns out that code pages 65000 (UTF-7) and 65001 (UTF-8) are implemented in kernel32.dll.
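To make the identifiers above concrete, here is a small Python sketch that maps a few Windows code page numbers to the corresponding Python codec names. The table and the helper `decode_with_code_page` are hypothetical names introduced for illustration. The sketch uses Python's own codecs rather than the Windows NLS files or kernel32.dll, so it runs on any platform.

```python
import codecs

# Windows code page identifier -> Python codec name.
# Illustrative subset only; the table itself is an assumption of this sketch.
CODE_PAGE_TO_CODEC = {
    1252: "cp1252",    # Western European, table based on Windows
    54936: "gb18030",  # Chinese, DLL based on Windows (C_G18030.DLL)
    65000: "utf-7",    # implemented in kernel32.dll on Windows
    65001: "utf-8",    # implemented in kernel32.dll on Windows
}


def decode_with_code_page(data: bytes, code_page: int) -> str:
    """Decode bytes using the Python codec that matches a code page number."""
    codec_name = CODE_PAGE_TO_CODEC[code_page]
    return data.decode(codec_name)


sample = "Unicode 你好"
for code_page in (54936, 65000, 65001):
    # Look up the codec, encode the sample, then decode it by code page number.
    codec_info = codecs.lookup(CODE_PAGE_TO_CODEC[code_page])
    encoded = sample.encode(codec_info.name)
    print(code_page, codec_info.name, decode_with_code_page(encoded, code_page) == sample)
```

Code page 1252 is left out of the loop because it cannot represent the Chinese characters in the sample, which is exactly the limitation that Unicode encodings such as 65001 avoid.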


– Harriet
