Hacking Code Pages, or "How to Totally Hose Your Machine AND Your Data".

WARNING: If you do any of the things in this blog you could end up with majorly messed up data, an unusable system, or both. Nothing mentioned here is recommended or supported by Microsoft or by me. At the very least, if you encode data with a non-standard code page, you’ll end up with data no one else can read correctly.

Several people have wondered how to support a unique code page that they "need" (they "need" to use Unicode :-)).  I’ve blogged about the complexity of supporting code pages, alluding to why we don't support more, or variations of existing ones.

That doesn’t mean our users haven’t tried to figure out how to support their own code page.  Several people have figured out how our code pages work, and blogged about it.  We've documented some of it on MSDN.  This doesn’t mean that hacking your own code page is supported, and these techniques are likely to change in future versions or service packs of Windows.  I thought I’d blog about it anyway since many people seem to find the topic interesting.

There are roughly three ways that data can get converted to/from a code page/encoding on a Windows machine:  The .NET Framework has an Encoding class, Windows uses MultiByteToWideChar()/WideCharToMultiByte(), and some applications use MLang (deprecated but still needed in some cases).  Custom .NET Encodings are fairly trivial since you can just make your own Encoding class.  .NET also ships with its own code pages, so the rest of this stuff pertains only to the Windows MultiByteToWideChar()/WideCharToMultiByte() and MLang behavior.  For the most part, MLang tries to call MultiByteToWideChar()/WideCharToMultiByte() (MB2WC/WC2MB).

When MB2WC gets a code page to convert, it looks in the registry at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage to see how to convert the data. https://support.microsoft.com/kb/102987 and https://msdn2.microsoft.com/en-us/library/aa462210.aspx (CE page, but it’s similar) discuss this somewhat.

For a given code page number, there’s a value like “c_1252.nls” for 1252 or “c_iscii.dll” for 57002.  If a value exists in the registry, it’s considered “installed” and is usable.  Typically the dependent code page file (like c_1252.nls or c_iscii.dll) needs to be present in the %windir%\system32 folder as well.  MB2WC assumes that code pages < 50000 are table based and those > 50000 are dll based.  Some, such as UTF-8, are algorithmic and aren’t looked up in this manner.  Note that some values could map to the same .nls file as other values.  In this case the code pages are identical.
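The lookup rule above can be modelled in a few lines of Python. This is only a sketch of the behavior as described in this post, not the real implementation: the function names are made up for illustration, the set of algorithmic code pages lists only UTF-8 (65001) because that is the only one named here, and the post doesn't say what happens at exactly 50000. The registry part only reads values and returns an empty result when not run on Windows.

```python
import sys

# Code pages handled algorithmically rather than through a file lookup.
# Only UTF-8 (65001) is listed because it is the one named in the post;
# Windows may treat others the same way.
ALGORITHMIC = {65001}


def classify_code_page(cp):
    """Rough model of how MB2WC picks a conversion strategy (illustrative only)."""
    if cp in ALGORITHMIC:
        return "algorithmic"
    if cp < 50000:
        return "table"   # backed by a file such as c_1252.nls
    if cp > 50000:
        return "dll"     # backed by a file such as c_iscii.dll
    return "unspecified"  # the post does not cover exactly 50000


def installed_code_pages():
    """Return {code page number: file name} read from the registry.

    Read-only. Returns an empty dict when not running on Windows.
    """
    if sys.platform != "win32":
        return {}
    import winreg

    path = r"SYSTEM\CurrentControlSet\Control\Nls\CodePage"
    result = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        value_count = winreg.QueryInfoKey(key)[1]
        for index in range(value_count):
            name, data, _value_type = winreg.EnumValue(key, index)
            # Skip non-numeric values (ACP, OEMCP, ...) and empty file names.
            if name.isdigit() and data:
                result[int(name)] = data
    return result


if __name__ == "__main__":
    for number, file_name in sorted(installed_code_pages().items()):
        print(number, file_name, classify_code_page(number))
```

Running it on a Windows machine should list each numeric value under the CodePage key next to its file name, which is a harmless way to look at what is "installed" without changing anything.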

For table based code pages, MB2WC opens the appropriate table and does the appropriate mapping, so to add a code page we just have to add the appropriate value to the registry and the file to %windir%\system32.  You could see this by making an entry for 123 and mapping it to an existing file, or copying another file to c_123.nls.  MB2WC would then treat 123 just like any other code page.  By the same logic, if we renamed c_1252.nls to c_1250.nls then we’d get mixed up behavior for those code pages.

Note that code page names come from resources built into the system, so any self-made code page would have a missing resource name, which would likely cause problems in some applications.  (Which is part of the reason not to muck with code pages.)

Konstantin Kazarnovsky posted about the file structure at https://shlimazl.nm.ru/eng/nls.htm .  Frankly I haven’t really dug into it, but the file structure is fairly simple and should be easy to reverse engineer if you really wanted to.  In particular, single byte code pages have simple flat files and you could probably use a binary editor to do whatever you wanted.  https://me.abelcheung.org/2006/09/12/what-is-cp951/ states that there are some bugs in the other page’s description for double byte code pages.

One interesting thing about the table based code pages is that the tables include “best fit” mappings.  If you were particularly cautious about security you could edit these files to remove the best fit behavior and provide only unique mappings.
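To make the “best fit” concern concrete, here is a toy Python model. The two tables are invented for illustration and are not taken from any real .nls file; the point is just that a best fit entry silently turns one character into a different, similar looking one, while a table with only unique mappings refuses.

```python
# Hypothetical Unicode -> byte tables for an imaginary single byte code page.
# EXACT holds mappings that round-trip; BEST_FIT holds lossy look-alikes.
EXACT = {"A": 0x41, "a": 0x61, "\u00e9": 0xE9}   # A, a, e-acute
BEST_FIT = {"\u0100": 0x41, "\uff21": 0x41}      # A-macron and fullwidth A both become "A"


def encode(text, allow_best_fit):
    """Encode text with the toy tables, optionally using best fit entries."""
    out = bytearray()
    for ch in text:
        if ch in EXACT:
            out.append(EXACT[ch])
        elif allow_best_fit and ch in BEST_FIT:
            out.append(BEST_FIT[ch])
        else:
            raise ValueError(f"U+{ord(ch):04X} has no mapping")
    return bytes(out)


if __name__ == "__main__":
    print(encode("\uff21", allow_best_fit=True))   # the fullwidth A comes out as plain "A"
    try:
        encode("\uff21", allow_best_fit=False)
    except ValueError as err:
        print("strict table:", err)
```

The security angle is that a string containing the fullwidth character can pass a check that looks for a plain "A" and then become exactly that after conversion. Stripping the best fit entries corresponds to always taking the strict path here.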

It is also worth noting that the tables only hold single 16-bit UTF-16 code units, so it’s impossible for a table based code page to map to supplementary Unicode code points, which need a pair of code units (a surrogate pair).
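A quick way to see the limitation is to look at how many UTF-16 code units a character takes. The helper name below is just for illustration; the encoding itself is standard UTF-16.

```python
import struct


def utf16_code_units(ch):
    """Return the UTF-16 code units for a one character string as a list of ints."""
    data = ch.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


if __name__ == "__main__":
    # A character in the Basic Multilingual Plane fits in one 16-bit unit.
    print([hex(u) for u in utf16_code_units("\u00e9")])      # ['0xe9']
    # A supplementary character needs a surrogate pair (two units).
    print([hex(u) for u in utf16_code_units("\U00020000")])  # ['0xd840', '0xdc00']
```

A table with one 16-bit slot per entry has room for the first case but not the second.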

The ACP and OEMCP registry entries define the “system locale” and are kind of scary to change since they tell us which code page the system is using.  (Go into intl.cpl and change the “Language for non-Unicode programs” setting to see these change.)  You obviously could set these to non-standard values, but that impacts kernel mappings as well and could cause undesirable behavior, which is why we don’t allow custom mappings of the system locale code pages using Custom Cultures.  You could probably hack a custom culture .nlp file, but that’d be scary.  Note that the ACP/OEMCP have to map to table based code pages.

In addition to the table based mappings, MSDN ( https://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_NlsDllCodePageTranslation.asp ) points us to the NlsDllCodePageTranslation function used by the dll based encodings.  We ship 3 of these: c_is2022.dll for the ISO-2022-xx encodings, c_g18030.dll for the GB18030 encoding, and c_iscii.dll for the ISCII encodings.  It is probably more interesting to make a dll based encoding, since then the mystery of the code page tables doesn’t have to be solved and you can expand it to include supplementary characters.

For the dll based encodings you just have to implement and export the single NlsDllCodePageTranslation function, which, as MSDN says, needs to provide NLS_CP_CPINFO, NLS_CP_MBTOWC, and NLS_CP_WCTOMB behavior.  Those either provide code page information, or return multi byte to wide char or wide char to multi byte conversions.  Remember again that we might change this in future Windows versions.  For example, we added the ability to pass MB_ERR_INVALID_CHARS and WC_ERR_INVALID_CHARS in Vista.

One of the most interesting dll based uses would be to convert non-standard data to Unicode.  In that case you’d only need to implement the NLS_CP_MBTOWC flag.
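As a rough picture of what that one-way conversion amounts to, here is a Python model of a decode-only converter. To be clear about the assumptions: this is not a Windows dll and doesn’t touch NlsDllCodePageTranslation; the legacy encoding, its lead byte, and its tiny mapping table are all invented for illustration. It only shows the shape of the multi byte to Unicode direction, including a supplementary character that a table based code page couldn’t produce.

```python
# Hypothetical legacy encoding: ASCII bytes pass through unchanged, and a
# lead byte of 0x81 followed by a trail byte selects a character from PAIRS.
LEAD_BYTE = 0x81
PAIRS = {
    0x40: "\u4e00",       # a BMP character
    0x41: "\U00020000",   # a supplementary character (surrogate pair in UTF-16)
}


def legacy_to_unicode(data):
    """Decode bytes in the made-up legacy encoding to a Unicode string."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif b == LEAD_BYTE and i + 1 < len(data) and data[i + 1] in PAIRS:
            out.append(PAIRS[data[i + 1]])
            i += 2
        else:
            # Fail loudly instead of guessing, so bad data is noticed.
            raise ValueError(f"invalid byte 0x{b:02X} at offset {i}")
    return "".join(out)


if __name__ == "__main__":
    text = legacy_to_unicode(b"A\x81\x40\x81\x41")
    # Once the data is Unicode, store it as UTF-8 (or UTF-16) and stop
    # using the custom encoding.
    print(text.encode("utf-8"))
```

Raising on unknown bytes rather than substituting something is a deliberate choice in this sketch: for a one-time migration you generally want to find out about data that doesn’t fit the mapping.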

Michael’s also briefly mentioned the “custom code page” idea at https://blogs.msdn.com/michkap/archive/2006/07/05/656283.aspx 

If you do add a code page, then most applications probably won’t be able to use it.  If you work around that by spoofing an existing code page you’re entering dangerous territory with odd potential for incompatibilities.  In some places users have had difficulty merely because their community changed the idea of what the “best” system code page to use was, so using completely hacked code pages would be understandably more confusing.

Please remember that mucking with code pages can mangle your data and kill your system, so if you really feel inclined to play with these things, please use care and remember that it is completely unsupported and whatever you do will probably break in the next Windows version. If you must do this, I’d strongly recommend using your custom code page only long enough to convert your data to Unicode.