How do I get HKSCS-2004 characters from Big-5 in .NET?


Well, that’s pretty tricky.  We provide the Microsoft Character Code Conversion Routines For HKSCS-2004 functions, but those are intended for use with unmanaged code.

The fundamental problem is that these “HKSCS” characters were in use before Unicode assigned code points to them.  In order to support them, we mapped the Big 5 / Code Page 950 HKSCS characters to the Unicode Private Use Area (PUA).  So now the same characters exist as data in three forms: as Big 5 bytes, as PUA code points, AND as the assigned Unicode 5.0 code points.  The expectation is to use Unicode long term, so these functions were provided to help map old data to the new Unicode 5.0 code points.

Another way for a managed application to solve this problem would be to create your own Encoding and map the Big 5 code points to their new Unicode code points instead of the old code page 950 mappings.  It is nearly impossible for Microsoft to provide a patch to do this because some users have data in the old PUA code space and their applications would break if the data was suddenly migrated to the assigned HKSCS code points without them opting in.  Eventually “all” the interesting data should be migrated from the PUA code points to the Unicode HKSCS code points, but until then the problem remains.
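As a rough illustration of that idea, here is a minimal sketch of a wrapper Encoding.  It is not a Microsoft-provided class: the name HkscsRemappingEncoding and the puaToAssigned table are hypothetical, and the table itself has to be generated from the HKSCS-2004 mapping data (none is included here).  The sketch assumes the inner encoding (for example code page 950) decodes HKSCS characters to single BMP PUA characters, and that each replacement is at most two UTF-16 code units.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical wrapper: decodes with an inner encoding (e.g. code page 950),
// then replaces PUA characters with their assigned Unicode equivalents.
public sealed class HkscsRemappingEncoding : Encoding
{
    private readonly Encoding _inner;
    // Key: PUA character produced by the inner decoder.
    // Value: assigned Unicode text (a string, because the new code point
    // may be outside the BMP and need a surrogate pair).
    private readonly IReadOnlyDictionary<char, string> _puaToAssigned;

    public HkscsRemappingEncoding(Encoding inner,
                                  IReadOnlyDictionary<char, string> puaToAssigned)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
        _puaToAssigned = puaToAssigned ?? throw new ArgumentNullException(nameof(puaToAssigned));
    }

    // Replace every mapped PUA character; leave everything else untouched.
    private string Remap(string decoded)
    {
        var sb = new StringBuilder(decoded.Length);
        foreach (char c in decoded)
        {
            if (_puaToAssigned.TryGetValue(c, out var replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Decoding direction: inner decode followed by the PUA remap.
    public override int GetCharCount(byte[] bytes, int index, int count)
        => Remap(_inner.GetString(bytes, index, count)).Length;

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
                                 char[] chars, int charIndex)
    {
        string s = Remap(_inner.GetString(bytes, byteIndex, byteCount));
        s.CopyTo(0, chars, charIndex, s.Length);
        return s.Length;
    }

    // Assumption: a replacement is at most two UTF-16 code units.
    public override int GetMaxCharCount(int byteCount)
        => _inner.GetMaxCharCount(byteCount) * 2;

    // Encoding direction is passed through unchanged in this sketch, so
    // assigned code points are NOT mapped back to Big 5 bytes.
    public override int GetByteCount(char[] chars, int index, int count)
        => _inner.GetByteCount(chars, index, count);

    public override int GetBytes(char[] chars, int charIndex, int charCount,
                                 byte[] bytes, int byteIndex)
        => _inner.GetBytes(chars, charIndex, charCount, bytes, byteIndex);

    public override int GetMaxByteCount(int charCount)
        => _inner.GetMaxByteCount(charCount);
}
```

An application would opt in explicitly, for example with new HkscsRemappingEncoding(Encoding.GetEncoding(950), table).GetString(bytes).  On .NET Core and later, code page 950 is only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).  The sketch only handles whole buffers: it does not override GetDecoder, so a stream reader that splits a double-byte character across two reads would need a stateful decoder on top of this.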

The code samples and links from the “Microsoft Character Code Conversion Routines For HKSCS-2004” document would be a good starting point for generating the mappings needed to build an Encoding that moves code page 950 data to the new HKSCS code points.

Comments (3)

  1. Raymond Wong says:

    It seems HKSCS gets mapped to the Unicode PUA by default. After I have written my own Encoding class, mapping HKSCS to the new Unicode code points, is it possible to ask .NET to use my class by default?

  2. Raymond Wong says:

    It seems .NET maps HKSCS to the Unicode PUA by default. If I have written my own Encoding class, is it possible for .NET to use it by default? As .NET stores strings in Unicode/UTF8, I suspect that when retrieving data from a Big5-encoded database .NET carries out a conversion and maps those HKSCS characters to the Unicode PUA, whereas I want them mapped to the new Unicode code points. Please advise. Many thanks.

  3. shawnste says:

    It is not possible to use a different Encoding by default; the application would have to explicitly use that encoding.

    FWIW: Encoding.Default isn’t recommended because it’ll be different on different systems.