Pitfalls of Chinese Conversion (Part 2)

In the previous entry, we talked about Kernel32.dll and its LCMapString API, and I showed, with sample code, how to use the API to convert Simplified Chinese characters to Traditional Chinese characters and vice versa.

If we perform a simple test on the LCMapString API, we can find a limitation of the API. Let’s illustrate the limitation by running the Chinese Converter application:


We type the Traditional Chinese phrase “頭髮” (“hair on the head” in English) in the left text field of the WinForms application, and convert it to the corresponding Simplified Chinese characters.


The characters 头发 are displayed in the right text field. The conversion is correct!


Let’s clear the text fields and run another test. This time we type the Simplified Chinese characters 头发 in the right text field and convert them to Traditional Chinese characters.


The left text field now shows “頭發”. The text is not converted back to the original Traditional Chinese string “頭髮” as expected!

The conversion mistake arises because the mapping between Traditional and Simplified Chinese is not exactly one-to-one (although it is for most characters). Multiple Traditional Chinese characters may map to a single Simplified Chinese character! Here, both 髮 (hair) and 發 (to emit, to develop) simplify to 发, so the reverse conversion has to pick one of them without knowing which was meant.
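The many-to-one relationship can be sketched with a tiny hand-written table. The table and function names below are assumptions made for illustration; this is not the data or the algorithm LCMapString actually uses.

```python
# Hypothetical three-entry mapping table, for illustration only.
TRAD_TO_SIMP = {
    "頭": "头",  # U+982D -> U+5934
    "髮": "发",  # U+9AEE -> U+53D1 (hair)
    "發": "发",  # U+767C -> U+53D1 (to emit, to develop)
}


def to_simplified(text):
    """Map each character through the table; unknown characters pass through."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def build_reverse(table):
    """Invert the table, keeping every traditional candidate per simplified character."""
    reverse = {}
    for trad, simp in table.items():
        reverse.setdefault(simp, []).append(trad)
    return reverse


SIMP_TO_TRAD = build_reverse(TRAD_TO_SIMP)

print(to_simplified("頭髮"))   # 头发
print(to_simplified("頭發"))   # 头发 as well: two inputs collapse to one output
print(SIMP_TO_TRAD["发"])      # ['髮', '發']: the reverse direction is ambiguous
```

Traditional to Simplified is a plain lookup, while Simplified to Traditional has to choose among several candidates. A converter that works one character at a time can only pick a fixed default.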

In the 1950s, Mainland China began using Simplified Chinese characters to help increase literacy. Simplified character forms were created by reducing the number of strokes of Traditional Chinese characters. Most of the simplifications are based on popular cursive forms embodying graphic or phonetic simplifications of the traditional forms. However, some characters were simplified irregularly. Of course, a large portion of the characters were not simplified at all, and are thus identical in the Traditional and Simplified Chinese orthographies.

Japan also simplified a number of Kanji (Chinese characters) used in the Japanese language about half a century ago. The old forms are called Kyujitai Kanji (corresponding to Traditional Chinese), and the new forms are called Shinjitai Kanji. The Kanji simplification in Japan was in general less extensive than the simplification of Chinese in Mainland China. As the simplifications were carried out separately in Mainland China and Japan, some of the Kanji used in Japan now are neither ‘traditional’ nor ‘simplified’.

Code points for all of these characters (Traditional Chinese / Kyujitai Kanji and Simplified Chinese / Shinjitai Kanji) were included in the Unicode standard during the Han unification process! Separate code points were necessary because the linkage between simplified characters and traditional characters is not exactly one-to-one.

This also means that the existing methods of machine conversion between Simplified Chinese and Traditional Chinese may make some mistakes (although fewer mistakes occur when converting from Traditional Chinese to Simplified Chinese). If the system were intelligent enough to translate sentences using the context, the number of mistakes would be reduced!
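One simple way to use context is to look up whole phrases before falling back to single characters. The following is a minimal sketch under that assumption; the two small tables and the function name are hypothetical, and a real converter would need a far larger dictionary.

```python
# Hypothetical phrase and character tables, for illustration only.
PHRASES = {
    "头发": "頭髮",  # hair
    "发展": "發展",  # to develop
}
CHARS = {
    "头": "頭",
    "发": "發",  # default choice when no phrase matches
}


def to_traditional(text, max_len=2):
    """Convert Simplified to Traditional, preferring the longest phrase match."""
    out = []
    i = 0
    while i < len(text):
        # Try phrases from the longest allowed length down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += size
                break
        else:
            # No phrase matched here: fall back to the per-character table.
            out.append(CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_traditional("头发"))  # 頭髮
print(to_traditional("发展"))  # 發展
```

With the phrase table in place, 发 becomes 髮 inside 头发 and 發 inside 发展. An isolated 发 still falls back to the per-character default, so the ambiguity is reduced rather than removed.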

Comments (3)

  1. Ken says:

    Does the LCMapString API do any underlying encoding conversion?

  2. Daniel says:

    Interesting and good to know!

  3. Terry Sheng says:

    Thanks for your update. The API maps strings to and from Unicode based on the default Windows (ANSI) code page associated with the specified locale, i.e. it takes in a Unicode string and converts it to another Unicode string based on the dwMapFlags you pass to the API.

    In the first example I illustrated in the blog, the input string is 頭(U+982D) 髮(U+9AEE). The dwMapFlags value I set in the API is LCMAP_SIMPLIFIED_CHINESE, which means "map Traditional Chinese characters to Simplified Chinese characters". The output is thus the simplified form of the input string, i.e. 头(U+5934) 发(U+53D1). Please note the change in the output code points!

    The API converts not only between Traditional Chinese and Simplified Chinese, but also between some other sets of characters, e.g. Japanese Hiragana and Japanese Katakana.