GB18030 Character Set Support in SQL Server

Article
02/24/2009

Recently, I got several request related to GB18030 Character Set support in SQL Server. Here is one of the requests:

When I try to insert different combination of GB18030 characters in to a varchar column, it gives me unique constraint violation error.
insert myTable
values ('32066326-6E23-4E31-B226-9F8652CE11F3', N' 㐀㐁ᠠᠡᠢཌཌྷꀀꀁﺶﺷﺸ ', 'AC1', '')
insert myTable
values ('32066326-6e23-4e31-b226-9f8652ce11f7', N' ꀀꀁﺶﺷﺸ ', 'AC1', '')
does anyone know cause / solution / workaround?

Before going into the solution, I need to explain about what is GB18030 Character Set. Here, I copied the Unicode Faq on GB1830:

What is GB 18030?

GB 18030 is a new Chinese codepage standard that extends GB 2312-1980 and GBK (which itself is an extension of GB 2312-1980).

What is new in GB 18030?

It is a multi-byte encoding using 1-byte, 2-byte, and 4-byte codes. The 1-byte and 2-byte codes have the same assignments as in GBK, which itself is a superset of GB 2312-1980.
There are about 1.6 million valid byte sequences.
It is not possible to determine if a byte sequence is either 2 or 4 bytes long by just examining the lead byte - the second byte must be examined as well.

Why is GB 18030 important?

The Chinese Government has mandated that all applications released on or after 2001-Sep-01 must support GB 18030.

How does GB 18030 relate to Unicode?

The specification refers directly to a mapping of GB 18030 codes to and from ISO 10646/Unicode to define most character assignments. Some characters that used to be mapped for GBK to the PUA (Private Use Area) for Unicode 2.1 are now assigned in Unicode 3.0, and their mappings from GB 18030 use only the Unicode 3.0 code points.

In addition, GB 18030 defines roundtrip mappings for all 1.1 million Unicode code points including unassigned and non-character ones, but excluding single surrogates. This makes GB 18030 functionally very similar to a UTF.

So let me try to answer the customer's question. Remember varchar type always associates with a code page, and can only stored the characters defined in the code page. The code page which store Simplified Chinese is 936. Code page 936 corresponds to Chinese GBK standard, which is a subset of GB18030 Encoding. So it means that storing GB18030 characters in varchar type has potent data loss issue. On the other hand, since Unicode 3.0 and GB18030 can be roundtrip, so the best way to store GB18030 characters is use SQL Server's nvarchar type. Most of the new added characters in GB18030 map to the Unicode CJK Extension B block, while the block are all Supplementary characters which are represented with a surrogate pair. So supporting Gb18030 character becomes to the same as support Surrogate character, which already discussed in previous blog.

However, the nvarchar type use UCS-2 encoding, which is not the GB18030 encoding. My next point is related to Gb18030 encoding itself. What if user has GB18030 encoding data/files/scripts, the answer is that SQL Server can recognize this GB18030 encoded input, and convert them into Unicode, so you will no issue with these files/scripts/data.

References:

Ask Dr. International,Column #15, New Chinese Encoding GB-18030

GB 18030 WikiPage

GB18030 Characters in SQL Server

International Features in Microsoft SQL Server 2005

GB18030 Character Set Support in SQL Server

Additional resources