Why should we bother about Surrogate characer?

Today, someone ask following questions: 

1. We know that the Surrogate Characters have two bytes when counting, so when we design this character in DB table, we should set double size of your original type, shouldn't we?

2. When comparing string, if we can't use Binary collating sequence, then we will got different result from we think. So should we do some process to change collating sequence before comparing?

3. Since that Unicode will be supported limitedly, what should we pay our attention to?

 

To answer this question, I will briefly discuss the basic concept of Surrogate character and SQL Server's Unicode support. Unicode standard is a standard encoding to encode all character in the word.  In Unicode standard, a Unicode code point is represented by using a sequence of Hex value from U+0000 to U+10FFFF. These code points are categorized into 16 planes. A Plane is a range of 65,536 (1000016) contiguous Unicode code points. Planes are numbered from 0 to 16, with the number being the first code point of the plane divided by 65,536. Thus Plane 0 is U+0000...U+FFFF, Plane 1 is U+10000..U+1FFFF, ..., and Plane 16 (1016) is U+100000..10FFFF.  U+D800 to U+DFFF are reversed for encoding other planes in UTF-16 (discussed below).  We call planes 1 to 16 as Supplementary Planes, and characters defined in these plains as Supplementary Character.

The Unicode Organization defined following Unicode encoding forms: UTF-8, UTF-16, and UTF 32. Each encoding form maps the Unicode code points to unique code unit sequences. In UTF-32, each character is encoded with 4 bytes. In UTF-8, the character can be encoded by using one, two, three, or four bytes.  For UTF-16, we encode the plane 0 (U+0000 to U+FFFF) with 2 bytes, and 4 bytes for all other planes.   The 4 byte code point in UTF-16 should always follow rules: the first 2 bytes is in the range of 0xD800-0xDBFF) and the second Unicode character is in 0xDC00-0xDFFF.  We call such 4 bytes encode character as Surrogate pair. A surrogate pair represents a single Supplementary character; it should not be break into small piece.

What is defined in Supplementary planes?

Plane 1 currently has three allocation areas: a General Scripts Area at the beginning of theplane, containing various small historic scripts; a Large Historic Scripts Area, currently containing Sumero-Akkadian Cuneiform; and a Notational Systems Area, which currently contains sets of musical symbols, alphanumeric symbols for mathematics, and a system of divination symbols similar to those used for the Yijing .

Plane 2 consists primarily of one big area, starting from the first code point in the plane,
dedicated to more unified CJK character encoding. Then there is a much smaller area,
toward the end of the plane, dedicated to additional CJK compatibility ideographic characters—
which are basically just duplicated character encodings required for round-trip conversion
to various existing legacy East Asian character sets. The CJK compatibility
ideographic characters in Plane 2 are currently all dedicated to round-trip conversion for
the CNS standard and are intended to supplement the CJK compatibility ideographic characters
in the BMP, a smaller number of characters dedicated to round-trip conversion for
various Korean, Chinese, and Japanese standards.

SQL Server's Unicode support.

In SQL Server, we use the UCS-2 encoding for nvarchar/nchar types. The UCS-2 treats each 2 bytes as one Unicode character, and thus it treat a Surrogate pairs as two characters. As a result, all string functions which take position as input/output will return "wrong" result/break the surrogate pair when the input string contains surrogate pairs.   Here is the high level information about this topic:

  • Because surrogate pairs are considered to be two separate Unicode code points, the size of nvarchar(n) needs to be 2 to hold a single supplementary character (in other words, space for a surrogate pair).
  • Supplementary characters are not supported for use in metadata, such as in names of database objects. In general, text used in metadata must meet the rules for identifiers. For more information, see Identifiers in SQL Server 2005 Books Online.
  • Standard string operations are not aware of supplementary characters. Operations such as SUBSTRING(nvarchar(2),1,1) return only the high surrogate of the supplementary character's surrogate pair. The LEN function returns the count of two characters for every supplementary character encountered: one for the high surrogate and one for the low surrogate. However, you can create custom functions that are aware of supplementary characters. The StringManipulate sample in Supplementary-Aware String Manipulation, in SQL Server 2005 Books Online, demonstrates how to create such functions.
  • Sorting and searching behavior for supplementary characters may change depending on the collation. In the new 90_and BIN2 collations, supplementary characters are correctly compared, whereas, in older collations and standard Windows collations, all supplementary characters compare equal to all other supplementary characters. For example, the default Japanese and Korean collations do not handle supplementary characters, whereas Japanese_90 and Korean_90 do.

My suggestion to the question:

  • We know that the Surrogate Characters have two bytes when counting, we test that insert a Surrogate Character into nvarchar(1) type and got an error. So when we design this character in DB table, we should set double size of your original type, shouldn't we?

It really depends on the data distribution of your input data.  If nearly all your characters are Surrogate character, then it make sense to double it.   However, as far as I knew, the surrogate Characters defined in the Unicode CJK Extension B session are rarely used by Chinese people, which lead me to guess that surrogate characters will rarely used.  It is true that the CLR Supplement aware code can handle surrogate without breaking it, but keep in mind that  the CLR-version of string operation will be far slower that SQL server's string operation. For people who can read Chinese, This paper shows that without using CJK Extension B (which is defined in Unicode Surrogate Plane 1), the Chinese characters defined in Unicode can cover 99.99% of Modern Chinese characters.  

  • When comparing string, if we can't use Binary collating sequence, then we will got different result from we think. So should we do some process to change collating sequence before comparing?

You can always use any binary and binary2 collations although it wouldn't give you Linguistic correct result.  For SQL Server 2005, you SHOULD use Chinese_PRC_90_CI_AS or Chinese_PRC_Stoke_90_CI_AS which support surrogate pair comparison (but not linguistic). For SQL Server 2008, you should use Chinese_Simplified_Pinyin_100_CI_AS and Chinese_Simplified_Stroke_Order_100_CI_AS which have better linguistic surrogate comparison.  I do suggest you use these collation as your server/database/table collation instead of passing the collation name during comparison.

  • Since that Unicode will be supported limitedly ,When we used the function below, what should we pay our attention to?

I think each string function has two properties: 1) whether it is sensitively to collation, i.e., whether it returns different result for different collation 2) whether it can handle surrogate characters without break it.  Suppose the input of these functions are Nvarchar/NChar type, here is the information about whether these functions are collation aware, and whether they are surrogate aware.

 

String Functions

Collation/Language Aware

Surrogate Aware

Comments

LEN()

No

No

The length of Surrogate pair is 2 instead of 1

LTRIM/RTRIM()

No

Yes

LTRIM/RTRIM will only remove the "Space" (Unicode U+0020).  Existing Surrogate characters are not affected.

Reverse()

No

No

Reverse will always change the order of a surrogate pair, cause unable to display/interpret the character

LEFT/RIGHT/ SUBSTRING

No

No

These functions might break surrogate pairs

PATINDEX, CHARINDEX

Yes

Depends on the usage

These functions wouldn't break the surrogate character, and they are correctly search any input characters, including surrogate chars.  However, the return result is not surrogate aware (i.e., it count 2 for surrogate character).

 

Replicate(),

No

Yes

If the input parameter is surrogate pairs, the return value does not break them.

STUFF()

No

No

This function might break surrogate pairs

Unicode()

No

No

If the input is a surrogate pair, it only return the code point for first character.

QUOTENAME()

No

Yes

 

LIKE

Yes

Yes

 

Order By, Group By, String Comparison

Yes

Yes

 

 

Note, since we treat each character as 16bit code-point, all functional which have position as input parameter will not be surrogate aware, such functions include: LEN, LEFT, RIGHT, SUBSTRING.  On the other hand, Like, Ordering, Grouping and String Comparison are collation aware operation, if user selects the appropriate collation, we have no problem with Surrogate character. Reverse always break the surrogate input. 

The while paper “International Features in Microsoft SQL Server 2005” at https://msdn.microsoft.com/en-us/library/bb330962.aspx is a good reference.