UTF-16 Encoding and SQL Server (3) - Length Semantics

As discussed here and here before, in order to support UTF-16 encoding in SQL Server, we need to handle both "good" data and "bad" data, and there are different approaches to achieve this.  Today, I will talk about length semantics again.

In SQL Server, we have two categories of string types: the char/varchar/varchar(max)/text types are always associated with a code page and can only store characters in the range of that code page; the nchar/nvarchar/nvarchar(max)/ntext types always use UCS-2 encoding and can store any Unicode character.  Suppose I declare a table with two columns, one varchar and the other nvarchar:

                Create table t1(c1 varchar(30), c2 nvarchar(30))

For column c1, the 30 means that the system will allocate 30 bytes for that column, and it can only hold as many characters as those 30 bytes can hold.  The above table has no COLLATE clause associated with the column definitions, so both columns use the current database's collation.  If the code page of column c1 (which is controlled by the collation of that column) is a single-byte character set (SBCS, a character set which uses exactly one byte for each graphic character), the column can ALWAYS store up to 30 characters.  However, if the code page is a multi-byte character set (MBCS, a character set which uses a variable number of bytes for each graphic character), then it can store up to 30 characters, but not always.  Suppose the collation of column c1 is Chinese_PRC_CI_AS, which has code page 936 (Simplified Chinese).  That code page encodes ASCII characters (range 0 to 127) in one byte, and all Chinese characters in two bytes.  In this case, column c1 can always hold at least 15 characters (when all of them are Chinese) and at most 30 characters (when all of them are ASCII).  In summary, for the varchar type, the length semantic is the number of bytes.
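
To make the byte-based length semantic concrete, here is a minimal sketch (the table name t_demo and the sample data are mine, not from the original example); LEN counts characters while DATALENGTH counts bytes:

        -- Assumes the Chinese_PRC_CI_AS collation (code page 936) discussed above
        Create table t_demo (c1 varchar(30) collate Chinese_PRC_CI_AS)

        -- 30 ASCII characters fit: one byte each
        Insert into t_demo values (replicate('a', 30))

        -- 15 Chinese characters also fill the column: two bytes each (15 * 2 = 30 bytes)
        Insert into t_demo values (replicate(N'中', 15))

        -- LEN returns the character count, DATALENGTH the byte count:
        -- (30, 30) for the ASCII row and (15, 30) for the Chinese row
        Select LEN(c1) as char_count, DATALENGTH(c1) as byte_count from t_demo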

Let us consider the nvarchar type.  Here is the definition I copied from BOL (SQL Server Books Online):

Unicode data and use the UNICODE UCS-2 character set.

nchar [ ( n ) ]

Fixed-length Unicode character data of n characters. n must be a value from 1 through 4,000. The storage size is two times n bytes. The ISO synonyms for nchar are national char and national character.

nvarchar [ ( n | max ) ]

Variable-length Unicode character data. n can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size, in bytes, is two times the number of characters entered + 2 bytes. The data entered can be 0 characters in length. The ISO synonyms for nvarchar are national char varying and national character varying.

So column c2 in the above table can ALWAYS hold up to 30 Unicode UCS-2 characters.  We can say that, for the nvarchar type, the length semantic is the number of characters.
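
For contrast with the varchar case, here is a small sketch (the variable and data are illustrative): 30 Chinese characters, which would need 60 bytes and therefore not fit in varchar(30), always fit in nvarchar(30), because the declared length counts UCS-2 characters and each character is stored in 2 bytes:

        Declare @c2 nvarchar(30)
        Set @c2 = replicate(N'中', 30)                                 -- 30 Chinese characters always fit
        Select LEN(@c2) as char_count, DATALENGTH(@c2) as byte_count  -- returns 30 and 60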

What if column c2, the nvarchar column, were encoded in UTF-16?  We have two options: 1) We say c2 can hold up to 30 Unicode UTF-16 characters, but NOT ALWAYS.  This is because a surrogate pair takes 4 bytes, so if we still allocate 62 bytes for values stored in c2, the column cannot hold 30 surrogate-pair characters.  2) We say c2 can ALWAYS hold up to 30 Unicode UTF-16 characters by internally allocating 122 bytes for values stored in c2.
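
To see where the 122 bytes comes from, here is a sketch of how a surrogate pair is stored today under UCS-2 (U+20021 is just an arbitrary supplementary character picked for illustration): the pair occupies two code units, i.e. 4 bytes, so 30 such characters would need 120 bytes of data instead of 60:

        Declare @s nvarchar(30)
        Set @s = NCHAR(55360) + NCHAR(56353)   -- 0xD840 + 0xDC21, the surrogate pair for U+20021
        -- Under today's UCS-2 semantic the pair counts as 2 "characters" and takes 4 bytes
        Select LEN(@s) as ucs2_length, DATALENGTH(@s) as byte_count   -- returns 2 and 4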

Let us compare the two approaches.  Approach one requires less code change and introduces no breaking change in existing functionality, but its length semantic behaves more like varchar's, and customers need to be aware of that.  Approach two has the better semantic meaning of character length, which customers might prefer.  The drawback is that users could only declare a length up to 2000: for example, c2 nvarchar(2001) would be invalid since, in theory, we might end up with 8006 bytes (2001 characters * 4 bytes + 2 bytes), which cannot fit in one physical page.  It is a breaking change for customers, but it also has certain benefits (I will discuss these later).  Oracle and MySQL both use the second approach.

Now, let us examine one of the string functions, SUBSTRING, which is declared as:

SUBSTRING(value_expression, start_expression, length_expression)

In UTF-16, the start_expression and length_expression arguments should both use the correct UTF-16 character length semantic, such that a surrogate pair is always treated as one character. If I run the query

                Select substring(c2, 10,2) from t1

What is the return type of the above query?  Today, since we know that the query returns at most two characters per value, the return type is nvarchar(2).  This becomes problematic when c2 uses UTF-16 encoding: if the two returned characters are both surrogate pairs, what is the return type then?  The answer depends on which of the two UTF-16 length semantics above we choose.  For approach one, we must either return nvarchar(4), or return nvarchar(2) and keep only the first surrogate pair.  For approach two, it is easy: the return type is nvarchar(2), although this nvarchar(2) has a different meaning from our old nvarchar(2) in UCS-2.
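
To illustrate why this matters, here is a hedged sketch of today's UCS-2 behavior (hypothetical data): SUBSTRING operates on code units, so it can hand back half of a surrogate pair, which is exactly the kind of "bad" data a UTF-16 length semantic is supposed to prevent:

        Declare @s nvarchar(10)
        Set @s = N'ab' + NCHAR(55360) + NCHAR(56353) + N'cd'   -- 'ab' + U+20021 + 'cd'
        -- With UCS-2 semantics, position 3 is the high surrogate code unit
        Select substring(@s, 3, 2) as whole_pair,    -- two code units: the complete surrogate pair
               substring(@s, 3, 1) as broken_half    -- one code unit: a dangling high surrogate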

What about this query:

               Select substring(c2, 10, @len) from t1

Since the system does not know the value of the variable @len, the return type will be nvarchar(30).
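
One way to observe that return type (a sketch that assumes the table t1 from above; t1_result is a throwaway table used only to capture the metadata):

        Declare @len int
        Set @len = 2
        Select substring(c2, 10, @len) as r into t1_result from t1
        Exec sp_help 't1_result'   -- column r should be reported as nvarchar with length 60 bytes, i.e. nvarchar(30)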

What about SUBSTRING with the varchar type, since some characters there are also double-byte?  I will check tomorrow and update the result here.