UTF-16 Encoding and SQL Server (1)

I am starting a series of articles describing my thoughts on UTF-16
support in SQL Server.  Before I dig into the topic, let me quote the article "Supplementary
Characters, Surrogate Pairs and SQL" to
describe the difference between UTF-16 and UCS-2:

UTF-16: This encoding stores the basic Unicode
characters using single 16-bit units and other characters using two 16-bit
units. UTF-16 is the primary encoding mechanism used by Microsoft Windows 2000,
Windows 2000 Server, Windows XP and Windows 2003 Server. On the other hand, SQL
Server 2000 uses the UCS-2 encoding scheme to store Unicode data.

Note: The UCS-2 encoding scheme is
actually a subset of the UTF-16 scheme. Every UCS-2 encoded code point is
identical to the encoding of the same code point in UTF-16. Also, most new
implementations using the Unicode standard now employ UTF-16, UTF-8 or UTF-32
instead of UCS-2.
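
To make the note above concrete, here is a small C++ sketch of how a supplementary character is encoded in UTF-16: subtract 0x10000 from the code point, then split the remaining 20 bits into a high and a low surrogate.  The function name and the sample character (U+10400) are just for illustration:

    #include <cstdint>
    #include <cstdio>

    // Encode a supplementary code point (U+10000..U+10FFFF) as a UTF-16
    // surrogate pair, following the Unicode standard's algorithm.
    void to_surrogate_pair(uint32_t codepoint, uint16_t &high, uint16_t &low)
    {
        uint32_t v = codepoint - 0x10000;                   // 20 significant bits
        high = 0xD800 + static_cast<uint16_t>(v >> 10);     // top 10 bits
        low  = 0xDC00 + static_cast<uint16_t>(v & 0x3FF);   // bottom 10 bits
    }

    int main()
    {
        uint16_t hi, lo;
        to_surrogate_pair(0x10400, hi, lo);                 // DESERET CAPITAL LETTER LONG I
        std::printf("U+10400 -> 0x%04X 0x%04X\n", hi, lo);  // prints 0xD801 0xDC00
        return 0;
    }

A UCS-2 system simply sees those two units as two unrelated 16-bit characters, which is exactly the gap between UCS-2 and UTF-16.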

I recommend reading the article above, the Wikipedia page on UTF-16/UCS-2, and also Supplementary
Characters in SQL Server 2005.  In addition, this article
explains why we say Windows supports UTF-16 encoding even though the core
Unicode data type WCHAR is not surrogate aware.

To answer the question of how SQL Server could support UTF-16
encoding for the nvarchar data type, we need to solve the following two issues:

How to deal with good data? Here, good
data is data that conforms to the UTF-16 encoding, i.e., it contains no invalid characters
(in particular, no lone surrogate code points). The answer is pretty simple: follow the
correct length semantics and don't make the data invalid. The nvarchar type internally uses a WCHAR* buffer to
store Unicode values, where each WCHAR is a short type (2 bytes). It is necessary
to distinguish the number of characters from the number of WCHARs. For
any character-based operation we expose to end users, the system should treat a
surrogate pair as a single character.
For example, the len function returns the number of characters, so a surrogate pair
should count as one character. For all string functions that
take an integer index as an input parameter, the index should mean the character
index. Internally, however, we do not necessarily have to use character length
semantics.
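
As a rough sketch of what character length semantics mean in code (not SQL Server's actual implementation), here is a C++ function that counts characters rather than 16-bit units; std::u16string stands in for the internal WCHAR buffer, since each char16_t is the same 2-byte unit:

    #include <cstdio>
    #include <string>

    // Count characters in a UTF-16 string, treating each surrogate pair as a
    // single character -- the semantic a surrogate-aware len() would need.
    size_t CharacterLength(const std::u16string &s)
    {
        size_t chars = 0;
        for (size_t i = 0; i < s.size(); ++i)
        {
            ++chars;
            // A high surrogate followed by a low surrogate is one character
            // occupying two 16-bit units, so skip the second unit.
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                i + 1 < s.size() &&
                s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            {
                ++i;
            }
        }
        return chars;
    }

    int main()
    {
        // "A" followed by U+10400 (a surrogate pair): 3 units, 2 characters.
        std::u16string s = u"A\U00010400";
        std::printf("units = %zu, characters = %zu\n", s.size(), CharacterLength(s));
        return 0;
    }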

There are many ways that good
UTF-16 data might become invalid, and we need to avoid them.  Using the correct character semantics is one thing
we certainly need to do.  There are other
cases where we might generate invalid data from good data. For example, when sending/receiving data to/from the client,
we might send/receive the data in chunks, so we need to be careful not to break
a surrogate pair across chunks. Another case is the nvarchar(max) and ntext types, which
store data across multiple physical pages; storing the two halves of a surrogate pair on different
pages might cause potential issues.
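
Here is a minimal sketch of the chunking concern, assuming a simple buffer-and-length interface rather than the actual TDS or storage code: before cutting a chunk, check whether its last 16-bit unit is a high surrogate and, if so, move the boundary back by one unit so the pair stays together:

    #include <cstddef>
    #include <cstdint>

    // Given a UTF-16 buffer, its total length, and a desired chunk length
    // (all in 16-bit units), return a chunk length that does not split a
    // surrogate pair across the boundary.
    size_t SafeChunkLength(const uint16_t *buf, size_t total, size_t desired)
    {
        if (desired == 0)
            return 0;
        if (desired >= total)
            return total;
        uint16_t last = buf[desired - 1];
        // If the chunk would end on a high surrogate, its low surrogate would
        // land in the next chunk -- shrink the chunk by one unit instead.
        if (last >= 0xD800 && last <= 0xDBFF)
            return desired - 1;
        return desired;
    }

The same idea applies to page boundaries for nvarchar(max) and ntext: either keep the pair on one page or make every reader aware that a pair may span pages.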

Now, let us consider the "bad" data. According to the UTF-16 standard, bad data
here is data which contains a lone surrogate code point (i.e., not part of a pair). The first question is how we prevent bad
data from being persisted into the database. There
are many ways to insert data into a database, such as converting from binary to
the nvarchar type, bcp/bulk insert, calling an RPC, etc. Note that C#'s String type and the C++ WCHAR type do no
UTF-16 validation, which means
that SQL Server must do the input validation itself. Actually, we already have some UTF-16 validation
inside SQL Server: surrogate data (either paired or single) is not
supported in metadata, such as the names of database objects, and SQL Server
does have a check that disallows surrogate data there.
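
A validation routine for the remaining input paths could look roughly like the following C++ sketch (char16_t stands in for WCHAR); the only well-formedness rule UTF-16 adds on top of UCS-2 is that every surrogate unit must be part of a proper high/low pair:

    #include <cstddef>
    #include <string>

    // Return true if the string contains no unpaired surrogate code unit.
    bool IsValidUtf16(const std::u16string &s)
    {
        for (size_t i = 0; i < s.size(); ++i)
        {
            char16_t c = s[i];
            if (c >= 0xD800 && c <= 0xDBFF)            // high surrogate
            {
                // It must be immediately followed by a low surrogate.
                if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                    return false;
                ++i;                                   // skip the paired low surrogate
            }
            else if (c >= 0xDC00 && c <= 0xDFFF)       // low surrogate with no high
            {
                return false;
            }
        }
        return true;
    }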

Another question I would like to raise is what happens
if we fail to do UTF-16 data validation.
For example, suppose the data stored on disk is already invalid (this
can happen if we upgrade from an old version of SQL Server that has some
invalid UTF-16 data stored; of course,
we could validate the data during upgrade, but that would be a size-of-data operation,
which we usually want to avoid). What should be the result of calling a string
function with such invalid data as input?
We can 1) throw an error and terminate the statement, 2) treat a lone
surrogate code point like any other normal character, or 3) change the invalid data into
valid data (such as a question mark ?, which is what we do when converting nvarchar
data into varchar data and a character is not in the code page associated with
the varchar type).   Case 2 raises another
interesting question: can we just treat a lone surrogate code point
as a valid character and not do any data validation at all?
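
To make option 3 concrete, here is a sketch that keeps complete pairs intact and replaces every lone surrogate with a question mark, similar in spirit to the nvarchar-to-varchar substitution mentioned above.  This is only an illustration of the idea, not what SQL Server does today:

    #include <cstddef>
    #include <string>

    // Copy the input, substituting '?' for every unpaired surrogate unit.
    std::u16string ReplaceLoneSurrogates(const std::u16string &s)
    {
        std::u16string out;
        out.reserve(s.size());
        for (size_t i = 0; i < s.size(); ++i)
        {
            char16_t c = s[i];
            bool high = (c >= 0xD800 && c <= 0xDBFF);
            bool low  = (c >= 0xDC00 && c <= 0xDFFF);
            if (high && i + 1 < s.size() &&
                s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            {
                out += c;          // keep the complete surrogate pair
                out += s[++i];
            }
            else if (high || low)
            {
                out += u'?';       // lone surrogate: substitute a placeholder
            }
            else
            {
                out += c;
            }
        }
        return out;
    }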

Let us end today's talk with an example of UTF-16 support in
SQL Server today.  In SQL Server 2005, we
have XML data type support.  The XML data
type is internally stored in the UTF-16 encoding (we use the varbinary(max) data type
internally, but the content itself is UTF-16 encoded text).  In particular, lone surrogate code points are not
allowed, and SQL Server automatically detects such invalid data when parsing the
XML.  However, some of the XQuery functions, such as
the substring function, internally call SQL Server's substring function, which makes
those functions surrogate unsafe.
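
For comparison, a surrogate-safe substring has to interpret the start position and length as character counts and must never split a pair.  A rough C++ sketch (illustrative only, not the actual SQL Server or XQuery code):

    #include <cstddef>
    #include <string>

    // Return 'length' characters starting at character index 'start',
    // where a surrogate pair counts as a single character.
    std::u16string SubstringByCharacter(const std::u16string &s,
                                        size_t start, size_t length)
    {
        std::u16string out;
        size_t charIndex = 0;
        size_t taken = 0;
        for (size_t i = 0; i < s.size() && taken < length; )
        {
            // The current character is 2 units wide if it is a surrogate pair.
            size_t width = 1;
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                i + 1 < s.size() &&
                s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                width = 2;

            if (charIndex >= start)
            {
                out.append(s, i, width);
                ++taken;
            }
            ++charIndex;
            i += width;
        }
        return out;
    }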