UTF-16 Encoding and SQL Server (2)

As I discussed here, in order to support UTF-16 encoding in SQL Server, we
need to handle both "good" data and "bad" data. Today, I will discuss several
approaches to adding UTF-16 support in SQL Server. For each approach, I will
describe its advantages and disadvantages, and leave readers to judge which
one they prefer.

The first approach focuses mainly on handling good data. Since the existing
string functions are not surrogate aware, and can break a surrogate pair
(good data) into lone surrogate code points (bad data), why not just
introduce a set of new functions which use the correct character length
semantics? For example, we could introduce char_len, char_substring,
char_left, and char_right functions, which treat a surrogate pair as one
character. For the existing string functions, the semantics and results
would not change. As for "bad" data, I would say we probably have to treat
it as "good" data, since we have not prevented the existing string functions
from generating bad data out of good data. The advantage of this approach is
that we only add new functions and leave the old functions unchanged (in
terms of both functionality and performance). People who don't care about
surrogate pairs can continue to use the old string functions without seeing
any difference, while people who do care can start using the new functions.
The disadvantage of this approach is that it does not really solve the "bad"
data issue, and people have to choose between the new string functions and
the old ones.
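
To make the idea concrete, here is a rough sketch of how the proposed char_len
could behave, written as a T-SQL user-defined function. The name and the
implementation are purely illustrative (char_len is only a proposed name, not
an existing feature), and the sketch assumes the classic UCS-2 code-unit
semantics of LEN, SUBSTRING, and UNICODE described in this post.

    -- Illustrative sketch only: a surrogate-aware length function.
    -- A well-formed surrogate pair counts as one character; a lone
    -- surrogate ("bad" data) is simply counted as one character as well.
    CREATE FUNCTION dbo.char_len (@s nvarchar(max))
    RETURNS int
    AS
    BEGIN
        DECLARE @i int, @n int, @count int, @cu int;
        SET @i = 1;
        SET @n = LEN(@s);
        SET @count = 0;
        WHILE @i <= @n
        BEGIN
            SET @cu = UNICODE(SUBSTRING(@s, @i, 1));
            IF @cu BETWEEN 55296 AND 56319   -- 0xD800-0xDBFF: high surrogate
               AND @i < @n
               AND UNICODE(SUBSTRING(@s, @i + 1, 1)) BETWEEN 56320 AND 57343   -- 0xDC00-0xDFFF: low surrogate
                SET @i = @i + 2;    -- a well-formed pair counts as one character
            ELSE
                SET @i = @i + 1;    -- BMP character, or a lone surrogate
            SET @count = @count + 1;
        END;
        RETURN @count;
    END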

Let us talk about the second approach. It is pretty simple: we just fix all
existing functions so that they use the correct character length semantics.
The good thing is that we will never break good data and will always return
correct results. The cost is that we may pay a certain performance penalty.
For example, the LEN() function currently just needs to return the number of
WCHARs, which takes O(1) time; the new LEN() would have to traverse the whole
string even when there is no surrogate character in it, which will be quite
slow. Another issue is that changing the result of existing functions is a
breaking change. LEN() used to return 2 for a surrogate pair, but it would
return 1 now. Even though the new result is correct, some customers might
still want to keep the old behavior. Also, I haven't checked the return types
of these string functions (I will start another topic to cover this). For
example, substring(c1, 1, 1) returns nvarchar(1) today; under the new
semantics one "character" may be a surrogate pair, so it might need to return
nvarchar(2) as its data type. This behavior change is not trivial, and it is
definitely a breaking change. One possible solution is to use the
compatibility level: a user can set the database's compatibility level lower
than the current one to get the old behavior and also enjoy its better
performance. Then, what about the "bad" data? I am in favor of fixing the bad
data whenever possible. What about database upgrade? Do we want to scan the
whole database to verify there is no bad data, or do we just assume there is
none and only raise an error when we find bad data later? I would like to
raise another question here: can we track whether there are surrogate
characters in a column of a table? If so, can we smartly use the old
algorithm for such a column?
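
Here is a small demonstration of the breaking change, using a supplementary
character built from its two surrogate code units. The exact results assume
the classic UCS-2 (code-unit) semantics described in this post; U+2070E is
just an arbitrary supplementary-plane example.

    -- U+2070E, a CJK Extension B character, stored as the surrogate pair D841 DF0E.
    DECLARE @s nvarchar(10);
    SET @s = NCHAR(0xD841) + NCHAR(0xDF0E);

    SELECT LEN(@s)             AS len_result,   -- 2 today: LEN counts WCHARs (UTF-16 code units)
           DATALENGTH(@s)      AS byte_count,   -- 4 bytes
           SUBSTRING(@s, 1, 1) AS first_unit;   -- a lone high surrogate: good data turned into bad data
    -- A surrogate-aware LEN() would return 1 here, which is exactly the breaking
    -- change (and the data type question) discussed above.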

I have a third solution in mind, which is to introduce the concept of an
encoding at the database level. A database would have an Encoding option
which indicates the encoding of the nvarchar type. The default encoding
would be UCS-2, which is the encoding we currently support, so all existing
behavior stays the same. A UTF-16 encoding would be introduced and could be
set at the database level. Once the database encoding is set to UTF-16, all
string functions would follow the UTF-16 length semantics, and we would
perform data validation as well. The advantage of this approach is that
users enjoy the better performance of the old behavior by default, and there
is no upgrade issue either; new users can choose the new behavior only when
they intend to. What about changing the encoding from UCS-2 to UTF-16: do we
need to do data validation then? I don't know the answer. BTW, other
database vendors use a similar mechanism as well.
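
If SQL Server took this route, the option might look roughly like the
following. To be clear, the ENCODING option in this ALTER DATABASE statement
is invented here purely to illustrate the idea; it is not existing syntax,
and SalesDemo is just a placeholder database name.

    -- Hypothetical syntax only: a database-level encoding option for nvarchar.
    CREATE DATABASE SalesDemo;                          -- defaults to UCS-2: old semantics,
                                                        -- old performance, no upgrade impact
    ALTER DATABASE SalesDemo SET ENCODING = 'UTF-16';   -- opting in: string functions switch to
                                                        -- UTF-16 length semantics, data is validated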

Finally, I would like to briefly describe other approaches which I thought
of, but which I personally don't like. How about introducing new collations,
such as latin1_general_UTF16_ci_as? We already have enough collations to
confuse our customers, so I am not in favor of adding more. How about
introducing a new column-level encoding? I believe that in most cases people
want the whole database to have the same behavior, so fine-grained control
over the encoding might not be interesting to our customers.
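
Just to show what these rejected options would mean at the schema level,
here is a hedged sketch. Both the latin1_general_UTF16_ci_as collation (the
name proposed above) and the column-level ENCODING clause are hypothetical;
neither exists in SQL Server, and the table definitions are placeholders.

    -- Hypothetical: opting in per column through a new collation ...
    CREATE TABLE dbo.Messages
    (
        id   int NOT NULL,
        body nvarchar(400) COLLATE latin1_general_UTF16_ci_as   -- hypothetical collation name
    );

    -- ... or through a hypothetical column-level encoding clause.
    CREATE TABLE dbo.Notes
    (
        id   int NOT NULL,
        body nvarchar(400) ENCODING 'UTF-16'                    -- hypothetical syntax
    );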

In summary, performance and backwards compatibility are the two issues which
need to be carefully considered when implementing UTF-16 in SQL Server. My
next blog will examine all string functions to see what the impact of UTF-16
encoding would be.

BTW, Michael Kaplan posted a series of articles here describing his ideas on
UCS-2 to UTF-16 migration for SQL Server. I have some difficulty following
his ideas, but I can guess the basic principle. In some cases, a sequence of
Unicode code points forms one character from the end user's point of view;
examples are the two-letter "ss" in German, Korean Jamo, and Thai. Breaking
such a character sequence is not desirable to end users. The question, then,
is whether we should also make the new functions take such "true" characters
into consideration. Here, I quote the Unicode FAQ entry "How should
characters (particularly composite characters) be counted, for the purposes
of length, substrings, positions in a string, etc." as the end of this topic:

A: In
general, there are 3 different ways to count characters. Each is illustrated
with the following sample string.
"a" + umlaut + greek_alpha + \uE0000. (the latter is a private use
character)

1. Code Units: e.g. how many bytes are in the physical representation of the string. Example:
In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
In UTF-32BE, it has 16 bytes. [00 00 00 61 00 00 03 08 00 00 03 B1 00 0E 00 00]

2. Codepoints: how many code points are in the string.
The sample has 4 code points. This is equivalent to the UTF-32BE count divided by 4.

3. Graphemes: what end-users consider as characters.
A default grapheme cluster is specified in UAX #29, Text Boundaries, as well as in UTS #18, Regular Expressions.

The
choice of which one to use depends on the tradeoffs between efficiency and
comprehension. For example, Java, Windows and ICU use #1 with UTF-16 for all
low-level string operations, and then also supply layers above that provide for
#2 and #3 boundaries when circumstances require them. This approach allows for
efficient processing, with allowance for higher-level usage. However, for a
very high level application, such as word-processing macros, graphemes alone
will probably be sufficient.
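
To tie the FAQ back to SQL Server, here is a hedged sketch that builds the
FAQ's sample string ("a" + combining umlaut + Greek alpha + U+E0000, the last
expressed as the surrogate pair DB40 DC00) and shows the two counts SQL
Server can report directly. The stated results assume the classic UCS-2
(code-unit) behavior described in this post; the code point count (4) and
the default grapheme count (3) have no built-in function and would need the
kind of surrogate-aware and grapheme-aware logic discussed above.

    -- The FAQ's sample string, built code unit by code unit.
    DECLARE @sample nvarchar(10);
    SET @sample = N'a'
                + NCHAR(0x0308)                   -- combining diaeresis (the umlaut)
                + NCHAR(0x03B1)                   -- Greek small letter alpha
                + NCHAR(0xDB40) + NCHAR(0xDC00);  -- U+E0000 as a UTF-16 surrogate pair

    SELECT LEN(@sample)        AS utf16_code_units,  -- 5, i.e. the FAQ's 10 UTF-16 bytes / 2
           DATALENGTH(@sample) AS utf16_bytes;       -- 10, matching the FAQ's UTF-16BE count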