UTF-16 Encoding and SQL Server (4) - String Function Case Study

Today,   I will give
an example of different implementation of a string function.  Let us look at function:

SUBSTRING(value_expression
,start_expression , length_expression )

In here, I proposed
four kinds of alternative functions by giving them different name to distinct
the functionality of them.

1.      
USC2-SUBSTRING.   This function has the same functionality as the SUBSTRING
function.

2.      
UTF16-SUBSTRING: This function take UTF-16 string as
input, return type is also UTF-16 string

3.      
UTF8-SUBSTRING: This function take UTF-8 string as input,
return type is also UTF-8 string

4.      
CHAR_SUBSTRING: This function is a linguistic
version of substring, it wouldn't break the character in the given language,
and the encoding can be either UCS-2/UTF-8 or UTF16.

For the length semantic, I will assume that it is character
length with respected to the corresponding Encoding. I will use following table
to show the differences between these functions when calling function:

                Declare
@a nvarchar(10)

               select substring(@a, 5, 2)

, suppose the 10 here represents 10
characters
  

Function Name

Input Encoding

Output Encoding

Output Type

Output  Buffer Bytes

Surrogate Aware

Collation

Sensitive

Algorithm

Complexity

USC2-SUBSTRING

UCS-2

UCS-2

Nvarchar(2)

4

No

No

Simple, and Fast

UTF16-SUBSTRING

UCS-16

UCS-16

Nvarchar(2)

8

Yes

No

Always scan from the beginning, need to handle surrogate pair.

UTF8-SUBSTRING

UTF-8

UTF-8

Nvarchar(2)

8

Yes

No

Always scan from the beginning. See (1)

CHAR_SUBSTRING

Variance

=input

5, see (2)

Depends on encoding

Yes, see (3)

Yes

More complex. See (4)

 

(1)   
 For
UTF8-SUBSTRING. In theory, we can do a conversion from UTF-8 to UTF-16, and
call UTF-16 version of the algorithm. 
But it will have double conversion since the result need to convert back
to UTF-8.

(2)   
It is unknown for how many bytes one linguistic
character will take until the run-time. 
So we have to make the result type = input_length- input_index

(3)   
A surrogate pair is always treated as one
linguistic character, just it is surrogate aware

(4)   
The function itself will be pretty complex. For
looking for one character, it needs search deeper and the move the pointer back.
 

In addition, I like to give some
alternative solution for UTF16_SUBSTRING:

1)     
We call UCS2-SUBSTRING internally, and the last
character it return a leading surrogate character, just removing such character
from the result. The advantage of this approach is that it is fast, and it
wouldn't break surrogate pairs.  The drawback
is that it doesn't respect UTF-16 length semantic, which will lead programming
issues.

2)     
We call the new UTF16-SUBSTRING function above, but
return type is nvarchar(2), and here the 2 means that four UCS-2 characters.  I.e., we keep the existing return type, and
semantic as the same.  Again, the result
can only hold one UTF-16 character instead of 2.

3)     
We call the new UTF16-SUBSTRING function above, but
return type is nvarchar(4), and here the 4 means that four UCS-2 characters.  I.e., we obey the UTF-16 length semantic for
input parameter, but not for output.   In
these ways, we have the same meaning of nvarchar(4) in UCS-2 encoding and
UCS-16 encoding.  But we still break the
return type from nvarchar(2) to nvarchar(4).

Finally, I like to point out that we have CLR examples of Supplementary-Aware
(Supplementary character refer to a Surrogate Pair) String Manipulation at here.  It internally uses StringInfo.ParseCombiningCharacters
Method.  It not only handles Supplementary
characters, but according to the API, it handles other combining character
as well.  Note, the different between this method and the CHAR_SUBSTRING I
presented here is that there are more language specific rules or character sequences
which cannot be handled by the C# method.