SQL Server and UTF-8 Encoding (1) -True or False

Article
04/10/2009

Today, I will start my series of articles about SQL Server and Unicode UTF-8 Encoding. In many times, I found when people ask me about UTF-8, they actually don't understand UTF-8. So today's talk will be quite short. I just clarify some misunderstand.

1. SQL Server doesn't support Unicode, do you mean UTF-8?

Sometime, people just say "SQL Server doesn't support Unicode". Actually, it is wrong, SQL Server support Unicode since SQL Server 7.0 by providing nchar/nvarchar/ntext data type. But SQL Server doesn't support UTF-8 encoding for Unicode data, it supports the UTF-16 encoding. I copy several definitions from Internet for these concepts:

    "Unicode is the universal character encoding, maintained by the Unicode Consortium. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.  The unique number, which is called the code point of a character, is in the range 0 (hex) to 10FFFF(hex). It is normal to reference a Unicode code point by writing "U+" followed by its hexadecimal number. For example, Character A is defined as  "Latin Uppercase Alphabet",  and assigned a code point U+0041.

       In additional to assign each character to a unique code point, Unicode Consortium also defined several Unicode transformation formats (UTFs) which are the algorithmic mapping from every Unicode code point to a unique byte sequence. Note, the Unicode code point itself has nothing related to how to store/transform in a Computer, but a UTF does.

       The commonly used UTFs are UTF-16 and UTF-8. UTF-8 is the byte-oriented encoding form of Unicode which commonly used on Unix Operating System, Web Html and Xml File.  UTF-16 uses a single 16-bitcode unit to encode the most common 63K characters, and a pair of 16-bit code unites, called surrogates, to encode the 1M less commonly used characters in Unicode. UTF-16 is commonly used on Windows, .Net and Java. The transform between different UTFs are loseless and round-tripping. "

In summary, don't confuse with Unicode and UTF-8 Encoding. They are totally different concepts.

2. UTF-8 Encoding is much better than UTF-16 Encoding

There are tons of articles comparing with UTF-8 encoding with UTF-16 encoding. I will compare these two encoding side by side in my next article. I can have 100 reasons to say UTF-8 Encoding is better than UTF-16 Encoding, and verse vice. The correct answer is that no encoding is absolute better than the others. User should choose the suitable encoding according to your application software requirement. The operation system, programming language, database platform do matter when choosing the encoding. UTF-8 is most common on the web. UTF-16 is used by Java and Windows. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.

So please don't jeopardize SQL Server's Unicode support because of it only support one of the UTFs.

3. SQL Server cannot store all Unicode Characters

You may notice that I say SQL Server support UTF-16 Encoding in previous paragraph, but I also said SQL Server' nvarchar type encoding is UCS-2. I intend to do this in this article because I want to discuss the different between these two Encodings in here. Let us describe in details in what area SQL Server support UTF-16 encoding:

SQL Server can store any Unicode characters in UTF-16 encoding. The reason is that the storage format for UCS-2 and UTF-16 are the same.
SQL Server can display any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows functions to display characters, the Windows functions and fonts can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
SQL Server can input any Unicode characters in UTF-16 encoding. The reason is that we internally call Windows IMEs (Input Method Editors) to input, the Windows IMEs can take care of the supplementary character (a character take 4 bytes in UTF-16) correctly.
SQL Server can sort/compare any defined Unicode characters in UTF-16 encoding. Note, not all code points are map to valid Unicode character. For example, The Unicode Standard, Version 5.1 defines code points for around 10,000 characters. All these characters can be compared/sorted in SQL Server latest version: SQL Server 2008.

In contrast, I also list the UTF-16 thing SQL Server doesn't support:

SQL Server cannot detect Invalid UTF-16 sequence. Unpaired surrogate character is not valid in UTF-16 encoding, but SQL Server accept it as valid. Note, in reality, it is unlikely end-user will input invalid UTF-16 sequence since they are not support in any language or by any IMEs.
SQL Server treats a UTF-16 supplementary character as two characters. The Len function return 2 instead of 1 for such input.
SQL Server has potential risk of breaking a UTF-16 supplementary character into un-pair surrogate character, such as calling substring function. Note, in the real scenario, the chance of this can happen is much lower, because 1) supplementary character is rare 2) string function will only break this when it happens be the boundary. For example, when calling substring(s,5,1) will break if and only if the character at index 5 is a supplementary character.

In summary, SQL Server DOES support storing all Unicode characters; although it has its own limitation. Please refer to my previous blogs to details.

SQL Server and UTF-8 Encoding (1) -True or False

Additional resources