SQL Server 2008 R2 UNICODE Compression – what happens in the background?

SQL Server 2008 R2 added a much-requested feature: Unicode compression, which addresses the need to compress Unicode strings. It is implemented as part of ROW compression, which was introduced in SQL Server 2008. That is, if ROW compression is enabled (on SQL Server 2008 R2) on a table that contains NCHAR / NVARCHAR datatypes, the Unicode compression algorithm will kick in for each individual Unicode column.

Note: since PAGE compression is a ‘superset’ that includes ROW compression, enabling it will also enable Unicode compression.
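For reference, ROW (and PAGE) compression is enabled per table or index with the usual rebuild syntax; the table and index names below are just placeholders, not objects from any particular workload:

-- Enable ROW compression on an existing table; any NCHAR / NVARCHAR
-- columns in it become candidates for Unicode compression.
ALTER TABLE dbo.SomeTable
REBUILD WITH (DATA_COMPRESSION = ROW);

-- PAGE compression includes ROW compression, so this also enables
-- Unicode compression for the rebuilt index.
ALTER INDEX IX_SomeTable_Name ON dbo.SomeTable
REBUILD WITH (DATA_COMPRESSION = PAGE);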

There is another blog post that covers the basics of Unicode compression, including specific compression numbers for various locales. It can be found here: https://blogs.msdn.com/sqlserverstorageengine/archive/2009/08/17/unicode-compression-in-sql-server-2008r2.aspx

However, I recently ran an ISV test on Unicode compression and it didn’t produce the expected results, so I decided to look at the actual compression implementation in more detail.

The current implementation of ROW compression is very simple to understand: all columns are implicitly converted to variable-length columns under the covers. I assumed that Unicode compression would be equally intuitive. Today all SQL Server Unicode data is stored in UCS-2 format. This means that a single-byte ASCII character such as ‘a’ is stored as two bytes (0x0061) in NCHAR vs. a single byte (0x61) in CHAR. With Unicode compression I assumed that these single-byte ASCII characters would be compressed to a single byte, and that ‘traditional’ double-byte characters (such as Russian, where the leading byte is not 00) would continue to be stored as two bytes (as they are today).
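The uncompressed UCS-2 baseline is easy to verify with DATALENGTH; this only illustrates the two-bytes-per-character starting point, not the compression itself:

-- 'a' as CHAR is one byte (0x61); as NCHAR it is two bytes,
-- stored little-endian on disk (0x6100).
SELECT DATALENGTH('a')             AS char_bytes,   -- 1
       DATALENGTH(N'a')            AS nchar_bytes,  -- 2
       CONVERT(VARBINARY(2), N'a') AS nchar_raw;    -- 0x6100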

Well, this assumption is not really true… to understand why, one needs to understand the actual algorithm (SCSU) employed by SQL Server 2008 R2.

The algorithm used to implement Unicode compression, the Standard Compression Scheme for Unicode (SCSU), is described in the following spec: https://unicode.org/reports/tr6/.

Looking at the spec, one can see some things that completely make sense, but are not entirely intuitive. For example (from the spec):

9.1 German

German can be written using only Basic Latin and the Latin-1 supplement, so all characters above 0x0080 use the default position of dynamically positioned window 0.

Sample text (9 characters)

Öl fließt

Unicode code points (9 code points):

00D6 006C 0020 0066 006C 0069 0065 00DF 0074

NOTE: In SQL Server 2008 and prior versions this would translate into 18 Bytes in a NCHAR column (using UCS-2 encoding).

Unicode Compressed (9 bytes):

D6 6C 20 66 6C 69 65 DF 74

9.2 Russian

Sample text (6 characters)

Москва

Unicode code points (6 code points):

041C 043E 0441 043A 0432 0430

NOTE: In SQL Server 2008 and prior versions this would translate into 12 Bytes.

Unicode Compressed (7 bytes):

12 9C BE C1 BA B2 B0

9.3 Japanese

Japanese text almost always profits from the multiple predefined windows in SCSU.

Sample text (116 characters)

 ♪リンゴ可愛いや可愛いやリンゴ。半世紀も前に流行した「リンゴの歌」がぴったりするかもしれない。米アップルコンピュータ社のパソコン「マック(マッキントッシュ)」を、こよなく愛する人たちのことだ。「アップル信者」なんて言い方まである。

Unicode code points (116 code points)

3000 266A 30EA 30F3 30B4 53EF 611B
3044 3084 53EF 611B 3044 3084 30EA 30F3
30B4 3002 534A 4E16 7D00 3082 524D 306B
6D41 884C 3057 305F 300C 30EA 30F3 30B4
306E 6B4C 300D 304C 3074 3063 305F 308A
3059 308B 304B 3082 3057 308C 306A 3044
3002 7C73 30A2 30C3 30D7 30EB 30B3 30F3
30D4 30E5 30FC 30BF 793E 306E 30D1 30BD
30B3 30F3 300C 30DE 30C3 30AF FF08 30DE
30C3 30AD 30F3 30C8 30C3 30B7 30E5 FF09
300D 3092 3001 3053 3088 306A 304F 611B
3059 308B 4EBA 305F 3061 306E 3053 3068
3060 3002 300C 30A2 30C3 30D7 30EB 4FE1
8005 300D 306A 3093 3066 8A00 3044 65B9
307E 3067 3042 308B 3002

Unicode Compressed (178 bytes)

08 00 1B 4C EA 16 CA D3 94 0F 53 EF 61 1B E5 84
C4 0F 53 EF 61 1B E5 84 C4 16 CA D3 94 08 02 0F
53 4A 4E 16 7D 00 30 82 52 4D 30 6B 6D 41 88 4C
E5 97 9F 08 0C 16 CA D3 94 15 AE 0E 6B 4C 08 0D
8C B4 A3 9F CA 99 CB 8B C2 97 CC AA 84 08 02 0E
7C 73 E2 16 A3 B7 CB 93 D3 B4 C5 DC 9F 0E 79 3E
06 AE B1 9D 93 D3 08 0C BE A3 8F 08 88 BE A3 8D
D3 A8 A3 97 C5 17 89 08 0D 15 D2 08 01 93 C8 AA
8F 0E 61 1B 99 CB 0E 4E BA 9F A1 AE 93 A8 A0 08
02 08 0C E2 16 A3 B7 CB 0F 4F E1 80 05 EC 60 8D
EA 06 D3 E6 0F 8A 00 30 44 65 B9 E4 FE E7 C2 06
CB 82

Looking at the examples above, one can see that the ‘00’ leading bytes were stripped out of the compressed German data. This will be true of all commonly used ASCII characters (and, as the Ö and ß show, of Latin-1 supplement characters as well). So far, so good; my assumption holds true.
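This is easy to check from the code points themselves; every character in the German sample has a code point below 0x0100, which is why each one survives as just its low byte (a quick illustration, not a full SCSU encoder):

-- All code points in 'Öl fließt' fit in a single byte once the 00 lead byte is dropped.
SELECT c AS ch,
       UNICODE(c)                        AS code_point,
       CONVERT(VARBINARY(2), UNICODE(c)) AS code_point_hex   -- e.g. 0x00D6 for Ö
FROM (VALUES (N'Ö'), (N'l'), (N'f'), (N'i'), (N'e'), (N'ß'), (N't')) AS t(c);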

Looking at the Russian example, we see something slightly different:

The ‘04’ leading bytes were completely stripped out and replaced by a single additional leading ‘tag’ byte: ‘12’, which selects the predefined window covering the Cyrillic range starting at 0x0400. Each remaining character is then stored as a single byte: its offset within that window plus 0x80. To follow along with the example above: the leading value ‘041C’ becomes ‘12 9C’. The ‘12’ replaces the 0x0400 ‘code page identifier’, and ‘1C’ becomes 0x1C + 0x80 = 0x9C. Since the following characters all belong to the same code page, they can be compressed without any additional leading bytes. Here we can see very clearly that compression can work efficiently even for traditional double-byte characters.
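The arithmetic is easy to reproduce; subtracting the window base (0x0400) and adding 0x80 yields exactly the bytes that follow the 12 tag (this is just the offset calculation, not a full SCSU encoder):

-- Produces 0x9C, 0xBE, 0xC1, 0xBA, 0xB2, 0xB0 for the letters of Москва.
SELECT c AS ch,
       CONVERT(VARBINARY(2), UNICODE(c))                 AS code_point,
       CONVERT(VARBINARY(1), UNICODE(c) - 0x0400 + 0x80) AS compressed_byte
FROM (VALUES (N'М'), (N'о'), (N'с'), (N'к'), (N'в'), (N'а')) AS t(c);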

However, the Russian example exposes another interesting detail regarding the algorithm. The characters in the 0x0400 range are offset by 0x80, so what happens when we reach a character whose resulting value would ‘go over’ 0xFF? The algorithm accounts for this by ‘splitting’ the range into two 128-character windows (0x0400-0x047F and 0x0480-0x04FF); some characters get ‘tagged’ with 12 and the rest are tagged with another unique number. For example, the Cyrillic ‘small letter Ghe with upturn’, ґ = 0x0491, will be represented as ‘91’ with a different leading tag identifier.
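A quick check for ґ (0x0491) shows why it cannot share the first window; the 0x0480 base for the second window below is my reading of the SCSU window rules, not something documented by SQL Server:

SELECT UNICODE(N'ґ')                                        AS code_point,          -- 1169 = 0x0491
       UNICODE(N'ґ') - 0x0400 + 0x80                        AS first_window_value,  -- 273, does not fit in one byte
       CONVERT(VARBINARY(1), UNICODE(N'ґ') - 0x0480 + 0x80) AS second_window_byte;  -- 0x91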

The Japanese example is much more complex, because the code points come from many different ranges (‘code pages’, analogous to the Russian example above). Therefore, there is no consistent leading byte as in the Russian and German examples. The algorithm has to use ‘tags’ as described above, but also ‘special-case delimiters’ to signal the end or start of a range: a tag, an ‘I am ending data from this tag’ delimiter, then a new tag, and then the start of the newly compressed trailing characters. One can conclude that certain very large ‘character sets’ like Japanese are difficult to compress using this algorithm.
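The mix of ranges is visible straight from the code points; a few characters from the sample fall into entirely different Unicode blocks, so no single 128-character range can cover them (the block names are standard Unicode ranges, not SQL Server terminology):

SELECT c AS ch,
       CONVERT(VARBINARY(2), UNICODE(c)) AS code_point,
       CASE
           WHEN UNICODE(c) BETWEEN 0x3040 AND 0x309F THEN 'Hiragana'
           WHEN UNICODE(c) BETWEEN 0x30A0 AND 0x30FF THEN 'Katakana'
           WHEN UNICODE(c) BETWEEN 0x4E00 AND 0x9FFF THEN 'CJK ideograph'
           WHEN UNICODE(c) BETWEEN 0xFF00 AND 0xFFEF THEN 'Fullwidth form'
           ELSE 'Other'
       END AS unicode_block
FROM (VALUES (N'リ'), (N'の'), (N'愛'), (N'（')) AS t(c);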

One can also conclude that a single Unicode string that contains characters from many different code pages (e.g. Russian, Thai, Japanese, Chinese) could easily end up being larger than the original after the algorithm is done making the required substitutions and additions.

The SQL Server implementation of the algorithm checks whether the compressed value is larger than the original and, if so, stores the ‘uncompressed value’ instead.
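You cannot observe this per-value decision directly, but you can estimate the overall effect before enabling compression with the standard estimation procedure (the schema and table names are placeholders):

EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'SomeTable',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'ROW';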

This, again, brings up an interesting question: how does one tell a compressed Unicode string from an uncompressed Unicode string, given that both are now possible in the same row (in different columns)?

The SQL Server implementation makes sure that every compressed Unicode string has an odd number of bytes, while every uncompressed Unicode string has an even number of bytes (as was the case before compression). That means additional special-case characters are used where necessary to maintain this ‘odd length’ rule.

There are other SQL Server implementation details as well; for example, there must be at least three consecutive characters from the same code page before the compression ‘tagging’ is triggered.

In summary, one can see that there is much more to Unicode compression than just saying that ‘traditional single-byte ASCII’ characters will be stored as a single byte. In fact, many multi-byte code pages, such as Thai, compress down as far as 40%.

For more details please see the algorithm specs at the website posted above.

I have presented a very simplistic view of what is happening, but I believe it is enough to gain a rudimentary understanding of how SQL Server 2008 R2 Unicode compression is implemented.

Oh, and if you are wondering why the ISV workload results I was looking at didn’t produce the expected result: most of the Unicode data was stored in NVARCHAR(MAX) and/or NTEXT datatypes. These column types do not participate in Unicode compression at this time. This may change in a future version of SQL Server.
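If you want to check up front how much of your Unicode data falls outside the feature, a simple catalog query will list the excluded columns (just a convenience query I use, not an official tool):

-- NVARCHAR(MAX) and NTEXT columns are not Unicode-compressed in SQL Server 2008 R2.
SELECT OBJECT_SCHEMA_NAME(c.object_id) AS schema_name,
       OBJECT_NAME(c.object_id)        AS table_name,
       c.name                          AS column_name,
       t.name                          AS data_type
FROM sys.columns AS c
JOIN sys.tables  AS tb ON tb.object_id = c.object_id
JOIN sys.types   AS t  ON t.user_type_id = c.user_type_id
WHERE t.name = N'ntext'
   OR (t.name = N'nvarchar' AND c.max_length = -1);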