A Unicode Compression example

Now that we have Unicode compression available in SQL Server 2008R2 as described in, let me take a simple example using AdventureWorksDW database to show you the additional compression that can be achieved on tables with one or more columns of type NCHAR or NVARCHAR.

use [AdventureWorksDW2008]

-- the table FactResellerSales is a FACT table with three columns of type

-- NVARCAHR types. Let us find the average length of each of these columns

select AVG( LEN(salesordernumber))

from FactResellerSales

This returns a value of 7. This represents number of character in the string and not the numnber of bytes.

select AVG( LEN(CarrierTrackingNUmber))

from FactResellerSales

This returns a value of 12. This represents number of character in the string and not the numnber of bytes.

select AVG( LEN(CustomerPONumber))

from FactResellerSales

This returns a value of 12. This represents number of character in the string and not the numnber of bytes.

Based on these computations, we have approximately 31 characters of NVARCHAR types. With Unicode compression, we will expect that the average length of the row will reduce further by 31 bytes (i.e. we will save 1 byte per UCS-2 representation). To measure this, we will attach the identical copy of AdventureWorks2008 database to two different instances of SQL Server, one instance of SQL 2008RTM and other instance of SQL 20088R2 and measure the average length of the row after applying the ROW compression. Before applying the ROW compression, let us measure the average length of the row in both instances using the following query

select avg_record_size_in_bytes

from sys.dm_db_index_physical_stats (db_id('AdventureWorksDW2008'),

object_id('dbo.factresellersales'),

null, null, 'SAMPLED')

where object_name (object_id) like '%ResellerSales%' AND index_id =1

In both instances, the average length of the row comes out to be around 191 bytes. This should not surprise you as there are no changes to the uncompressed data. Now, let us execute the command to enable ROW compression in each of the instances using the following command

alter table dbo.factresellersales rebuild

with (

data_compression = ROW)

And, now let us find the average row length in each of the instances. Since Unicode compression is available as part of ROW compression, we will expect that the average length of the row will be around 31 bytes less in SQL2008R2. Let us run the following query again to find the average row length after the row compression

select avg_record_size_in_bytes

from sys.dm_db_index_physical_stats (db_id('AdventureWorksDW2008'),

object_id('dbo.factresellersales'),

null, null, 'SAMPLED')

where object_name (object_id) like '%ResellerSales%' AND index_id =1

Here are the vallues returned

· Average Row length in SQL2008 after row compression = 132 (around 30% compression)

· Average Row length in SQL2008R2 after row compression = 100 (around 45% compression). This extra savings of 32 bytes is what we had expected.

As you can see that in this example,the unicode comrpession gave us extra 15% compression with ROW compression. We tried the Unicode compression with some in-house customer workloads and we observed the compression savings as shown in the table below. Of course, your savings will depend on your data and the schema but the key point to know is that space savings with Unicode compression can be significant if you consider tables in tera-byte range. In our testing we found that the additional impact on CPU is minimal.

ROW Compression

ROW with UNICODE

9%

43%

30%

53.2%

45%

64%

30%

45%

I enccourage you to download CTP2 verision (https://www.microsoft.com/sqlserver/2008/en/us/r2.aspx) of SQL2008R2 and see for yourself the kind of savings you can achieve with your workload.

Thanks

Sunil Agarwal