Types of data compression in SQL Server 2008

SQL Server deploys two strategies to compress the data

· First, it stores all fixed length data types in variable length format. If you recall, SQL Server 2005/SP2 provided a new variable length storage format decimal and numeric. Please refer to the series of blogs vardecimal-storage-format for details. SQL Server 2008 extends this to all fixed length data types. Some examples of fixed length data types are integer, char, and float data types . One important point to notes is that even though, the SQL Server stores these data types in variable length format, the semantics of the data type remains unchanged (i.e. it is still the fixed length data type from the perspective of the application). This means that you can avail the benefits for data compression without requiring any changes in your application(s).

 

Let us consider some examples to illustrate the space savings you can get by storing data in variable length storage format.

 

Example-1: Let us say you have a table that has an integer column along with many other columns. If the common integer value of this column can fit in 1 byte (i.e. the value is <= 255), then there is no reason to take 4 bytes to store it. With variable length storage format, SQL Server can shave off 3 bytes from most integer values for this case there by saving you 3 bytes off 4 byte value, a space saving of 75%.

 

Example-2: Let us say you have a table that has a CHAR(100) column. In fixed length format, this column will take 100 characters independent of the actual value stored in it. So if you store “Hello” or “This is a longer string”, both values will take space required to store 100 characters. However, when CHAR (100) is stored in variable length storage format, it will only take 5 characters for the first value (“hello”) and 23 characters for the second value (“This is a longer string”). You can see in this case, we are able to reduce the size of the first value by 95% and for the second value by 77%. A significant space saving.

 

Of course, if the character value in example-2 has many more characters or if the common integer values in example-1 take 2 or more bytes, the space savings will be lower. This means the space savings you can achieve will depend on the data distribution. Other point that is not so obvious is that the space savings achieved will also depend on the schema of the table as well. So for example, if the column was declared to be CHAR (150) instead of CHAR (100), you would get significantly more space (i.e. additional 50 characters) savings. Also, note as a special case, the 0s and NULL values require no storage.

 

Storing values in variable length format will typically get you some space savings, but there is a catch. When a fixed length value is stored in variable length format, the SQL Server needs to store its offset within the row. Note, this offset was not needed when the column value was stored as fixed length because its relative position in the row was always at the same offset. If you recall the record storage format in SQL Server record-structure-blog-post, it takes 2 bytes to store the offset of each variable length column. Given this, it makes little sense to create a column of type varchar(2) as the minimum overhead is 2 bytes already. Same is true for data type such as smallint or even for integer for that matter. SQL Server 2008 addresses this issue by introducing a new record format to be used for compressed data that minimizes the overhead of storing the variable length values. It uses only 4 bits of overhead to store the length of the variable length column that is less than or equal to 8 bytes. Note that the previous record format is fully supported in SQL Server 2008 and is the default record format.

 

This strategy is exposed as ROW compression externally through DDL. Also note that ROW compression is a superset of vardecimal storage format and that vardecimal storage format is fully supported in SQL Server 2008.

 

· Second, it minimizes the data redundancy in columns in one or more rows on a given page. It does that by storing the redundant data once on the page and then referencing it from the multiple columns. So for example, consider the following table

 

Table employee (name varchar(100),

                              status varchar (10) default ‘full time’)

 

Now, if you insert multiple rows into this table, it is possible that many rows will have the same value (i.e. the default) for status column. SQL Server can take advantage of this by storing ‘full time’ value once on the page and then referring to this value from all other occurrences of this value on the same page. As you can imagine, the space savings using this strategy will depend on the amount redundant data on the page.

 

This strategy is exposed as PAGE compression externally through DDL. A customer can choose to enable ROW or PAGE compression on a table or an index or even on an individual partition(s). PAGE compression includes ROW compression.

 

I will provide details on PAGE compression later but for now it will suffice to know that ROW compression means storing fixed length data in variable length storage format and PAGE compression means minimizing the data redundancy.

 

As you can imagine, enabling compression is an expensive operation because the uncompressed data needs to be converted to new record format and for PAGE compression, the data redundancy needs to be minimized. So before you enable compression, you may want to know how much space savings you can get. In the next blog, I will describe how to estimate compression savings.