Compression: Whats the difference between SQL2005 and SQL2008?

Looks like some folks are confused about the functionality and differences so let me try and explain and give some pointers to great blog entries for more details.

 

SQL2005

Compression was added to SQL Server 2005 as part of SP2 and it was added via a new "type". The primary goal here was to catch up on some of the changes in customer apps during the development life of SQL 2005. One of the problems with a 5 yr release is at some point you get locked into your feature set and just have to ship, but over the course of a long release its possible and in fact almost certain that there will be changes in apps that you are unable to react to. One of the big changes between 2000 and 2005 was the amount of data being stored, especially in Data Warehouses. Fact tables with millions or even billions of rows have become common place and the cost of storing and backing up those fact tables gets to be a real problem.

Hence the VarDecimal type was introduced, to put it simply it strips off unused bytes(leading and trailing) from numbers stored in a given column. Its a relatively simple change and hence viable in an SP, but it can make a big difference, I've seen reports of everything from 10% to 60% saving in the space requirements for a fact table.(makes sense if you think about a classic fact table which has lots of dimension columns which are likely int based surrogate keys and then lots of float columns for storing measure data)

Sunil has a great series of blog posts on VarDecimal over on the storage engine blog, you can find them here.

 

SQL2008

Ok so now we have SQL2005's implementation, what was done and why for SQL2008?

First of all I want to say I am very happy that this feature(set of improvements) looks like it has made it into the product, I remember all the debates about trying to add compression late in SQL 2005 and then when doing planning for SQL2008, there was a surprising amount of passion including push back to not do the feature, but at some point it crossed over into the "we just gotta do this no matter what" bucket and bypassed a bunch of approval process. From what I have seen of the implementation (and I have not yet tried it on any large systems) so far the team knocked it out of the park of completeness and flexibility while keeping it simple.

There are 2 basic types of compression in SQL 2008, data and backup:

 

DATA COMPRESSION

This can occur at 2 levels, the first is row level. To put it simply row compression uses a variety of techniques to more efficiently store a row of data, it does htis by trying to reduce data storage to the min number of bits/bytes needed. In some ways think of this as a VAR row as a VARCHAR is to a CHAR, row compression can reduce the data and metadata overhead at the row level depending on actual data. The next option is page, page level compression includes row compression and adds to other techniques, prefix compression and dictionary compression.

I think about dictionary compression as a way to per page eliminate duplicate data by adding an entry to the data in the dictionary and then referencing it in each column. Think about a Data Warehouse dimension table in a star (rather than snowflake) schema, there is likely to be lots of data thats duplicated, thinking about Safeway/Costco/Best Buy etc like retail chain, these companies can have product dimensions with millions of members, likewise customer dimensions. In the case of product its likely that names like Nabisco, Sony etc appear more than the odd one or two times, in the case of customer, state and town names are going to appear a LOT. In all these cases these names should appear once per page and every other instance effectively becomes a pointer.

For prefix compression this is simply defining a common root for words/data and making that a reference and then explicitly storing the suffix. Another good example here could be SKU/Product codes in a DW where the root of an 11-17 char code could have 6/7 common chars.

Now both of these are simplistic explanations of what's going on and I'm dragging up memories of early conversations with folks in the Storage Engine on this from 2 years ago. Chad Boyd has a much deeper explanation here.

 

BACKUP COMPRESSION

There is not a lot to say here, which is one of the reasons I love the feature, its either on or off and it compresses the backup stream before it hits the disk. When we were originally looking at this feature there was something like 4 different levels of compression and all sorts of trade off on compression vs CPU, buts it my understanding that as it will appear all the options have been removed and the system will take care of choosing the right compression ratio. I love it! (If you have ever heard Dave Campbell's TV technology through ages analogy this is another classic example of removing the horizontal and vertical hold options)

 

WHY?

Disk is cheap right? I can get 1tb drives from NewEgg for less than a tank of gas (I suspect that's actually true in WA state this week) Well some disks are cheap, the price and performance of SATA drives these days is certainly a wowser, but SANs are still very expensive and the amount of data we are storing is increasing at a scary rate. So its actually more important than you would think. However there is another reason and that's simple maths, data costs money, it costs to store on disk, it costs to read from disk, it costs to store in memory, it costs to write to disk, etc etc, you get the idea. Hence what compression in SQL Server allows us to do is to perform the classic compression trade off and that's CPU vs storage. If you compress something it takes CPU to perform compression, but then you read and write less data. Given the amount of CPU/Core power available these days it seems very viable to give you the choice to trade off some spare CPU for some better memory and IO usage.

Sunil is starting a new series of compression posts here, if its like his vardecimal posts they are going to be great reading.

 

So how should you think about the SQL2005 vs SQL2008 versions of compression? Well I'm in a car analogy mode this week as I have been catching up on my Top Gear watching, I think of 2005 as an American muscle car from the 70s, Charger/Corvette/Mustang/... its not subtle and its not going to be everyone's cup of tea but its brute force approach can be very effective and appealing. For 2008 there is a lot more finesse, more technology involved it feels warmer and fuzzier but WOW when it gets going, new BMW M5 anyone?

 

Let car flames commence...