SQL Server 2019 CTP 2.0 New Features – Columnstore support for sp_estimate_data_compression_savings

Next up for new Storage Engine area features in SQL Server 2019 is an update to the sp_estimate_data_compression_savings stored procedure to add support for both columnstore and columnstore archive compression. This stored procedure has been around since SQL Server 2008 and is handy when you are evaluating whether to compress tables and indexes. The procedure takes five parameters: schema_name, object_name, index_id, partition_number and data_compression. Prior to SQL Server 2019, the data_compression parameter could be 'ROW', 'PAGE' or 'NONE' only. In SQL Server 2019 CTP 2.0, we added two new options here: 'COLUMNSTORE' and 'COLUMNSTORE_ARCHIVE'.

With rowstore tables and indexes, the way you apply compression is to rebuild the object using the desired compression state. Here's an example that will rebuild a heap table with row compression:

ALTER TABLE Production.TransactionHistory REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = ROW);
GO

Compression ratios are heavily dependent on the type of data that's being stored in the table, so the only good way to estimate compression space savings is to actually compress a sampling of the data and see what the result is. The way sp_estimate_data_compression_savings works for rowstore compression types is to take a sampling of pages from the source object (up to 5,000 pages) and create a sample object in tempdb that mimics the source object. The sample object is then rebuilt using the desired compression state. An estimated compression ratio is calculated by comparing the original sample object size with the compressed sample object size. This ratio is then used to project the estimated size of the entire source object if the desired compression setting is applied. Keep in mind this can also be used to estimate the size of an object if compression is removed by supplying a data_compression parameter value of 'NONE' for a source object that already has compression applied.

For columnstore compression, you don't really "enable" it, you create a new columnstore index that leverages either standard columnstore or columnstore archive compression. Because of this, the logic of sp_estimate_data_compression_savings when using one of the columnstore options is different than when using one of the rowstore options. When you use a data_compression parameter value of 'COLUMNSTORE' or 'COLUMNSTORE_ARCHIVE' for a rowstore source object, we're not comparing the same rowstore object with different compression states, we're comparing the rowstore object to an equivalent columnstore object. The process is roughly the same: sample pages and create a sample object in tempdb that mimics the source object, build an equivalent columnstore index using the same data, compare the size of the source object with the size of the columnstore index to estimate the compression ratio.

So, what do we consider an "equivalent" columnstore object? We had to make some assumptions here to come up with a suitable columnstore reference object based on the source object. The following table illustrates the different reference object types we use to calculate the compression ratio. These apply to both 'COLUMNSTORE' and 'COLUMNSTORE_ARCHIVE' data_compression options:

Source Object Reference Object
Heap Clustered columnstore index
Clustered index Clustered columnstore index
Non-clustered index Non-clustered columnstore index (including the key columns and any included columns of the provided non-clustered index, as well as the partition column of the table, if any)
Non-clustered columnstore index Non-clustered columnstore index (including the same columns as the provided non-clustered columnstore index)
Clustered columnstore index Clustered columnstore index

 

You can also estimate in the other direction from columnstore to rowstore compression states. These apply to all the rowstore options, 'ROW', 'PAGE' and 'NONE':

 

Source Object Reference Object
Clustered columnstore index Heap
Non-clustered columnstore index Non-clustered index (including the columns contained in the non-clustered columnstore index as key columns, and the partition column of the table, if any, as an included column)

 

If you use the 'COLUMNSTORE' or 'COLUMNSTORE_ARCHIVE' options for an existing columnstore source object, the procedure works the same as before, it will create a sample columnstore index and rebuild that index with the desired compression state.

Nothing comes for free of course, and while data compression can save a lot of space and therefore a lot of I/O (both logical and physical), the trade off is generally more CPU to compress and decompress the data. In some cases, it may cause performance issues, particularly for data that is heavily updated. This is doubly so for columnstore. These indexes are designed for mostly read data, and heavy updates against columnstores make the structures much less efficient. Leveraging sp_estimate_data_compression_savings along with careful analysis of your workload patterns can help you make the right index and compression choices for your data and application.