Columnstore index, both clustered and nonclustered, can get fragmented like any other index. SQL Server 2016 allows customer to defragment this index using the familiar ALTER INDEX <index-name> REORGANIZE command instead of using a heavy hammer approach of rebuilding the index. Both create/rebuild of columnstore index is an OFFLINE operation at this time but you can defragment columnstore index ONLINE.
A columnstore index is considered fragmented if it has
- Multiple delta rowgroups. The delta rowgroups are not necessarily a negative as they can be used to minimize the impact on transactional workload in real-time operational analytics. If your goal is to speed up the analytics queries to what is possible, then you can get the best performance when all the rows are in the compressed rowgroups. Besides speeding up the queries, compressed rowgroups typically take 10x less storage than delta rowgroup and proportionally less memory when they are scanned.
- Deleted rows: when a rowgroup is compressed, it is marked read-only. Unlike when the rows are stored in rowstore in pages, there is no easy way to remove compressed row. I am sure you don’t want uncompress a rowgroup, delete the row and then re-compress it as this will be prohibitively expensive. What SQL Server does instead it store the information to identify the row in an internal btree structure referred to as delete-bitmap and delete-buffer. While this is more efficient for the delete operation, all analytic queries must filter the deleted rows out by internally executing ‘anti-semijoin’ before the query results can be returned. Ideally, you don’t want any deleted rows.
- Ideal compressed rowgroup has a size of 1 million rows (1024*1024 = 1048576) and most compressed rowgroups are of this size but they can be smaller for multiple reasons. For example, if you are bulk importing data and your batchsize is set to 102400, it will generate compress rowgroup of size 102400. You could choose a batchsize in the multiples of 1 million but more often than not, you have little control on how the source files are generated. There are also situations such as hitting the dictionary size limit where you can’t really much to change it. Please refer to https://blogs.msdn.microsoft.com/sqlserverstorageengine/2014/05/03/clustered-column-store-factors-that-impact-size-of-a-rowgroup/ for some examples.
Let us now look at how you can use REORGANIZE command to defragment your columnstore index. Note, this command is only supported for clustered columnstore index (CCI) and nonclustered columnstore index for disk-based tables. In the example below, I create an empty table and then create a clustered columnstore index and finally I load 300k rows. SQL Server 2016 loads data from staging table into CCI in parallel when you specify TABLOCK hint. The machine I ran this test on has 4 logical processors so the 300k rows got divided into 75k each between 4 threads. Since each thread was loading < 102400 rows, the columnstore index ends up with 4 delta rowgroups as shown below.
Now suppose I don’t need to load any more data, but I want to compress these rows into compressed rowgroups. The new choice in SQL Server 2016 is to use REORGANIZE command as shown below. Note, you need to use the option COMRPESSS_ALL_ROW_GROUPS to compress delta rowgroups.
-- run this to compressed OPEN RGs
alter index ccitest_temp_cci on ccitest_temp reorganize with (COMPRESS_ALL_ROW_GROUPS = ON)
Now, if you look at the rowgroups again, you will see that delta rowgroups have been compressed. Existing delta rowgroups transition into TOMBSTONE state and removed later when there is no reference to them.
While the compression is good, but notice that each of the compressed rowgroup has only 75k rows. Ideally, you want to combine all these rowgroups into 1 compressed rowgroup. Now, let us delete some rows. You will see that the 30000 rows are marked deleted in the first compressed rowgroup but the original 75k rows still remain compressed. This is because compressed rowgroups are read-only.
-- delete some rows
delete top (30000) from ccitest_temp
You can do that by running the command. Note, I did not use option COMRPESSS_ALL_ROW_GROUPS as I only want to combine the smaller compressed rowgroups.
alter index ccitest_temp_cci on ccitest_temp reorganize
The resultant rowgroups shows 1 single compressed rowgroup with 270k rows and all other compressed rowgroups are ‘tombstone’d meaning they are ready to be garbage collected and removed in the background.
This was a simple example to show how REORGANIZE command can combine multiple compressed rowgroups into one and remove deleted rows. REORGANIZE command follows a MERGE policy on how it chooses to remove deleted rows or combine the compressed rowgroups. The merge policy used by REORGANIZE command is described @ https://blogs.msdn.microsoft.com/sqlserverstorageengine/2016/03/08/columnstore-index-merge-policy-for-reorganize/
In summary, for common customer scenarios, once you create CCI or NCCI which is an OFFLINE operation, you can maintain the columnstore index ONLINE using REORGANIZE command. Note, this command is NOT supported on memory-optimized tables. For memory-optimized tables, when the number of deleted rows exceed an internal threshold, the compression rowgroup is removed after re-inserting the remaining rows into the ‘tail’ (i.e. the virtual delta rowgroup).