Columnstore Index Merge Policy for REORGANIZE

In the blog https://blogs.msdn.microsoft.com/sqlserverstorageengine/2016/03/07/columnstore-index-defragmentation-using-reorganize-command/ , we discussed how REORGANIZE command can be used to defragment a columnstore index.  In this blog, we will describe how REORGANIZE processes the compressed rowgroups.

A compressed rowgroup is considered as fragmented when any of the following two conditions is met

  • Less than 1 million rows but the trim_reason ( please refer to https://msdn.microsoft.com/en-us/library/dn832030.aspx ) is other than DICTIONARY_SIZE. If the number of rows in the compressed rowgroup is lower that 1 million because it has reached the maximum dictionary size, then nothing can be done to increase the number of rows.
  • It has nonzero deleted rows that exceeds a minimum threshold.

When REORGANNIZE command runs, it applies the following policy for each of the eligible compressed rowgroups

  • Combine one or more compressed rowgroups such that total number of rows <= 1,024,576.  For example, if you bulk import 5 batches of size 102400, you will get 5 compressed rowgroups. Now if you run REORGANIZE command, these rowgroups will get merged into 1 compressed rowgroup of size 512000 rows assuming there were no dictionary size or memory limitation.
  • When > 102400 rows are marked deleted, a compressed RG is eligible for self-merge. For example, if a compressed row group of 1,048,576 million rows has 110k rows deleted, we can remove the deleted rows and recompress the rowgroup with the remaining rows. It saves on the storage by removing deleted rows
  • If there is a choice between self-merge or merging two or more rowgroups, the merging of two or more rowgroups is favored. For example, if there are two compressed rowgroups RG1 with 500k rows and RG2 with 1,048,576 rows but 60% of the rows are deleted. In this case, instead of self-merging RG2, the merge policy will combine RG1 and RG2 into one compressed rowgroup

Here are some examples

Example: The compressed rowgroup has 10% of 1048576 rows marked as deleted, it qualifies for self-merge to remove the deleted rows. If only say 50k rows were marked deleted, it won't be eligible for self-merge.

Current size Deleted rows Self-Merge
300k 120 k Yes

Example: If a compressed rowgroup has < 90% of the 1048576 rows after accounting for deleted rows, it qualifies for the merge. Here are some examples

Rowgroup1 Rowgroup2 Merge?
950k 920k No (don’t qualify for merge)
900k 900k No (we can’t exceed 1 million in the merge size)
400k 500k Yes, 900k
1 million (200k deleted) 500k (300k deleted) Yes, 1 million

Example:

If you have three rowgroups that qualify for merge, they are considered for merge in sequential order. For example, rowwgroup1 (100k), rowgroup2 (850k) and rowgroup3 (350k) - we will merge the first two qualifying ones.

This  merge policy is just a heuristic and we will evolve it as our experience with customer workload grows.

Thanks

Sunil