Most of us are familiar with data stored as rows. A row stores all the columns either by value or by reference. One or more rows are packed into pages and stored on the media. When a row needs to be accessed, the page containing the row is brought into memory. Relational database systems have used this model since 1970s very successfully both for transactional and analytics workload. So the question really is what is so different today?
Well, the rowstore model continues to work well for transactional workload where a typical transaction touches only a small set of rows. For example, consider an order management application where each customer places his/her order and checks its status. The traditional rowstore model provides very efficient execution of DML (i.e. Delete, Modify and Insert) operations as well as for fetching a small set of rows. However, rowstore increasingly is falling short in delivering acceptable performance for analytic queries which need to process a large number of rows (think billions) in today’s world where the volume of data to be processed is increasing exponentially. Some examples, data generated by machines in a factory could be used to predict when it needs to be serviced to avoid downtime or credit card transactions that could be used to understand the buying patterns of consumers and could be used to provide targeted promotion. Most IT departments are faced with the challenge of managing such large volume of data and to provide interactive response time (think seconds) for analytics. Columnstore index is one of the key components to help achieve these objectives.
What is columnstore?
The columnstore represents a new storage format of storing relational data. It stores data as columns instead of rows as shown in the picture below representing a relational table with 5 columns. Note, though the data is stored as columns, you can still query the table to fetch the row with all its columns but it is more expensive to rowstore where all the columns are stored together.
You may be surprised to know that column storage was first introduced in 1969 https://en.wikipedia.org/wiki/Column-oriented_DBMS but only in the last 10 years, it has been adopted in commercial databases.
The columnstore storage offers three key benefits
- Reduced Storage: Since data is stored as individual columns, it compresses really well as all values are drawn from the same domain (i.e data type) and in many cases, the values repeat or are similar. The compression will depend on the data distribution but typical compression that we have seen is around 10x. This is significant because it enables you reduce the storage footprint of your database significantly.
- Reduced IO due to compression: Analytics queries typically process a large number of rows and are commonly IO bound. A quick way to speed these queries is by reducing IOs. High data compression speeds up analytic queries because of the proportional reduction in IO as well the increased likelihood that the requested data may be found in-memory.
- Only Referenced columns need to be fetched: Most analytics queries fetch/process only a small set of columns. If you consider a typical star schema https://en.wikipedia.org/wiki/Star_schema, the FACT table is the one with most rows and it has large number of columns. With columnstore storage, SQL Server needs to fetch only the referenced columns unlike rowstore where the full row needs to be fetched regardless of number of columns referenced in the query. For example, consider a FACT table with 100 columns and an analytic query accessing this tables references only 5 columns. Now, by fetching only the referenced columns, you can potentially reduce IO by 95% with simplifying assumption that all columns take same storage. Note, this is on top of already 10x data compression provided by columnstore.
In subsequent blogs, I will describe how columnstore technology has been implemented in SQL Server to deliver the industry leading performance for analytics.