Aggregation WITH ROLLUP

Article
09/21/2007

In this post, I'm going to discuss how aggregation WITH ROLLUP works. The WITH ROLLUP clause permits us to execute multiple "levels" of aggregation in a single statement. For example, suppose we have the following fictitious sales data. (This is the same data that I used for my series of posts on the PIVOT operator.)

CREATE TABLE Sales (EmpId INT, Yr INT, Sales MONEY)
INSERT Sales VALUES(1, 2005, 12000)
INSERT Sales VALUES(1, 2006, 18000)
INSERT Sales VALUES(1, 2007, 25000)
INSERT Sales VALUES(2, 2005, 15000)
INSERT Sales VALUES(2, 2006, 6000)
INSERT Sales VALUES(3, 2006, 20000)
INSERT Sales VALUES(3, 2007, 24000)

We can write a simple aggregation query to compute the total sales by year:

SELECT Yr, SUM(Sales) AS Sales
FROM Sales
GROUP BY Yr

As expected, this query returns three rows - one for each year:

 Yr          Sales
----------- ---------------------
2005        27000.00
2006        44000.00
2007        49000.00

The query plan is a simple stream aggregate:

|--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
       |--Stream Aggregate(GROUP BY:([Sales].[Yr]) DEFINE:([Expr1010]=COUNT_BIG([Sales].[Sales]), [Expr1011]=SUM([Sales].[Sales])))
            |--Sort(ORDER BY:([Sales].[Yr] ASC))
                 |--Table Scan(OBJECT:([Sales]))

Now suppose that we want to compute not just the sale by year but the total sales as well. We could write a UNION ALL query:

SELECT Yr, SUM(Sales) AS Sales
FROM Sales
GROUP BY Yr
UNION ALL
SELECT NULL, SUM(Sales) AS Sales
FROM Sales

This query works and does give the correct result:

 Yr          Sales
----------- ---------------------
2005        27000.00
2006        44000.00
2007        49000.00
NULL        120000.00

However, the query plan performs two scans and two aggregations (one to compute the sales by year and one to compute the total sales):

|--Concatenation
       |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1023]=(0) THEN NULL ELSE [Expr1024] END))
       |    |--Stream Aggregate(GROUP BY:([Sales].[Yr]) DEFINE:([Expr1023]=COUNT_BIG([Sales].[Sales]), [Expr1024]=SUM([Sales].[Sales])))
       |         |--Sort(ORDER BY:([Sales].[Yr] ASC))
       |              |--Table Scan(OBJECT:([Sales]))
       |--Compute Scalar(DEFINE:([Expr1010]=NULL))
            |--Compute Scalar(DEFINE:([Expr1009]=CASE WHEN [Expr1025]=(0) THEN NULL ELSE [Expr1026] END))
                 |--Stream Aggregate(DEFINE:([Expr1025]=COUNT_BIG([Sales].[Sales]), [Expr1026]=SUM([Sales].[Sales])))
                      |--Table Scan(OBJECT:([Sales]))

We can do better by adding a WITH ROLLUP clause to the original query:

SELECT Yr, SUM(Sales) AS Sales
FROM Sales
GROUP BY Yr WITH ROLLUP

This query is simpler to write and uses a more efficient query plan with only a single scan:

|--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1005]=(0) THEN NULL ELSE [Expr1006] END))
       |--Stream Aggregate(GROUP BY:([Sales].[Yr]) DEFINE:([Expr1005]=SUM([Expr1007]), [Expr1006]=SUM([Expr1008])))
            |--Stream Aggregate(GROUP BY:([Sales].[Yr]) DEFINE:([Expr1007]=COUNT_BIG([Sales].[Sales]), [Expr1008]=SUM([Sales].[Sales])))
                 |--Sort(ORDER BY:([Sales].[Yr] ASC))
                      |--Table Scan(OBJECT:([Sales]))

The bottom stream aggregate in this query plan is the same as the stream aggregate in the original non-ROLLUP query. This aggregation is a normal aggregation and, as such, it can be implemented using a stream aggregate (as in this example) or a hash aggregate (try adding an OPTION (HASH GROUP) clause to the above query). It can also be parallelized.

The top stream aggregate is a special aggregate that computes the ROLLUP. (Unfortunately, in SQL Server 2005 there is no way to discern from the query plan that this aggregate implements a ROLLUP. This issue will be fixed in SQL Server 2008 graphical and XML plans.) A ROLLUP aggregate is always implemented using stream aggregate and cannot be parallelized. In this simple example, the ROLLUP stream aggregate merely returns each pre-aggregated input row while maintaining a running total of the Sales column. After outputting the final input row, the aggregate also returns one additional row with the final sum. Since SQL lacks a concept of an ALL value, the Yr column is set to NULL for this final row. If NULL is valid value for Yr, we can identify the ROLLUP row using the GROUPING(Yr) construct.

SELECT
      CASE WHEN GROUPING(Yr) = 0
            THEN CAST (Yr AS CHAR(5))
            ELSE 'ALL'
      END AS Yr,
      SUM(Sales) AS Sales
FROM Sales
GROUP BY Yr WITH ROLLUP

 Yr    Sales
----- ---------------------
2005  27000.00
2006  44000.00
2007  49000.00
ALL   120000.00

We can also compute multiple ROLLUP levels in a single query. For example, suppose that we want to compute the sales first by employee and then for each employee by year:

SELECT EmpId, Yr, SUM(Sales) AS Sales
FROM Sales
GROUP BY EmpId, Yr WITH ROLLUP

 EmpId       Yr          Sales
----------- ----------- ---------------------
1           2005        12000.00
1           2006        18000.00
1           2007        25000.00
1           NULL        55000.00
2           2005        15000.00
2           2006        6000.00
2           NULL        21000.00
3           2006        20000.00
3           2007        24000.00
3           NULL        44000.00
NULL        NULL        120000.00

There are a couple of points worth noting about this query. First, since the combination of the EmpId and Yr columns is unique, in the absence of the WITH ROLLUP clause, this query would just return the original data. However, with the WITH ROLLUP clause the query produces a useful result. Second, the order of the columns in the GROUP BY clause is relevant with the WITH ROLLUP clause. To see why simply try the same query but reverse the EmpId and Yr columns. Instead of computing the sales first by employee it will compute the sales first by year.

The query plan for this query is identical to the query plan for the prior query except that it groups on both the EmpId and Yr columns instead of on just the EmpId column. Like the prior query plan, this query plan includes two stream aggregates: the bottom one which is a normal stream aggregate and the top one which computes the ROLLUP. This ROLLUP stream aggregate actually computes two running totals: one which computes the total sales for an employee for all years and one which compute the total sales for all employees and all years. This table shows how the ROLLUP computation proceeds:

EmpId	Yr	SUM(Sales) BY EmpId, Yr	SUM(Sales) BY EmpId	SUM(Sales)
1	2005	12000.00	12000.00	12000.00
1	2006	18000.00	30000.00	30000.00
1	2007	25000.00	55000.00	55000.00
1	NULL		55000.00	55000.00
2	2005	15000.00	15000.00	70000.00
2	2006	6000.00	21000.00	76000.00
2	NULL		21000.00	76000.00
3	2006	20000.00	20000.00	96000.00
3	2007	24000.00	44000.00	120000.00
3	NULL		44000.00	120000.00
NULL	NULL			120000.00

In my next post, I'll take a look at the WITH CUBE clause. I'll discuss how it differs from WITH ROLLUP both in terms of function and in terms of its implementation.

Aggregation WITH ROLLUP

Additional resources