Aggregation WITH CUBE
In my last post, I wrote about how aggregation WITH ROLLUP works. In this post, I will discuss how aggregation WITH CUBE works. Like the WITH ROLLUP clause, the WITH CUBE clause permits us to compute multiple "levels" of aggregation in a single statement. To understand the difference between these two clauses, let's look at an example. We'll use the same fictitious sales data from last week's example.
CREATE TABLE Sales (EmpId INT, Yr INT, Sales MONEY)
INSERT Sales VALUES(1, 2005, 12000)
INSERT Sales VALUES(1, 2006, 18000)
INSERT Sales VALUES(1, 2007, 25000)
INSERT Sales VALUES(2, 2005, 15000)
INSERT Sales VALUES(2, 2006, 6000)
INSERT Sales VALUES(3, 2006, 20000)
INSERT Sales VALUES(3, 2007, 24000)
Consider the following query from last week:
SELECT EmpId, Yr, SUM(Sales) AS Sales
FROM Sales
GROUP BY EmpId, Yr WITH ROLLUP
It will be easier to see what is happening if we pivot the sales data:
|
Yr | ||||
2005 |
2006 |
2007 |
ALL | ||
EmpId |
1 |
12000.00 |
18000.00 |
25000.00 |
55000.00 |
2 |
15000.00 |
6000.00 |
|
21000.00 | |
3 |
|
20000.00 |
24000.00 |
44000.00 | |
ALL |
|
|
|
120000.00 |
The table clearly shows that the WITH ROLLUP clause computes the total for each employee for all years and the grand total for all employees and all years. The query does not compute the totals for each year for all employees. Moreover, the order of the columns in the GROUP BY clause determines in which order the data is totaled.
Now let's repeat the same query but replace the WITH ROLLUP clause with a WITH CUBE clause:
SELECT EmpId, Yr, SUM(Sales) AS Sales
FROM Sales
GROUP BY EmpId, Yr WITH CUBE
This query computes all possible sub-totals and totals:
|
Yr | ||||
2005 |
2006 |
2007 |
ALL | ||
EmpId |
1 |
12000.00 |
18000.00 |
25000.00 |
55000.00 |
2 |
15000.00 |
6000.00 |
|
21000.00 | |
3 |
|
20000.00 |
24000.00 |
44000.00 | |
ALL |
27000.00 |
44000.00 |
49000.00 |
120000.00 |
Because the WITH CUBE clause causes the query to compute all possible totals, the order of the columns in the GROUP BY clause does not matter. Of course, by default, SQL Server does not pivot the results of either of the above queries. Here is the actual output from the WITH CUBE query:
EmpId Yr Sales
----------- ----------- ---------------------
1 2005 12000.00
1 2006 18000.00
1 2007 25000.00
1 NULL 55000.00
2 2005 15000.00
2 2006 6000.00
2 NULL 21000.00
3 2006 20000.00
3 2007 24000.00
3 NULL 44000.00
NULL NULL 120000.00
NULL 2005 27000.00
NULL 2006 44000.00
NULL 2007 49000.00
Next, let's look at the query plan for the WITH CUBE query:
|--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1005]=(0) THEN NULL ELSE [Expr1006] END))
|--Concatenation
|--Stream Aggregate(GROUP BY:([Sales].[EmpId], [Sales].[Yr]) DEFINE:([Expr1005]=SUM([Expr1007]), [Expr1006]=SUM([Expr1008])))
| |--Sort(ORDER BY:([Sales].[EmpId] ASC, [Sales].[Yr] ASC))
| |--Table Spool
| |--Stream Aggregate(GROUP BY:([Sales].[Yr], [Sales].[EmpId]) DEFINE:([Expr1007]=COUNT_BIG([Sales].[Sales]), [Expr1008]=SUM([Sales].[Sales])))
| |--Sort(ORDER BY:([Sales].[Yr] ASC, [Sales].[EmpId] ASC))
| |--Table Scan(OBJECT:([Sales]))
|--Compute Scalar(DEFINE:([Expr1012]=NULL))
|--Stream Aggregate(GROUP BY:([Sales].[Yr]) DEFINE:([Expr1005]=SUM([Expr1007]), [Expr1006]=SUM([Expr1008])))
|--Table Spool
This plan consists of two parts. SQL Server has effectively rewritten our query as follows:
SELECT EmpId, Yr, SUM(Sales) AS Sales
FROM Sales
GROUP BY EmpId, Yr WITH ROLLUP
UNION ALL
SELECT NULL, Yr, SUM(Sales)
FROM Sales
GROUP BY Yr
The first part of the plan computes the result for the WITH ROLLUP query above. I described how this query plan works in last week's post. The second part of this plan computes the missing year sub-totals yielding the entire CUBE result. Note that this plan employs a common sub-expression spool. As I discussed in this post, a common sub-expression spool copies its input rows into a worktable and then reads and returns the rows from the worktable multiple times - in this case twice. The spool is meant to improve performance although, in this example, it has little impact since the server could just as easily have re-read the original Sales table. However, if the input to the aggregation was more complex and cost more to evaluate, the spool would help.
If we use the WITH CUBE clause when aggregating on more than two columns, SQL Server simply generates increasingly complex plans with additional inputs to the concatentation operator. As with the simple two column example, the idea is to compute the whole CUBE by computing all of the individual ROLLUPs that compose it.
Finally, we can actually combine WITH CUBE and PIVOT to generate the above table in a single simple statement. (I actually proposed a variation of this query in an answer to a reader's comment on my post about the PIVOT operator but I like this solution better.)
SELECT EmpId, [2005], [2006], [2007], [ALL]
FROM
(
SELECT
CASE WHEN GROUPING(EmpId) = 0
THEN CAST (EmpId AS CHAR(7))
ELSE 'ALL'
END AS EmpId,
CASE WHEN GROUPING(Yr) = 0
THEN CAST (Yr AS CHAR(7))
ELSE 'ALL'
END AS Yr,
SUM(Sales) AS Sales
FROM Sales
GROUP BY EmpId, Yr WITH CUBE
) AS s
PIVOT (SUM(Sales) FOR Yr IN ([2005], [2006], [2007], [ALL])) AS p
Here is the output from this query:
EmpId 2005 2006 2007 ALL
------- --------------------- --------------------- --------------------- ---------------------
1 12000.00 18000.00 25000.00 55000.00
2 15000.00 6000.00 NULL 21000.00
3 NULL 20000.00 24000.00 44000.00
ALL 27000.00 44000.00 49000.00 120000.00