Aggregation

Aggregation refers to the collapse of a larger set of rows into a smaller set of rows.  Typical aggregate functions are COUNT, MIN, MAX, SUM, and AVG.  SQL Server also supports other aggregates such as STDEV and VAR.

I’m going to break this topic down into multiple posts.  In this post, I’m going to focus on “scalar aggregates.”  Scalar aggregates are queries with aggregate functions in the select list and no GROUP BY clause.  Scalar aggregates always return a single row.

Scalar Aggregation

There is only one operator for scalar aggregation: stream aggregate.  For example:

create table t (a int, b int, c int)

select count(*) from t

  |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1005],0)))
|--Stream Aggregate(DEFINE:([Expr1005]=Count(*)))
|--Table Scan(OBJECT:([t]))

This is the aggregation equivalent of “Hello World!”  The stream aggregate just counts the number of input rows and returns this result.  The stream aggregate actually computes the count ([Expr1005]) as a bigint.  The compute scalar is needed to convert this result to the expected output type of int.  Note that a scalar stream aggregate is one of the only examples (maybe the only example – I can’t think of any others right now) of a non-leaf operator that can produce an output row even with an empty input set.

It is easy to see how to implement other simple scalar aggregate functions such as MIN, MAX, and SUM.  We can also calculate multiple scalar aggregates at the same time using a single stream aggregate operator:

select min(a), max(b) from t

  |--Stream Aggregate(DEFINE:([Expr1004]=MIN([t].[a]), [Expr1005]=MAX([t].[b])))
|--Table Scan(OBJECT:([t]))

This plan just reads each row of table t and keeps track of the minimum value of column a and the maximum value of column b.  Note that we do not need to convert the result for the MIN and MAX aggregates since the types of these aggregates are computed based on the types of columns a and b.

Some aggregates such as AVG are actually calculated from two other aggregates such as SUM and COUNT:

select avg(a) from t

  |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1005]=(0) THEN NULL ELSE [Expr1006]/CONVERT_IMPLICIT(int,[Expr1005],0) END))
|--Stream Aggregate(DEFINE:([Expr1005]=COUNT_BIG([t].[a]), [Expr1006]=SUM([t].[a])))
|--Table Scan(OBJECT:([t]))

This time the compute scalar also calculates the average from the sum and count.  The CASE expression is needed to make sure that we do not divide by zero.

Although SUM does not need to be computed per se, we still need the count:

select sum(a) from t

  |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1005]=(0) THEN NULL ELSE [Expr1006] END))
|--Stream Aggregate(DEFINE:([Expr1005]=COUNT_BIG([t].[a]), [Expr1006]=SUM([t].[a])))
|--Table Scan(OBJECT:([t]))

The CASE expression uses the count to ensure that SUM returns NULL instead of zero if we have no rows.

Scalar Distinct

Now let’s take a look at what happens if we add a DISTINCT keyword to the aggregate:

select count(distinct a) from t

  |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1007],0)))
|--Stream Aggregate(DEFINE:([Expr1007]=COUNT([t].[a])))
|--Sort(DISTINCT ORDER BY:([t].[a] ASC))
|--Table Scan(OBJECT:([t]))

This query must only count rows that have a unique value for column a.  We use the sort operator to eliminate rows with duplicate values of column a.  It is easy to remove duplicate rows once we sort the input set since the duplicates will be adjacent to one another.

Not all distinct aggregates require duplicate elimination.  For example, MIN and MAX behave identically with and without the distinct keyword.  The minimum and maximum values of a set remain the same whether or not the set includes duplicate values.

select min(distinct a), max(distinct b) from t

  |--Stream Aggregate(DEFINE:([Expr1004]=MIN([t].[a]), [Expr1005]=MAX([t].[b])))
|--Table Scan(OBJECT:([t]))

If we have a unique index, we also can skip the duplicate elimination since the index guarantees that there are no duplicates:

create unique index ta on t(a)

select count(distinct a) from t

drop index t.ta

  |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1007],0)))
|--Stream Aggregate(DEFINE:([Expr1007]=COUNT([t].[a])))
|--Index Scan(OBJECT:([t].[ta]))

Multiple Distinct

Consider this query:

select count(distinct a), count(distinct b) from t

As we saw above, we can compute “count(distinct a)” by eliminating rows that have duplicate values for column a.  Similarly, we can compute “count(distinct b)” by eliminating rows that have duplicate values for column b.  But, given that these two sets of rows are different, how can we compute both at the same time?  The answer is we cannot.  We must first compute one aggregate result, then the other, and then we must combine the two results into a single output row:

  |--Nested Loops(Inner Join)
|--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1010],0)))
| |--Stream Aggregate(DEFINE:([Expr1010]=COUNT([t].[a])))
| |--Sort(DISTINCT ORDER BY:([t].[a] ASC))
| |--Table Scan(OBJECT:([t]))
|--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[Expr1011],0)))
|--Stream Aggregate(DEFINE:([Expr1011]=COUNT([t].[b])))
|--Sort(DISTINCT ORDER BY:([t].[b] ASC))
|--Table Scan(OBJECT:([t]))

The two inputs to the nested loops join compute the two counts from the original query.  One of the inputs removes duplicates for column a while the other input removes duplicates for column b.  The nested loops join has no join predicate; it is a cross join.  Since both inputs to the nested loops join each produce a single row – they are both scalar aggregates – the result of the cross join is also a single row.  The cross join just serves to “glue” the two columns of the result into a single row.

If we have more than two distinct aggregates on different columns, we just use more than one cross join.  We can also use this type of plan if we have a mix of non-distinct and distinct aggregates.  In that case, one of the cross join inputs aggregates without the sort.

In my next post, I’ll write about aggregates with a GROUP BY clause.