Distinct Aggregation Considered Harmful

Distinct aggregation (e.g. select count(distinct key) …) is a SQL language feature that results in some very slow queries. It’s particularly frustrating that you can take a perfectly efficient query with multiple aggregates, and make that query take forever just by adding a distinct keyword to one of the aggregates. For instance, it often makes logical sense to compute distinct counts “along the way” while you are computing other aggregates. Recently, I've seen a number of customer queries like this:

select

sum(salary),

max(salary),

count(employeeid),

count(distinct employeeid),

count(distinct salarygrade)

In practice, however, SQL Server 2008 does not compute these distinct aggregates “along the way.” Mixing one or more distinct aggregates with non-distinct aggregates in the same select list, or mixing two or more distinct aggregates, leads to spooling and rereading of intermediate results – which can be more expensive than computing the distinct aggregates in a separate query! In the worst case, such as the fragment of code above, this is sort of inevitable. The SQL Server 2008 query processor cannot compute aggregates on a stream without destroying the stream. So computing the distinct aggregate consumes the stream, which has to be recomputed for the other aggregates.

Fortunately, for select lists with only one distinct aggregate, you can rewrite the input SQL in a way that does not have this problem. SQL Server 2008’s optimizer does not do this rewrite for you. (Disclaimers: all the aggregates have to be algebraic [Gray et al 1996], and no guarantees are made that the behavior of the rewrite is exactly the same, particularly when there are arithmetic overflows.)

Getting Rid of Distinct Aggregates

The code below shows an example rewrite using the AdventureWorksDW database distributed with SQL Server 2008. The rewrite gives me 5x speedup on my desktop even on this small database! What’s going on is that we are breaking the aggregation into two aggregation steps. First, we build an intermediate result called PartialSums. We group together all the values for the distinct aggregate (adding the distinct aggregate’s column to the GROUP BY list), while building partial aggregations for all the non-distinct aggregates (in this example, just count(*)). Then in the second step, we use the original GROUP BY list, which in this example is empty, and collect the partial aggregates together. Note how the aggregate functions used depend on the initial aggregates: count becomes count, followed by sum. Note that in sum(1), the 1 is actually a constant number value, not a column reference.

As far as I can tell, this rewrite is never worse than the spooled plan. It parallelizes better, uses tempdb better, and does less logical I/O.

set statistics profile on

set statistics time on

go

use adventureworksdw

go

--

-- Use a distinct aggregate and a normal aggregate in the same select list

-- over some complex (one or more joins) derived table

--

with FISinFRS as (

 select * from factinternetsales FIS

  where ProductKey in

   (select ProductKey from FactResellerSales)

 

)

select

      count(*) as CountStar,

      count(distinct ProductKey) as CountProductKeys

from FISinFRS

go

--

-- Now use partial aggregations in a new derived table

-- This is equivalent, but SQL Server 2008 does not do this transformation

-- for you

--

with FISinFRS as (

 select * from factinternetsales FIS

  where ProductKey in

   (select ProductKey from FactResellerSales)

 

)

, PartialSums as (

  select

      count(*) as CountStarPartialCount

  from FISinFRS

  group by ProductKey

)

select

      sum(CountStarPartialCount) as CountStar,

      sum(1) as CountProductKeys

from PartialSums

 

References

[Gray et al 1996] Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals (1996), ftp://ftp.research.microsoft.com/pub/tr/tr-95-22.pdf.

 

 

By Marc Friedman, 9/2008

Marc.friedman@donotreply.microsoft.com