Distinct Aggregation Considered Harmful


Distinct aggregation (e.g. select count(distinct key) …) is a SQL language feature that results in some very slow queries.  It’s particularly frustrating that you can take a perfectly efficient query with multiple aggregates, and make that query take forever just by adding a distinct keyword to one of the aggregates.  For instance, it often makes logical sense to compute distinct counts “along the way” while you are computing other aggregates.  Recently, I’ve seen a number of customer queries like this:


select
    sum(salary),
    max(salary),
    count(employeeid),
    count(distinct employeeid),
    count(distinct salarygrade)


In practice, however, SQL Server 2008 does not compute these distinct aggregates “along the way.”  Mixing one or more distinct aggregates with non-distinct aggregates in the same select list, or mixing two or more distinct aggregates, leads to spooling and rereading of intermediate results – which can be more expensive than computing the distinct aggregates in separate queries!  In the worst case, such as the fragment of code above, some rereading is unavoidable: the SQL Server 2008 query processor cannot compute aggregates on a stream without consuming it.  So computing the distinct aggregate consumes the stream, which then has to be recomputed for the other aggregates.
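In principle, all five aggregates above are computable in a single pass if the operator keeps hash-table state for the distinct columns – which is exactly what SQL Server 2008’s stream aggregate does not do.  A minimal Python sketch of that one-pass idea, over made-up employee rows:

```python
# Hypothetical employee rows: (employeeid, salary, salarygrade) -- made-up data.
rows = [
    (1, 50000, "A"), (2, 60000, "B"), (3, 60000, "B"),
    (4, 70000, "C"), (1, 50000, "A"),  # a duplicate employee row
]

# One streaming pass: scalar accumulators for the plain aggregates,
# hash sets as the extra state needed by the distinct aggregates.
total = 0
highest = None
count_ids = 0
seen_ids = set()
seen_grades = set()

for employeeid, salary, salarygrade in rows:
    total += salary                       # sum(salary)
    highest = salary if highest is None else max(highest, salary)  # max(salary)
    count_ids += 1                        # count(employeeid)
    seen_ids.add(employeeid)              # count(distinct employeeid)
    seen_grades.add(salarygrade)          # count(distinct salarygrade)

print(total, highest, count_ids, len(seen_ids), len(seen_grades))
# -> 290000 70000 5 4 3
```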


Fortunately, for select lists with only one distinct aggregate, you can rewrite the input SQL in a way that does not have this problem.  SQL Server 2008’s optimizer does not do this rewrite for you.  (Disclaimers:  all the aggregates have to be algebraic [Gray et al 1996], and no guarantees are made that the behavior of the rewrite is exactly the same, particularly when there are arithmetic overflows.)


Getting Rid of Distinct Aggregates


The code below shows an example rewrite using the AdventureWorksDW database distributed with SQL Server 2008.  The rewrite gives me a 5x speedup on my desktop, even on this small database!  What’s going on is that we are breaking the aggregation into two aggregation steps.  First, we build an intermediate result called PartialSums: we group together all the values for the distinct aggregate (adding the distinct aggregate’s column to the GROUP BY list), while building partial aggregations for all the non-distinct aggregates (in this example, just count(*)).  Then, in the second step, we use the original GROUP BY list (which in this example is empty) and collect the partial aggregates together.  Note how the aggregate functions used depend on the initial aggregates: count becomes count, followed by sum.  Also note that in sum(1), the 1 is a constant numeric value, not a column reference.
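The two aggregation steps can be mimicked outside SQL.  A small Python sketch, using hypothetical ProductKey values, of how count(*) decomposes into per-group partial counts followed by a sum, while count(distinct ProductKey) becomes sum(1) over the groups:

```python
from collections import Counter

# Made-up fact rows; each entry is a ProductKey value.
product_keys = [10, 10, 20, 20, 20, 30]

# Step 1 (PartialSums): group by the distinct aggregate's column and
# keep a partial count(*) per group.
partial_counts = Counter(product_keys)           # {10: 2, 20: 3, 30: 1}

# Step 2: re-aggregate the partials with the original (empty) GROUP BY.
count_star = sum(partial_counts.values())        # count -> count, then sum
count_distinct = sum(1 for _ in partial_counts)  # count(distinct) -> sum(1) per group

print(count_star, count_distinct)  # -> 6 3
```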


As far as I can tell, this rewrite is never worse than the spooled plan.  It parallelizes better, uses tempdb better, and does less logical I/O.


set statistics profile on
set statistics time on
go

use AdventureWorksDW
go



-- Use a distinct aggregate and a normal aggregate in the same select list
-- over some complex (one or more joins) derived table

with FISinFRS as (
    select * from FactInternetSales FIS
    where ProductKey in
        (select ProductKey from FactResellerSales)
)
select
    count(*)                   as CountStar,
    count(distinct ProductKey) as CountProductKeys
from FISinFRS
go


 



-- Now use partial aggregations in a new derived table
-- This is equivalent, but SQL Server 2008 does not do this transformation
-- for you

with FISinFRS as (
    select * from FactInternetSales FIS
    where ProductKey in
        (select ProductKey from FactResellerSales)
)
, PartialSums as (
    select
        count(*)                   as CountStarPartialCount
    from FISinFRS
    group by ProductKey
)
select
    sum(CountStarPartialCount)     as CountStar,
    sum(1)                         as CountProductKeys
from PartialSums
go
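If you want to check the equivalence of the two shapes without AdventureWorksDW, a small sqlite3 script (with a made-up single-table stand-in for the derived table) can run both and compare:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE FactInternetSales (ProductKey INT, amount INT);
    INSERT INTO FactInternetSales VALUES (10, 5), (10, 7), (20, 1), (30, 2), (30, 2);
""")

# Original shape: distinct and non-distinct aggregates in one select list.
original = con.execute(
    "SELECT count(*), count(DISTINCT ProductKey) FROM FactInternetSales"
).fetchone()

# Rewritten shape: pre-aggregate per ProductKey, then roll the partials up.
rewritten = con.execute("""
    WITH PartialSums AS (
        SELECT count(*) AS CountStarPartialCount
        FROM FactInternetSales
        GROUP BY ProductKey
    )
    SELECT sum(CountStarPartialCount), sum(1) FROM PartialSums
""").fetchone()

print(original, rewritten)  # -> (5, 3) (5, 3)
assert original == rewritten
```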


 


References


[Gray et al 1996] Jim Gray et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals” (1996), ftp://ftp.research.microsoft.com/pub/tr/tr-95-22.pdf.


 


 


By Marc Friedman, 9/2008


Marc.friedman@donotreply.microsoft.com


 


 

Comments (5)

  1. Hi Conor,

    Thanks for the interesting post.  I’m curious as to whether this transform is doable with multiple distinct aggregations, as you mention at the start of the post.

    For example:

    CREATE TABLE sales
    (
        SalesPersonId INT,
        CustomerId INT,
        ProductId INT
    )

    INSERT sales
    SELECT 1, 1, 1 UNION ALL
    SELECT 1, 2, 1 UNION ALL
    SELECT 2, 2, 1 UNION ALL
    SELECT 2, 2, 2 UNION ALL
    SELECT 3, 1, 1 UNION ALL
    SELECT 3, 2, 2

    SELECT
        SalesPersonId,
        COUNT(DISTINCT CustomerId),
        COUNT(DISTINCT ProductId)
    FROM sales
    GROUP BY
        SalesPersonId

    … in this particular scenario (and most scenarios I can think of where I have used multiple distinct aggregations), customers and products have an m:n relationship, so if you use customers to do the first partial count you get an invalid number of products, and if you use products you get an invalid number of customers.  Is there some way to re-write the query without distinct aggregation and still get the same result, without doing two queries (which kind of defeats the purpose)?

  2. Apologies, Marc Friedman, for addressing my previous comment to the wrong person.  For some reason I thought this was Conor Cunningham’s blog.

  3. sqlqp@live.com says:

    If there are multiple distinct aggregates (say on columns A and B), there are a couple of possibilities.  If they are unrelated columns, SQL Server won’t aggregate on both in one pass, though this is quite possible to achieve with hash aggregation.  On the other hand, if they are somewhat correlated (i.e. common recurring pairs), then it might make sense to pre-aggregate on (A, B).  And if there is a functional dependency between A and B, then you can use the above rewrite.

    — Marc Friedman

  4. 196 Microsoft Team blogs searched, 97 blogs have new articles in the past 7 days. 218 new articles found…

  5. kphani_prasad@hotmail.com says:

    How about using DENSE_RANK()?

    SELECT DENSE_RANK() OVER (ORDER BY a.col1, a.col2) AS Ranky, a.col1, a.col2
    FROM Emp a
    WHERE …
    GROUP BY a.col1, a.col2

    -- the above query style gives distinct rows.