Why does this query consume so much CPU?

Recently I worked with a customer who reported a slow-running query.  Let me simplify it to illustrate the problem.

There are two tables, t1 (pk int identity primary key, c1 int, c2 int, c3 int, c4 varchar(50)) and t2 (pk int identity primary key, c1 int, c2 int, c3 int, c4 varchar(50)).  Each table has about 10,000 rows.

But the following query is very slow. 

select  COUNT (*)  from t1 inner join t2 on t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3 and t1.c4<>t2.c4

This query runs for over 30 seconds and consumes over 30 seconds of CPU, yet it actually returns a count of 0.  With two tables of only about 10,000 rows each, this seems unreasonable.

When investigating CPU consumption by a query, we normally look at a few things.  First, we look at how many logical reads the query has done.  Second, we look at the plan to see how many rows are processed by each operator.

But when we tracked logical reads via a Profiler trace (the Reads column), we saw very few logical reads (fewer than 60).   When we looked at the plan, the number of rows processed by each operator was not large either.   The partial execution plan is shown below.

[Partial execution plan showing the hash match join]
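Besides a Profiler trace, one way to see CPU time next to logical reads is the sys.dm_exec_query_stats DMV.  The query below is a sketch; it assumes the statement's plan is still in cache, and it helps spot exactly this pattern: high CPU with low reads.

```sql
-- A sketch: rank cached statements by CPU and show logical reads alongside,
-- to spot high-CPU / low-read queries like the one above.
select top 10
    qs.total_worker_time   as total_cpu_microseconds,
    qs.total_logical_reads,
    qs.execution_count,
    substring(st.text, (qs.statement_start_offset / 2) + 1,
        ((case qs.statement_end_offset
              when -1 then datalength(st.text)
              else qs.statement_end_offset
          end - qs.statement_start_offset) / 2) + 1) as statement_text
from sys.dm_exec_query_stats as qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) as st
order by qs.total_worker_time desc
```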

For a while, we couldn’t figure out what was going on.  Finally, upon examining the data distribution, we found the root cause.

Using a query similar to the following, we found that the majority of the rows in each table (t1 and t2) share the same (c1, c2, c3) values.

select c1, c2, c3, COUNT(*)
from t1
group by c1, c2, c3
having COUNT(*) > 1

t1:

[Duplicate counts for t1]

t2:

[Duplicate counts for t2]

If it were not for the last join condition (t1.c4<>t2.c4), the query would have produced over 100,000,000 matches, because each table has 10,000 duplicate rows.

The plan uses a hash match join.  The hash key is built on columns c1, c2, and c3 of t2, but because most rows share the same (c1, c2, c3) values, one hash bucket ends up holding most of the rows.  During the probe phase of the hash match, each of the 10,000 duplicate rows from t1 must be compared against all 10,000 duplicates in that bucket, so the join performs roughly 10,000 × 10,000 = 100,000,000 comparisons.  That is where the CPU is spent: with this many duplicates, the join effectively processes a Cartesian product, and all of that work is pure processing with almost no I/O.
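You can quantify this skew directly from the duplicate counts.  The query below is a sketch against the demo tables: for each (c1, c2, c3) key it multiplies the duplicate counts from the two sides, which approximates the number of row comparisons the probe phase must perform.

```sql
-- Estimate probe-phase work: each (c1,c2,c3) key contributes
-- (rows in t1 with that key) * (rows in t2 with that key) comparisons.
-- Cast to bigint so the product does not overflow int.
select SUM(cast(a1.cnt as bigint) * a2.cnt) as estimated_comparisons
from (select c1, c2, c3, COUNT(*) as cnt from t1 group by c1, c2, c3) as a1
inner join
     (select c1, c2, c3, COUNT(*) as cnt from t2 group by c1, c2, c3) as a2
    on a1.c1 = a2.c1 and a1.c2 = a2.c2 and a1.c3 = a2.c3
```

Against the demo data below, this returns a little over 100,000,000, matching the near-Cartesian behavior described above.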

For this particular query with a COUNT aggregate, the optimizer can sometimes apply a trick: if one of the tables has very high density (few distinct values), it may choose to aggregate that table first (called a partial aggregate) and then join the other table to get the final count.  This can help sometimes.  But a partial aggregate is not always considered the best option.  In this example, even though the majority of the data consists of duplicates, there are also many other distinct values.  This makes the optimizer believe a partial aggregate would be more costly, because it would generate many small groups.  Therefore, it did not generate a plan with a partial aggregate.
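Although the optimizer declined to do it here, you can simulate the partial-aggregate idea by hand.  The rewrite below is a sketch, not a guaranteed improvement: it collapses the duplicates on each side first (the grouping must include c4 because the inequality predicate references it), joins the much smaller grouped sets, and multiplies the duplicate counts to recover the total.

```sql
-- Hypothetical manual rewrite: pre-aggregate both sides, then join the groups.
-- ISNULL turns the no-match case into 0, matching what COUNT(*) would return.
select ISNULL(SUM(cast(a1.cnt as bigint) * a2.cnt), 0) as total_matches
from (select c1, c2, c3, c4, COUNT(*) as cnt
      from t1 group by c1, c2, c3, c4) as a1
inner join
     (select c1, c2, c3, c4, COUNT(*) as cnt
      from t2 group by c1, c2, c3, c4) as a2
    on a1.c1 = a2.c1 and a1.c2 = a2.c2 and a1.c3 = a2.c3
   and a1.c4 <> a2.c4
```

For the demo data below, each side collapses to about 1,001 groups, so the join touches far fewer rows and the hash buckets stay small.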

 

Here is a complete demo:

use tempdb
go
if object_id('t1') is not null drop table t1
go
if object_id('t2') is not null drop table t2

go
create table t1 (pk int identity primary key, c1 int, c2 int, c3 int, c4 varchar(50))
go
create table t2 (pk int identity primary key, c1 int, c2 int, c3 int, c4 varchar(50))
go

begin tran
set nocount on

declare @i int
set @i = 0
while @i< 10000
begin
    if @i < 1000
    begin
        insert into t1 (c1, c2,c3,c4)  select @i,@i,@i, ''
        insert into t2  (c1, c2,c3,c4) select @i,@i,@i, ''
    end
    insert into t1   (c1, c2,c3,c4)  select 100000,20000,30000, ''
    insert into t2   (c1, c2,c3,c4)  select 100000,20000,30000, ''
    set @i = @i + 1
end
commit tran

create index ix on t2 (c1,c2,c3,c4)

go
set statistics profile on
set statistics io on
set statistics time on
go
--this query will consume over 30 seconds of CPU
select  COUNT (*)  from t1 inner join t2 on t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3 and t1.c4<>t2.c4
go
set statistics profile off
set statistics io off
set statistics time off