SELECT from a view slower than "equivalent" SELECT from the base table

Sometime we get complaints that a query is slower than it could be because a filter isn’t pushed very deeply down into a plan. For example, consider this hypothetical poor performance scenario (my apologies in advance for the lack of normalization):

USE tempdb

IF OBJECT_ID ('Sales') IS NOT NULL DROP TABLE Sales

IF OBJECT_ID ('SalesSummary') IS NOT NULL DROP VIEW SalesSummary

CREATE TABLE Sales (

  SalesPerson varchar(30) NOT NULL,

  SalesAmount money NOT NULL, Comments char(200))

INSERT INTO Sales VALUES ('Green ', 19011.87, '')

INSERT INTO Sales VALUES ('Green', 2478.42, '')

INSERT INTO Sales VALUES ('Green ', 1975.11, '')

INSERT INTO Sales VALUES ('White', 3007.01, '')

INSERT INTO Sales VALUES ('White', 5312.44, '')

INSERT INTO Sales VALUES ('Brown', 843.20, '')

CREATE INDEX idx ON Sales (SalesPerson, SalesAmount)

GO

CREATE VIEW SalesSummary AS

SELECT SalesPerson, SUM (SalesAmount) AS TotalSales

FROM Sales

GROUP BY SalesPerson

GO

SET STATISTICS PROFILE ON

GO

-- This query uses an index seek to retrieve only the rows

-- where SalesPerson = 'Green'

SELECT *

FROM SalesSummary

WHERE SalesPerson = 'Green'

GO

-- This query uses an index scan, then filters the rows later.

SELECT *

FROM SalesSummary

WHERE SalesPerson LIKE 'Green'

GO

SET STATISTICS PROFILE OFF

GO

The first query filters on “SalesPerson = 'Green'”. The second filters on “SalesPerson LIKE 'Green'”. Here’s the first query’s plan:

  |--Stream Aggregate(DEFINE:([Expr1004]=SUM([Sales].[SalesAmount]), [Sales].[SalesPerson]=ANY([Sales].[SalesPerson])))

       |--Index Seek(OBJECT:([Sales].[idx]), SEEK:([Sales].[SalesPerson]='Green') ORDERED FORWARD)

You can see that the first thing the plan does is a very efficient index seek to narrow the set of rows down to those that pass the “SalesPerson = ‘Green’” filter. Then a Stream Aggregate operator computes the SUM(SalesAmount) expression for each SalesPerson returned by the index seek.

The second plan, though, scans every row in the table and computes the SUM for every SalesPerson. Only after it has scanned and aggregated every row does it filter out those values that don’t survive the LIKE predicate.

  |--Filter(WHERE:([Sales].[SalesPerson] like 'Green'))

       |--Stream Aggregate(GROUP BY:([Sales].[SalesPerson]) DEFINE:([Expr1004]=SUM([Sales].[SalesAmount])))

            |--Index Scan(OBJECT:([Sales].[idx]), ORDERED FORWARD)

Here’s a general rule off thumb you can follow when looking for tuning opportunities in a query plan: for best performance, you usually want to push the most selective predicates as deeply as possible into the plan. If you do the most selective operation first, the remaining operators have fewer rows to process, and that means faster overall query execution. Based on this rule of thumb, the first plan here is clearly preferable from a performance perspective. So what gives? Why does the use of “LIKE” instead of “=” make SQL refuse to push the filter down? Even more baffling, why do you get the more efficient plan when you bypass the view and select directly from the table, even if you use "LIKE"? Run this and you’ll see what I mean:

SELECT SalesPerson, SUM (SalesAmount) AS TotalSales

FROM Sales

WHERE SalesPerson LIKE 'Green'

GROUP BY SalesPerson

This uses an efficient index seek-based plan, just like the first query.

There are several things going on here:

  • Views must behave like a table

  • The “LIKE” and “=” operators use subtly different rules for string matching

  • GROUP BY uses the same string comparison rules as “=” for the purposes of determining which rows end up in the same bucket

What I mean by “views must behave like a table” is that the output of a select from a view must be the same as what you could get by materializing the view (e.g. selecting it into a temp table), then querying the materialized view. In this case, the GROUP BY in the view could return a different total sales amount for a SalesPerson if SQL chose a plan that pushed a LIKE predicate below the Stream Aggregate. Here’s proof:

SELECT *

FROM SalesSummary

WHERE SalesPerson LIKE 'Green '

-- Query 1 output:

SalesPerson TotalSales

------------------------------ ---------------------

Green 23465.4000

SELECT SalesPerson, SUM (SalesAmount) AS TotalSales

FROM Sales

WHERE SalesPerson LIKE 'Green '

GROUP BY SalesPerson

-- Query 2 output:

SalesPerson TotalSales

------------------------------ ---------------------

Green 20986.9800

The first query selects from the view, while the second moves the view logic into the query and selects directly from the base table. Other than that, they are identical, yet the SUM(SalesAmount) calculation is different. Recall that I mentioned that “=” and “LIKE” have different string comparison semantics. In particular, LIKE considers trailing blanks in the right-hand operand to be significant, while the “=” operator ignores trailing blanks. For the set of three rows with group ID “Green”, only two will qualify for the filter “WHERE SalesPerson LIKE ‘Green ‘” because they have trailing blanks. The third “Green” row doesn’t have any trailing blanks and won’t survive the LIKE filter. When you push this LIKE filter below the aggregate, you end up SUMming a different set of rows. That’s not allowed if the GROUP BY is part of a view; if a WHERE clause applied to a view can change a property of a row instead of just filtering it out, it would mean that the view didn’t behave like a materialized table. Put another way, a filter on a view that includes a GROUP BY is only allowed to eliminate entire groups; it's not legal for it to eliminate some base rows in a group but retain others, changing the group's membership.  It’s therefore by design that the slower scan-based plan is selected for the select from the view with LIKE. In contrast, the query from the view with the "SalesPerson='Green'" filter can be pushed because GROUP BY uses the same string comparison rules as the "=" operator. It's safe for the QO to assume that pushing the "=" filter below the Stream Aggregate will not change the view's semantics. 

This isn’t just about trailing blanks – you can see the attached script for a couple of examples that demonstrate the exact same thing (pushing LIKE below an aggregate changes the output of the aggregate) for a couple of interesting non-blank characters. And it isn’t only about “LIKE” vs. “=”, either; this is just the example that was close at hand when I wrote this (we have a case open for this scenario right now).

Finally, be aware that derived tables (and CTEs) also provide the same guarantee. For example, note that this query selects from the base table but also does a full scan followed by filter, just like the above select from the view:

 

SELECT *

FROM (

   SELECT SalesPerson, SUM (SalesAmount) AS TotalSales

   FROM Sales

   GROUP BY SalesPerson) AS t

WHERE SalesPerson LIKE 'Green'

So, to net out all of this: Generally you want your filters to be pushed deep into the query plan -- as deeply as possible. But when you’re selecting from a view, there will be some limits to what can be pushed. Some filters can’t be pushed beneath parts of the view without changing the view’s semantics, and that would break a contract that SQL is required to maintain.

 

 

UPDATE (2 March 2009): Fabiano Amorim pointed out that the Query #1 and Query #2 use the same plan on SQL 2008.  He's right; a new performance optimization causes the LIKE predicate to be pushed below the GROUP BY's aggregate operator.  I think this is actually a bug -- it does result in a faster plan, but it breaks the "views behave like tables" rule that SQL follows in all other cases.  The general rule stands: not all predicates can be pushed below a view's GROUP BY, even in SQL 2008 with this fairly aggressive performance optimization.  (And don't be surprised if this optimization gets removed in a future release. ;)

 

like_views_nopush.sql