Subqueries in BETWEEN and CASE Statements

Consider the following query:

CREATE TABLE T1 (A INT, B1 INT, B2 INT)
CREATE TABLE T2 (A INT, B INT)

SELECT *
FROM T1
WHERE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) BETWEEN T1.B1 AND T1.B2

Observe that the subquery in this query only needs to be evaluated once for each row of T1.  Indeed running on SQL Server 2000, we get the following plan (with the portion of the plan corresponding to the subquery in bold):

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1004]))
**       |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1014]=0) then NULL else [Expr1015]))
       |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1014]=COUNT_BIG([T2].[B]), [Expr1015]=SUM([T2].[B])))
       |         |--Sort(ORDER BY:([T2].[A] ASC))
       |              |--Table Scan(OBJECT:([T2]))
**       |--Filter(WHERE:([Expr1004]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1004]))
                 |--Table Scan(OBJECT:([T1]))

Now let's look at the query plan we get (with SQL Server 2005 or SQL Server 2008):

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1008], [Expr1014]))
       |--Merge Join(Inner Join, MERGE:([T2].[A])=([T2].[A]), RESIDUAL:([T2].[A]=[T2].[A]))
**       |    |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1026]=(0) THEN NULL ELSE [Expr1027] END))
       |    |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1026]=COUNT_BIG([T2].[B]), [Expr1027]=SUM([T2].[B])))
       |    |         |--Sort(ORDER BY:([T2].[A] ASC))
       |    |              |--Table Scan(OBJECT:([T2]))
       |    |--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1028]=(0) THEN NULL ELSE [Expr1029] END))
       |         |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1028]=COUNT_BIG([T2].[B]), [Expr1029]=SUM([T2].[B])))
       |              |--Sort(ORDER BY:([T2].[A] ASC))
       |                   |--Table Scan(OBJECT:([T2]))
**       |--Filter(WHERE:([Expr1014]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1008]))
                 |--Table Scan(OBJECT:([T1]))

Notice that the subquery is actually evaluated twice.  There are two scans of T2, two sorts, and two stream aggregates.  I've highlighted both sets of operators in bold.

Why is SQL Server evaluating the subquery twice?  The answer is that SQL Server transforms "X BETWEEEN Y AND Z" into "X <= Y AND X >= Z".  If as in this query, X is a subquery, the subquery is repeated:

SELECT *
FROM T1
WHERE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) >= T1.B1 AND
    (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) <= T1.B2

This transformation occurs very early in the processing of the query.  Unfortunately, after that point, SQL Server 2005 and SQL Server 2008 do not realize that the subquery is a common subexpression and they evaluate it as if there were two completely different subqueries.

You may also have noticed that, instead of scanning T1 first and then evaluating the subquery for each row of T1, this plan actually scans T2 first.  This join order results from the optimizer decorrelating the subqueries.

So, how can we get SQL Server to evaluate the subquery only once?  There are actually multiple solutions and all involve rewriting the query to calculate the subquery separately from the BETWEEN clause so that when SQL Server transforms the BETWEEN clause, it does not also duplicate the subquery.  Here are some examples:

SELECT Q.A, Q.B1, Q.B2
FROM
    (
    SELECT *, (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) SUM_B
    FROM T1
    ) Q
WHERE SUM_B BETWEEN Q.B1 AND Q.B2

SELECT T1.*
FROM T1 CROSS APPLY (SELECT SUM(T2.B) SUM_B FROM T2 WHERE T2.A = T1.A) Q
WHERE Q.SUM_B BETWEEN T1.B1 AND T1.B2

SELECT T1.*
FROM T1, (SELECT T2.A, SUM(T2.B) SUM_B FROM T2 GROUP BY T2.A) Q
WHERE T1.A = Q.A AND Q.SUM_B BETWEEN T1.B1 AND T1.B2

All three of these rewrites produce the same plan:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[A], [Expr1008]))
**       |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1016]=(0) THEN NULL ELSE [Expr1017] END))
       |    |--Stream Aggregate(GROUP BY:([T2].[A]) DEFINE:([Expr1016]=COUNT_BIG([T2].[B]), [Expr1017]=SUM([T2].[B])))
       |         |--Sort(ORDER BY:([T2].[A] ASC))
       |              |--Table Scan(OBJECT:([T2]))
**       |--Filter(WHERE:([Expr1008]<=[T1].[B2]))
            |--Index Spool(SEEK:([T1].[A]=[T2].[A] AND [T1].[B1] <= [Expr1008]))
                 |--Table Scan(OBJECT:([T1]))

SQL Server also transforms a CASE statement of the form:

CASE X
    WHEN Y1 THEN Z1
    WHEN Y2 THEN Z2
    ...
    ELSE ZN
END

Into:

CASE
    WHEN X = Y1 THEN Z1
    WHEN X = Y2 THEN Z2
    ...
    ELSE ZN
END

Thus, CASE statements can yield the same problematic behavior if X is a subquery.  Unfortunately, with a CASE statement, the number of times that the subquery is reevaluated depends on the number WHEN clauses and can be quite large.  Here is a simple query that illustrates the problem:

SELECT *,
    CASE (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A)
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1

Here is the SQL Server 2000 plan:

  |--Compute Scalar(DEFINE:([Expr1007]=If ([Expr1004]=[T1].[B1]) then 'B1' else If ([Expr1004]=[T1].[B2]) then 'B2' else NULL))
       |--Nested Loops(Left Outer Join, OUTER REFERENCES:([T1].[A]))
            |--Table Scan(OBJECT:([T1]))
            |--Hash Match(Cache, HASH:([T1].[A]), RESIDUAL:([T1].[A]=[T1].[A]))
**                 |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1020]=0) then NULL else [Expr1021]))
                      |--Stream Aggregate(DEFINE:([Expr1020]=COUNT_BIG([T2].[B]), [Expr1021]=SUM([T2].[B])))
                           |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))**

And here is the SQL Server 2005 and SQL Server 2008 plan:

  |--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [Expr1008]=[T1].[B1] THEN 'B1' ELSE CASE WHEN [Expr1014]=[T1].[B2] THEN 'B2' ELSE NULL END END))
       |--Nested Loops(Inner Join, PASSTHRU:([Expr1008]=[T1].[B1]), OUTER REFERENCES:([T1].[A]))
            |--Nested Loops(Left Outer Join, OUTER REFERENCES:([T1].[A]))
            |    |--Table Scan(OBJECT:([T1]))
**            |    |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1030]=(0) THEN NULL ELSE [Expr1031] END))
            |         |--Stream Aggregate(DEFINE:([Expr1030]=COUNT_BIG([T2].[B]), [Expr1031]=SUM([T2].[B])))
            |              |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))
            |--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1032]=(0) THEN NULL ELSE [Expr1033] END))
                 |--Stream Aggregate(DEFINE:([Expr1032]=COUNT_BIG([T2].[B]), [Expr1033]=SUM([T2].[B])))
                      |--Table Scan(OBJECT:([T2]), WHERE:([T2].[A]=[T1].[A]))**

Finally, the same solutions once again apply:

SELECT Q.A, Q.B1, Q.B2,
    CASE Q.SUM_B
        WHEN Q.B1 THEN 'B1'
        WHEN Q.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM
    (
    SELECT *, (SELECT SUM(T2.B) FROM T2 WHERE T2.A = T1.A) SUM_B
    FROM T1
    ) Q

SELECT T1.*,
    CASE Q.SUM_B
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1 CROSS APPLY (SELECT SUM(T2.B) SUM_B FROM T2 WHERE T2.A = T1.A) Q

SELECT T1.*,
    CASE Q.SUM_B
        WHEN T1.B1 THEN 'B1'
        WHEN T1.B2 THEN 'B2'
        ELSE NULL
    END CASE_B
FROM T1, (SELECT T2.A, SUM(T2.B) SUM_B FROM T2 GROUP BY T2.A) Q
WHERE T1.A = Q.A