Scalar Subqueries

A scalar subquery is a subquery that returns a single row.  Some scalar subqueries are obvious.  For instance:

create table T1 (a int, b int)

create table T2 (a int, b int)

select *

from T1

where T1.a > (select max(T2.a) from T2 where T2.b < T1.b)

The subquery in this example uses an aggregate that ensures that it produces only a single row on each execution.  (Of course, due to the correlation, this single row may be different for each for of T1.)  Here is the query plan:

|--Filter(WHERE:([T1].[a]>[Expr1008]))
|--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[b]))
|--Table Scan(OBJECT:([T1]))
|--Stream Aggregate(DEFINE:([Expr1008]=MAX([T2].[a])))
|--Table Scan(OBJECT:([T2]), WHERE:([T2].[b]<[T1].[b]))

As you might expect, this plan works by scanning each row of T1 and with the help of the nested loops join evaluating the subquery once for each row.  Since the subquery always produces exactly one row, the number of rows output of the join is precisely equal to the number of rows in T1.  Finally, we use the filter to evaluate the WHERE clause of the original query ([Expr1008] is the result of the subquery) and to determine whether to return each row.

Assert

So, what happens if we use a subquery that is not guaranteed to return only a single row in a context where we expect a single row?  For instance, suppose we remove the aggregate from the above query:

select *

from T1

where T1.a = (select T2.a from T2 where T2.b = T1.b)

Without the aggregate, there is no longer any guarantee that the subquery produces only a single row on each execution.  If the subquery returns more than one row for a given row of T1, we must raise an error:

Msg 512, Level 16, State 1, Line 1
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

How does SQL Server detect this error?  Here is the query plan:

|--Filter(WHERE:([T1].[a]=[Expr1010]))
|--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[b]))
|--Table Scan(OBJECT:([T1]))
|--Assert(WHERE:(CASE WHEN [Expr1009]>(1) THEN (0) ELSE NULL END))
|--Stream Aggregate(DEFINE:([Expr1009]=Count(*), [Expr1010]=ANY([T2].[a])))
|--Table Scan(OBJECT:([T2]), WHERE:([T2].[b]=[T1].[b]))

This is essentially the same query plan from the first query except that the optimizer adds a count(*) to confirm that the subquery really does return only one row.  The assert operator enforces this requirement.  If it finds that the subquery returned more than one row (i.e., if [Expr1009] > 1), it raises the above error.  Note that we use the assert operator to check many other conditions such as constraints (check, referential integrity, etc.), the maximum recursion level for common table expressions (CTEs), warnings for duplicate key insertions to indexes built with the IGNORE_DUP_KEY option, and more.

The ANY aggregate is a special internal only aggregate that, as its name suggests, returns any row.  Since we raise an error if the scan of T2 returns more than one row, ANY is effectively a no-op.  We could just as easily use the MIN or MAX aggregates and get the same result.  Some aggregate is necessary since the stream aggregate expects each output column either to be aggregated or in the group by clause (which we do not have in this case).  This is the same reason that the following query does not compile:

select count(*), T2.a from T2

Msg 8120, Level 16, State 1, Line 1
Column 'T2.a' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Performance

The assert operator is not expensive per se, but it does limit the set of transformations available to the optimizer.  In particular, the optimizer must use a nested loops join and cannot change the join order.  In many cases, creating a unique index or rewriting the query eliminates the assert operator and improves the query plan.  For instance:

create unique clustered index T2b on T2(b)

select *

from T1

where T1.a = (select T2.a from T2 where T2.b = T1.b)

With the unique index on T2.b, the optimizer can be sure that each execution of the subquery returns exactly one row.  The new plan is:

|--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[a], [T1].[b]))
|--Table Scan(OBJECT:([T1]))
|--Clustered Index Seek(OBJECT:([T2].[T2b]), SEEK:([T2].[b]=[T1].[b]),  WHERE:([T1].[a]=[T2].[a]) ORDERED FORWARD)

Not only is the plan simpler with the elimination of the stream aggregate and assert operators, but it is now a regular join which means that we can change the join order and type.  For instance:

select *

from T1

where T1.a = (select T2.a from T2 where T2.b = T1.b)

option(merge join)

|--Merge Join(Inner Join, MERGE:([T2].[b], [T2].[a])=([T1].[b], [T1].[a]), RESIDUAL:([T2].[a]=[T1].[a] AND [T1].[b]=[T2].[b]))
|--Clustered Index Scan(OBJECT:([T2].[T2b]), ORDERED FORWARD)
|--Sort(ORDER BY:([T1].[b] ASC, [T1].[a] ASC))
|--Table Scan(OBJECT:([T1]))

Try this hint without the unique index and you get:

Msg 8622, Level 16, State 1, Line 1
Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

Note that I’m using the hint only to illustrate that other plans are possible and that the optimizer has more and possibly better choices once we create the index.  I’m not suggesting that you should use the hint in lieu of allowing the optimizer to do its job and pick a good plan.

If you cannot or do not wish to create the index, you might also consider rewriting this query.  For instance, you might consider the following query:

drop index T2.T2b

select *

from T1

where T1.a in (select T2.a from T2 where T2.b = T1.b)

All I’ve done here is change from a scalar subquery to an IN subquery.  We no longer have a limitation of a single row.  You might also use an EXISTS subquery:

select *

from T1

where exists (select T2.a from T2 where T2.a = T1.a and T2.b = T1.b)

Both rewrites are equivalent.  They differ from the original query only in that the query does not fail if the subquery returns more than one row.  Here is the query plan for either rewrite:

|--Nested Loops(Left Semi Join, WHERE:([T2].[b]=[T1].[b] AND [T1].[a]=[T2].[a]))
|--Table Scan(OBJECT:([T1]))
|--Table Scan(OBJECT:([T2]))

Note that the inner join is now a semi-join since each row of T1 could match more than one row of T2.  Again, we can use a query hint to demonstrate a different join type:

select *

from T1

where T1.a in (select T2.a from T2 where T2.b = T1.b)

option(hash join)

|--Hash Match(Left Semi Join, HASH:([T1].[b], [T1].[a])=([T2].[b], [T2].[a]), RESIDUAL:([T2].[b]=[T1].[b] AND [T1].[a]=[T2].[a]))
|--Table Scan(OBJECT:([T1]))
|--Table Scan(OBJECT:([T2]))

So, be careful with scalar subqueries.  If you notice performance issues with a scalar subquery, consider whether you can use a unique index or a simple rewrite to get a better plan.

Tags

1. Craig Freedman's WebLog says: