Subqueries: ANDs and ORs

Article
08/30/2006

In my “Introduction to Joins” post, I gave an example of how we can use a semi-join to evaluate an EXISTS subquery. Just to recap, here is another example:

create table T1 (a int, b int)

create table T2 (a int)

create table T3 (a int)

select * from T1

where exists (select * from T2 where T2.a = T1.a)

|--Nested Loops(Left Semi Join, WHERE:([T2].[a]=[T1].[a]))
|--Table Scan(OBJECT:([T1]))
|--Table Scan(OBJECT:([T2]))

This plan returns those rows of T1 for which we find a matching row in T2.

Similarly, we can use an anti-semi-join to evaluate a NOT EXISTS subquery:

select * from T1

where not exists (select * from T2 where T2.a = T1.a)

|--Nested Loops(Left Anti Semi Join, WHERE:([T2].[a]=[T1].[a]))
|--Table Scan(OBJECT:([T1]))
|--Table Scan(OBJECT:([T2]))

This plan returns those rows of T1 for which we do not find a matching row in T2.

Note that SQL Server can use any join type (nested loops, merge, or hash) for each of these queries. All three join types support semi-joins.

Now, I’m going to take a look at what happens if you AND or OR together multiple EXISTS or NOT EXISTS subqueries. (Or, in mathematical terms how SQL Server handles conjuctions and disjunctions of subqueries.)

Conjunctions (ANDs)

Conjunctions are fairly simple. We simply combine multiple semi-joins one after the other in a sequence. This technique works for any combination of EXISTS and/or NOT EXISTS subqueries. For example:

select * from T1

where exists (select * from T2 where T2.a = T1.a) and

not exists (select * from T3 where T3.a = T1.b)

This plan returns rows of T1 for which we find a matching row in T2 and for which we do not find a matching row in T3. Again, SQL Server can use any join type for each of these semi-joins.

Disjunctions (ORs)

Disjunctions get more interesting. We cannot simply combine multiple semi-joins as we do for the conjunction case. Consider this query:

select * from T1

where exists (select * from T2 where T2.a = T1.a) or

exists (select * from T3 where T3.a = T1.b)

We must return a row of T1 if it satisfies either of the subqueries. We cannot discard a row unless if fails to satisfy both subqueries. There are two ways that the optimizer can transform this query that both suggest a way to execute it.

Option 1:

select * from T1

where exists

(

select * from T2 where T2.a = T1.a

union all

select * from T3 where T3.a = T1.b

)

This query combines the results of the two subqueries before performing the semi-join. For each row of T1, if either subquery produces a result, the semi-join will return the row of T1. Here is the query plan (generated from the original query not the union all rewrite):

|--Nested Loops(Left Semi Join, OUTER REFERENCES:([T1].[a], [T1].[b]))
|--Table Scan(OBJECT:([T1]))
|--Concatenation
|--Table Scan(OBJECT:([T2]), WHERE:([T2].[a]=[T1].[a]))
|--Table Scan(OBJECT:([T3]), WHERE:([T3].[a]=[T1].[b]))

Recall that the concatenation operator implements the union all.

Option 2:

There is no SQL syntax that precisely captures the second option, but this is close:

select * from T1

where exists (select * from T2 where T2.a = T1.a)

union

select * from T1

where exists (select * from T3 where T3.a = T1.b)

This query evaluates the two subqueries as two entirely separate joins and combines the results. However, a single row of T1 could qualify for both subqueries so we could get duplicates. Thus, we must also eliminate any duplicates. This is why I am using “union” rather than “union all.” However, as I mentioned above, this is not exactly correct because there could also be duplicate values in T1 itself. These duplicates should not be eliminated but will be. (Of course if T1 has a unique key and contains no duplicates, the rewrite is correct. If T1 does not have a unique key, we could manufacture one using the ROW_NUMBER function.) Luckily, whether of not we have an explicit unique key, SQL Server creates a unique relational key for every table which we can use internally to implement this rewrite. Here is an example of this query plan using hash join (generated from the original query using an “option(hash join)” hint):

|--Sort(DISTINCT ORDER BY:([Bmk1000] ASC))
|--Concatenation
|--Hash Match(Right Semi Join, HASH:([T2].[a])=([T1].[a]), RESIDUAL:([T2].[a]=[T1].[a]))
| |--Table Scan(OBJECT:([T2]))
| |--Table Scan(OBJECT:([T1]))
|--Hash Match(Right Semi Join, HASH:([T3].[a])=([T1].[b]), RESIDUAL:([T3].[a]=[T1].[b]))
|--Table Scan(OBJECT:([T3]))
|--Table Scan(OBJECT:([T1]))

[Bmk1000] is the record id for rows of T1 and is the unique relational key for this example. The sort distinct eliminates duplicates on [Bmk1000] as I described above.

Note that hash joins are right semi-joins instead of left semi-joins. Recall that right and left semi-joins are logically equivalent; the inputs are just switched.

Disjunctions and NOT EXISTS subqueries

For disjunctions with NOT EXISTS, we have the same two alternatives. We handle the NOT EXISTS subquery with a somewhat odd looking rewrite. For example, we can rewrite this query:

select * from T1

where not exists (select * from T2 where T2.a = T1.a) or

exists (select * from T3 where T3.a = T1.b)

As:

select * from T1

where exists

(

select *

from (select 0) cs(a)

where not exists (select * from T2 where T2.a = T1.a)

union all

select * from T3 where T3.a = T1.b

)

The bold portion of the above query basically inverts the subquery. If we find a matching row in T2, the NOT EXISTS clause is false and we actually return no rows from the bold subquery. If we do not find a matching row in T2, the NOT EXISTS clause is true and we do return a row from the bold query. The contents of the row are irrelevant since we throw it away in the semi-join with T1. Here is the plan:

The red portion corresponds to the bold subquery. The constant scan operator returns a single row; it corresponds to the “(select 0).”

Disjunctions and SQL Server 2000

SQL Server 2000 can implement disjunctions using the probe columns and passthru predicates which I described in my previous post. In particular, SQL Server 2000 must use these plans for disjunctions with NOT EXISTS subqueries. For example:

select * from T1

where not exists (select * from T2 where T2.a = T1.a) or

exists (select * from T3 where T3.a = T1.b)

|--Filter(WHERE:([Expr1008]=NULL OR [Expr1009]))
|--Nested Loops(Left Semi Join, WHERE:([Expr1008]=NULL)OUTER REFERENCES:([T1].[b]), DEFINE:([Expr1009] = [PROBE VALUE]))
|--Nested Loops(Left Semi Join, OUTER REFERENCES:([T1].[a]), DEFINE:([Expr1008] = [PROBE VALUE]))
| |--Table Scan(OBJECT:([T1]))
| |--Row Count Spool
| |--Table Scan(OBJECT:([T2]), WHERE:([T2].[a]=[T1].[a]))
|--Row Count Spool
|--Table Scan(OBJECT:([T3]), WHERE:([T3].[a]=[T1].[b]))

Both of the semi-joins in this plan have probe columns. Recall that semi-joins with the probe columns return every outer row whether or not it joins and the probe column is set to true if the row joined and false (NULL) if it did not. The bottommost semi-join tests the NOT EXISTS subquery – whether each T1 rows joins with a T2 row – and sets its probe column ([Expr1008]) accordingly. The topmost semi-join has a passthru predicate that tests [Expr1008]. So, if a row of T1 does not join with a row of T2, [Expr1008] is false (NULL), the NOT EXISTS subquery is true, the passthru predicate is true, and the topmost semi-join returns the row without evaluating the EXISTS subquery. On the other hand, if a row of T1 does join with a row of T2, [Expr1008] is true, the NOT EXISTS subquery is false, the passthru predicate is false, and the topmost semi-join executes the EXISTS subquery to determine whether the T1 row joins with a T3 row and sets its probe column ([Expr1009]) as appropriate. Finally, there is a filter at the top of the plan that tests both probe columns before returning each T1 row to determine whether the row satisfied at least one of the two subqueries.

The row count spool is an optimization. I will cover spools another time.

Although it is rare, it is possible to generate this same plan using merge joins instead of nested loops joins. This plan is:

StmtText	DefinedValues
\|--Filter(WHERE:([Expr1008]=NULL OR [Expr1009]))	[T1].[a], [T1].[b]
\|--Merge Join(Left Semi Join, MANY-TO-MANY MERGE:([T1].[b])=([T3].[a]), RESIDUAL:([T3].[a]=[T1].[b]) OR ([Expr1008]=NULL))	[T1].[a], [T1].[b], [Expr1008], [Expr1009]
\|--Sort(ORDER BY:([T1].[b] ASC))	[T1].[a], [T1].[b], [Expr1008]
\| \|--Merge Join(Left Semi Join, MANY-TO-MANY MERGE:([T1].[a])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[a]))	[T1].[a], [T1].[b], [Expr1008]
\| \|--Sort(ORDER BY:([T1].[a] ASC))	[T1].[a], [T1].[b]
\| \| \|--Table Scan(OBJECT:([T1]))	[T1].[a], [T1].[b]
\| \|--Sort(ORDER BY:([T2].[a] ASC))	[T2].[a]
\| \|--Table Scan(OBJECT:([T2]))	[T2].[a]
\|--Sort(ORDER BY:([T3].[a] ASC))	[T3].[a]
\|--Table Scan(OBJECT:([T3]))	[T3].[a]

I’ve included the DefinedValues column from showplan_all as the merge join probe column is not visible any other way.

Performance considerations

All of my examples in this post use heaps. Everything I’ve written about applies equally well to tables with indexes and indeed you should create indexes if you care about performance!

Nested loops semi-joins can exhibit erratic performance. The problem is that for each outer row a semi-join can stop searching once it finds a match from its inner input. If it can find this match quickly, performance will be good. If it cannot find this match quickly, it must keep searching until it reaches the end of the inner input or it does find a match. In this case performance may not be so good. The point is that the same plan can exhibit significant performance variation due to a small change in the data. The best solution to this problem is to create appropriate indexes so that the search for matches is as efficient as possible.

Next …

There is certainly more to write about subqueries and I expect that I will do so, but I will probably move on to a new topic in my next post. I’m thinking about starting in on aggregates.

Subqueries: ANDs and ORs

Additional resources