OPTIMIZED Nested Loops Joins

In my past two posts, I explained how SQL Server may add a sort to the outer side of a nested loops join and showed how this sort can significantly improve performance.  In an earlier post, I discussed how SQL Server can use random prefetching to improve the performance of a nested loops join.  In this post, I'm going to explore one more nested loops join performance feature.  I'll use the same database that I used in my two prior posts.  Let's start with the following simple query:

SELECT SUM(Data)
FROM T
WHERE RandKey < 1000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012] END))
       |--Stream Aggregate(DEFINE:([Expr1011]=COUNT_BIG([T].[Data]), [Expr1012]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1010]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (1000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

Notice that the nested loops join includes an extra keyword: OPTIMIZED.  This keyword indicates that the nested loops join may try to reorder the input rows to improve I/O performance.  This behavior is similar to the explicit sorts that we saw in my two previous posts, but unlike a full sort it is more of a best effort.  That is, the results from an optimized nested loops join may not be (and in fact are highly unlikely to be) fully sorted.

SQL Server only uses an optimized nested loops join when the optimizer concludes based on its cardinality and cost estimates that a sort is most likely not required, but where there is still a possibility that a sort could be helpful in the event that the cardinality or cost estimates are incorrect.  In other words, an optimized nested loops join may be thought of as a "safety net" for those cases where SQL Server chooses a nested loops join but would have done better to have chosen an alternative plan such as a full scan or a nested loops join with an explicit sort.  For the above query which only joins a few rows, the optimization is unlikely to have any impact at all.
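You can check whether a given plan uses this feature by examining its text showplan.  Assuming the table T from the earlier posts, something like the following displays the plan text, including the OPTIMIZED keyword, without executing the query:

```sql
SET SHOWPLAN_TEXT ON
GO

SELECT SUM(Data)
FROM T
WHERE RandKey < 1000
GO

SET SHOWPLAN_TEXT OFF
GO
```

In XML showplans, the same information appears as an Optimized="true" attribute on the Nested Loops operator.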

Let's look at an example where the optimization actually helps:

SELECT SUM(Data)
FROM T
WHERE RandKey < 100000000 AND
    Flags & 0x1 = 0x1 AND
    Flags & 0x2 = 0x2 AND
    Flags & 0x4 = 0x4 AND
    Flags & 0x8 = 0x8

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1014]=(0) THEN NULL ELSE [Expr1015] END))
       |--Stream Aggregate(DEFINE:([Expr1014]=COUNT_BIG([T].[Data]), [Expr1015]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1013]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (100000000)),  WHERE:(([T].[Flags]&(1))=(1) AND ([T].[Flags]&(2))=(2) AND ([T].[Flags]&(4))=(4) AND ([T].[Flags]&(8))=(8)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

The Flags column contains the value 0xFF in every row.  Thus, every one of the bitwise AND predicates evaluates to true and this query returns about 2.5 million rows or 10% of the table.  Ordinarily, when faced with a query like this one, SQL Server would resort to a sequential scan of the entire table.  Indeed, if you try this query without the extra bitwise filters, you will get a sequential scan.  However, SQL Server does not realize that these predicates are always true, estimates a much lower cardinality of less than 10,000 rows, and chooses a simple nested loops join plan.  Note that I would generally recommend against using predicates like these ones in a real world application precisely because they will lead to cardinality estimation errors and poor plans.
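If predicates like these are unavoidable, one workaround (a sketch; the column and index names here are illustrative, not from the original schema) is to persist the combined test in a computed column, which gives the optimizer something it can build statistics on:

```sql
-- Persist the combined bitwise test as a real column ...
ALTER TABLE T ADD AllFlagsSet AS
    CAST(CASE WHEN Flags & 0x0F = 0x0F THEN 1 ELSE 0 END AS BIT) PERSISTED

-- ... and index it so statistics are created on it
CREATE INDEX IAllFlagsSet ON T (AllFlagsSet)
```

Filtering on AllFlagsSet = 1 then lets the optimizer estimate from the column's statistics instead of guessing at the bitwise expressions.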

To see what effect the optimized nested loops join has, let's compare the above plan with an "un-optimized" nested loops join.  We can eliminate the optimization by using the following UPDATE STATISTICS statement to trick SQL Server into believing that the table is very small:
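The statement takes roughly the following form (a sketch; the specific counts are illustrative, and any sufficiently small values have the same effect):

```sql
UPDATE STATISTICS T WITH ROWCOUNT = 1, PAGECOUNT = 1
```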


I'll compare the above query with the following simpler query which uses essentially the same plan and touches the same data but has an "un-optimized" nested loops join:

SELECT SUM(Data)
FROM T
WHERE RandKey < 100000000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1009]=(0) THEN NULL ELSE [Expr1010] END))
       |--Stream Aggregate(DEFINE:([Expr1009]=COUNT_BIG([T].[Data]), [Expr1010]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK]))
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (100000000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

We can reset the statistics using the following statement:
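The reset can be written roughly as follows (a sketch: UPDATE STATISTICS recomputes the statistics from the actual data, and DBCC UPDATEUSAGE with COUNT_ROWS corrects the faked row and page counts):

```sql
UPDATE STATISTICS T
DBCC UPDATEUSAGE (0, 'T') WITH COUNT_ROWS
```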


As in my last post, I'm going to simulate a larger table by reducing the memory available to the server to 1 GByte with SP_CONFIGURE 'MAX SERVER MEMORY' and I'm also going to flush the buffer pool between runs with DBCC DROPCLEANBUFFERS.

Note that you will NOT want to run these statements on a production server.
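For completeness, the setup looks roughly like this (again, not on a production server; 'MAX SERVER MEMORY' is an advanced option and is expressed in MBytes):

```sql
EXEC SP_CONFIGURE 'SHOW ADVANCED OPTIONS', 1
RECONFIGURE
EXEC SP_CONFIGURE 'MAX SERVER MEMORY', 1024
RECONFIGURE

-- Flush the buffer pool between runs
DBCC DROPCLEANBUFFERS
```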

I ran both of the above queries with three different constants.  Here are my results.  Keep in mind that these results depend greatly on the specific hardware.  If you try this experiment, your results may vary.


Execution Time

                  Optimized join    "Un-optimized" join
(1% of rows)      6.5 minutes       26 minutes
(10% of rows)     10.4 minutes      4.3 hours
(25% of rows)     11.3 minutes      10.6 hours


Clearly the optimized nested loops join can have a huge impact on performance.  Moreover, as the plan touches more rows the benefit of the optimization grows dramatically.  Although a full scan or a nested loops join with an explicit sort would be faster, the optimized nested loops join really is a safety net protecting against a much worse alternative.

Comments (4)
  1. Steve.Ash says:

Some questions/clarifications:

    1) What is the trade-off between doing the OPTIMIZED loop join vs the explicit sort operator?  I.e. do you know or have any intuition over how "much better" the sort is?  Or does it just depend on how "unsorted" the references are relative to the index keys?  

    2) Is the optimized loop considered a blocking operator?  Would it make my plan static if used in a cursor?  

    3) Are there any locking implications under read committed (not rcsi) of an optimized nested loop join vs a nested loop join with pre-fetching?  I.e. you have to hold the shared locks for the whole statement due to pre-fetching anyways, right?

4) In the second plan, I see no PREFETCH keyword, so this plan uses no pre-fetching?  Why was pre-fetching not used here?  And as far as the results that you posted, how much of the difference was due to OPTIMIZED and how much was due to pre-fetching?

    5) Can you give any insight into how the optimized is sorting?  Is it just sorting whatever results are in the read-ahead buffer before the read cursor gets there?  Or will the cursor be "stopped" to allow more async I/Os to be completed to "better sort" them?

    6) Given question #5- I assume that OPTIMIZED must always use pre-fetching as well?

    Thanks so much!

    Steve Ash

  2. 1) This is difficult to assess.  An OPTIMIZED join never generates additional I/O.  An explicit sort may run out of memory and spill runs to disk creating additional I/Os.  So, in a pathological scenario, if the outer input to the join is already sorted but very large, the explicit sort could make matters worse.  On the other hand, if the explicit sort runs in memory, it could outperform the OPTIMIZED join by a decent margin.

    2) Yes, the OPTIMIZED join is a blocking operator.  However, the cursor code will revert to a regular loop join if you request a dynamic or fast forward cursor.

3) Ordinary loop joins do not extend lock duration whether OPTIMIZED or WITH PREFETCH.  An OPTIMIZED bookmark lookup join does extend lock duration as I described here: blogs.msdn.com/…/read-committed-and-bookmark-lookup.aspx.  However, as you note, a bookmark lookup join with PREFETCH also requires holding locks until the end of the statement, so combining OPTIMIZED with PREFETCH will not affect the lock duration.

4) Good question.  I ran these experiments and wrote this post over a year ago.  Truthfully, I do not recall all of the details of the experiment.  However, I'm fairly sure that most of the performance difference that I measured is due to the use of the OPTIMIZED join.  It's worth noting that I ran these experiments on a machine with a single disk, so the gain achievable by PREFETCH is limited to overlapping CPU and I/O (vs. driving multiple spindles simultaneously as discussed here: blogs.msdn.com/…/sequential-read-ahead.aspx).  Moreover, this type of performance gain for a single spindle is typically indicative of a change from random to sequential I/Os.

    5) Sorry, I cannot delve into that level of detail here.

    6) OPTIMIZED and WITH PREFETCH are separate features and can be used independently though they often do show up together.

    HTH and sorry for the delay responding …


  3. Gullimeel says:


    I would like to point out that the merge join algorithm is not efficient.

When the inner table is accessed, the scan starts at the first row and continues until it finds a join key value greater than the join key from the outer table.  As soon as the inner key exceeds the outer key, the scan can stop; ending the scan this way is quite efficient.  However, the scan always starts at the first row (assuming there are no other sargs that cause the scan to start elsewhere or cause another index to be used), and this part can be very inefficient.  For example, suppose the inner table has a clustered index on id and the outer table's id values range from a minimum of 100000 to a maximum of 105000.  The scan of the inner table should start at 100000 and end at 105000.  Currently that is not what happens: the scan starts at 1 and finishes at 105000, so rows 1 through 99999 are processed uselessly.  This wastes CPU and causes extra logical reads.

    I have used an alternate method to make sure that scan starts at 100000 instead of 1. You will find the query and more details in the below link.
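One way to express such a bound manually (a sketch only, not necessarily the method behind the link; the table and column names are illustrative) is to add explicit range predicates derived from the outer table, so the merge join's scan of the inner table can seek directly to the relevant key range:

```sql
-- Hypothetical tables OuterT(id) and InnerT(id, Data), both with
-- clustered indexes on id.  The extra predicates bound the inner scan
-- to the outer table's key range instead of starting at the first row.
SELECT O.id, I.Data
FROM OuterT O
JOIN InnerT I ON I.id = O.id
WHERE I.id >= (SELECT MIN(id) FROM OuterT)
  AND I.id <= (SELECT MAX(id) FROM OuterT)
```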




  4. Martin Smith says:

    @Steve I was curious for more details about implementation too.

    It is of course possible to infer some general idea from trying different values of @i  and looking at the actual number of rows in the plan.

    DECLARE @Data INT, @i INT = 1

    SELECT TOP (@i) @Data = Data
    FROM T
    WHERE RandKey < 100000000 AND
        Flags & 0x1 = 0x1 AND
        Flags & 0x2 = 0x2 AND
        Flags & 0x4 = 0x4 AND
        Flags & 0x8 = 0x8
    OPTION (MAXDOP 1, OPTIMIZE FOR (@i=10000))
