OPTIMIZED Nested Loops Joins

In my past two posts, I explained how SQL Server may add a sort to the outer side of a nested loops join and showed how this sort can significantly improve performance.  In an earlier post, I discussed how SQL Server can use random prefetching to improve the performance of a nested loops join.  In this post, I'm going to explore one more nested loops join performance feature.  I'll use the same database that I used in my two prior posts.  Let's start with the following simple query:

SELECT SUM(Data)
FROM T
WHERE RandKey < 1000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012] END))
       |--Stream Aggregate(DEFINE:([Expr1011]=COUNT_BIG([T].[Data]), [Expr1012]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1010]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (1000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

Notice that the nested loops join includes an extra keyword: OPTIMIZED.  This keyword indicates that the nested loops join may try to reorder the input rows to improve I/O performance.  This behavior is similar to the explicit sorts that we saw in my two previous posts, but unlike a full sort it is more of a best effort.  That is, the results from an optimized nested loops join may not be (and in fact are highly unlikely to be) fully sorted.

SQL Server only uses an optimized nested loops join when the optimizer concludes based on its cardinality and cost estimates that a sort is most likely not required, but where there is still a possibility that a sort could be helpful in the event that the cardinality or cost estimates are incorrect.  In other words, an optimized nested loops join may be thought of as a "safety net" for those cases where SQL Server chooses a nested loops join but would have done better to have chosen an alternative plan such as a full scan or a nested loops join with an explicit sort.  For the above query which only joins a few rows, the optimization is unlikely to have any impact at all.
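You can check whether a given plan uses this feature by examining its text showplan.  Assuming the table T from the earlier posts, something like the following displays the plan text, including the OPTIMIZED keyword, without executing the query:

```sql
SET SHOWPLAN_TEXT ON
GO

SELECT SUM(Data)
FROM T
WHERE RandKey < 1000
GO

SET SHOWPLAN_TEXT OFF
GO
```

In XML showplans, the same information appears as an Optimized="true" attribute on the Nested Loops operator.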

Let's look at an example where the optimization actually helps:

SELECT SUM(Data)
FROM T
WHERE RandKey < 100000000 AND
    Flags & 0x1 = 0x1 AND
    Flags & 0x2 = 0x2 AND
    Flags & 0x4 = 0x4 AND
    Flags & 0x8 = 0x8

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1014]=(0) THEN NULL ELSE [Expr1015] END))
       |--Stream Aggregate(DEFINE:([Expr1014]=COUNT_BIG([T].[Data]), [Expr1015]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1013]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (100000000)),  WHERE:(([T].[Flags]&(1))=(1) AND ([T].[Flags]&(2))=(2) AND ([T].[Flags]&(4))=(4) AND ([T].[Flags]&(8))=(8)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

The Flags column contains the value 0xFF in every row.  Thus, every one of the bitwise AND predicates evaluates to true and this query returns about 2.5 million rows or 10% of the table.  Ordinarily, when faced with a query like this one, SQL Server would resort to a sequential scan of the entire table.  Indeed, if you try this query without the extra bitwise filters, you will get a sequential scan.  However, SQL Server does not realize that these predicates are always true, estimates a much lower cardinality of less than 10,000 rows, and chooses a simple nested loops join plan.  Note that I would generally recommend against using predicates like these ones in a real world application precisely because they will lead to cardinality estimation errors and poor plans.
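If predicates like these are unavoidable, one workaround (a sketch; the column and index names here are illustrative, not from the original schema) is to persist the combined test in a computed column, which gives the optimizer something it can build statistics on:

```sql
-- Persist the combined bitwise test as a real column ...
ALTER TABLE T ADD AllFlagsSet AS
    CAST(CASE WHEN Flags & 0x0F = 0x0F THEN 1 ELSE 0 END AS BIT) PERSISTED

-- ... and index it so statistics are created on it
CREATE INDEX IAllFlagsSet ON T (AllFlagsSet)
```

Filtering on AllFlagsSet = 1 then lets the optimizer estimate from the column's statistics instead of guessing at the bitwise expressions.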

To see what effect the optimized nested loops join has, let's compare the above plan with an "un-optimized" nested loops join.  We can eliminate the optimization by using the following UPDATE STATISTICS statement to trick SQL Server into believing that the table is very small:
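The statement takes roughly the following form (a sketch; the specific counts are illustrative, and any sufficiently small values have the same effect):

```sql
UPDATE STATISTICS T WITH ROWCOUNT = 1, PAGECOUNT = 1
```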


I'll compare the above query with the following simpler query which uses essentially the same plan and touches the same data but has an "un-optimized" nested loops join:

SELECT SUM(Data)
FROM T
WHERE RandKey < 100000000

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1009]=(0) THEN NULL ELSE [Expr1010] END))
       |--Stream Aggregate(DEFINE:([Expr1009]=COUNT_BIG([T].[Data]), [Expr1010]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK]))
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (100000000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

We can reset the statistics using the following statement:
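The reset can be written roughly as follows (a sketch: UPDATE STATISTICS recomputes the statistics from the actual data, and DBCC UPDATEUSAGE with COUNT_ROWS corrects the faked row and page counts):

```sql
UPDATE STATISTICS T
DBCC UPDATEUSAGE (0, 'T') WITH COUNT_ROWS
```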


As in my last post, I'm going to simulate a larger table by reducing the memory available to the server to 1 GByte with SP_CONFIGURE 'MAX SERVER MEMORY' and I'm also going to flush the buffer pool between runs with DBCC DROPCLEANBUFFERS.

Note that you will NOT want to run these statements on a production server.
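For completeness, the setup looks roughly like this (again, not on a production server; 'MAX SERVER MEMORY' is an advanced option and is expressed in MBytes):

```sql
EXEC SP_CONFIGURE 'SHOW ADVANCED OPTIONS', 1
RECONFIGURE
EXEC SP_CONFIGURE 'MAX SERVER MEMORY', 1024
RECONFIGURE

-- Flush the buffer pool between runs
DBCC DROPCLEANBUFFERS
```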

I ran both of the above queries with three different constants.  Here are my results.  Keep in mind that these results depend greatly on the specific hardware.  If you try this experiment, your results may vary.


Execution Time

                  Optimized join    "Un-optimized" join
(1% of rows)      6.5 minutes       26 minutes
(10% of rows)     10.4 minutes      4.3 hours
(25% of rows)     11.3 minutes      10.6 hours


Clearly the optimized nested loops join can have a huge impact on performance.  Moreover, as the plan touches more rows the benefit of the optimization grows dramatically.  Although a full scan or a nested loops join with an explicit sort would be faster, the optimized nested loops join really is a safety net protecting against a much worse alternative.

Comments (4)
  1. Steve.Ash says:

Some questions/clarifications:

    1) What is the trade-off between doing the OPTIMIZED loop join vs the explicit sort operator?  I.e. do you know or have any intuition over how "much better" the sort is?  Or does it just depend on how "unsorted" the references are relative to the index keys?  

    2) Is the optimized loop considered a blocking operator?  Would it make my plan static if used in a cursor?  

    3) Are there any locking implications under read committed (not rcsi) of an optimized nested loop join vs a nested loop join with pre-fetching?  I.e. you have to hold the shared locks for the whole statement due to pre-fetching anyways, right?

4) In the second plan, I see no PREFETCH keyword, so this plan uses no pre-fetching?  Why was pre-fetching not used here?  And as far as the results that you posted, how much of the difference was due to OPTIMIZED and how much was due to pre-fetching?

    5) Can you give any insight into how the optimized is sorting?  Is it just sorting whatever results are in the read-ahead buffer before the read cursor gets there?  Or will the cursor be "stopped" to allow more async I/Os to be completed to "better sort" them?

    6) Given question #5- I assume that OPTIMIZED must always use pre-fetching as well?

    Thanks so much!

    Steve Ash

  2. 1) This is difficult to assess.  An OPTIMIZED join never generates additional I/O.  An explicit sort may run out of memory and spill runs to disk creating additional I/Os.  So, in a pathological scenario, if the outer input to the join is already sorted but very large, the explicit sort could make matters worse.  On the other hand, if the explicit sort runs in memory, it could outperform the OPTIMIZED join by a decent margin.

    2) Yes, the OPTIMIZED join is a blocking operator.  However, the cursor code will revert to a regular loop join if you request a dynamic or fast forward cursor.

3) Ordinary loop joins do not extend lock duration whether OPTIMIZED or WITH PREFETCH.  An OPTIMIZED bookmark lookup join does extend lock duration as I described here: blogs.msdn.com/…/read-committed-and-bookmark-lookup.aspx.  However, as you note, a bookmark lookup join with PREFETCH also requires holding locks until the end of the statement, so combining OPTIMIZED with PREFETCH will not affect the lock duration.

4) Good question.  I ran these experiments and wrote this post over a year ago.  Truthfully, I do not recall all of the details of the experiment.  However, I'm fairly sure that most of the performance difference that I measured is due to the use of the OPTIMIZED join.  It's worth noting that I ran these experiments on a machine with a single disk, so the gain achievable by PREFETCH is limited to overlapping CPU and I/O (vs. driving multiple spindles simultaneously as discussed here: blogs.msdn.com/…/sequential-read-ahead.aspx).  Moreover, this type of performance gain for a single spindle is typically indicative of a change from random to sequential I/Os.

    5) Sorry, I cannot delve into that level of detail here.

    6) OPTIMIZED and WITH PREFETCH are separate features and can be used independently though they often do show up together.

    HTH and sorry for the delay responding …


  3. Gullimeel says:


    I would like to point out that the merge join algorithm is not efficient.

When the inner table is accessed, the scan starts at the first row and continues until it finds a join key value greater than the join key from the outer table.  As soon as the inner key exceeds the outer key, the scan can stop; ending the scan this way is quite efficient.  However, the scan always starts at the first row (assuming there are no other sargs that cause the scan to start elsewhere or cause another index to be used), and this part can be very inefficient.  For example, suppose the inner table has a clustered index on id and the outer table's id values range from a minimum of 100000 to a maximum of 105000.  The scan of the inner table should start at 100000 and end at 105000.  Currently that is not what happens: the scan starts at 1 and finishes at 105000, so rows 1 through 99999 are processed uselessly.  This wastes CPU and causes extra logical reads.

    I have used an alternate method to make sure that scan starts at 100000 instead of 1. You will find the query and more details in the below link.
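One way to express such a bound manually (a sketch only, not necessarily the method behind the link; the table and column names are illustrative) is to add explicit range predicates derived from the outer table, so the merge join's scan of the inner table can seek directly to the relevant key range:

```sql
-- Hypothetical tables OuterT(id) and InnerT(id, Data), both with
-- clustered indexes on id.  The extra predicates bound the inner scan
-- to the outer table's key range instead of starting at the first row.
SELECT O.id, I.Data
FROM OuterT O
JOIN InnerT I ON I.id = O.id
WHERE I.id >= (SELECT MIN(id) FROM OuterT)
  AND I.id <= (SELECT MAX(id) FROM OuterT)
```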




  4. Martin Smith says:

    @Steve I was curious for more details about implementation too.

    It is of course possible to infer some general idea from trying different values of @i  and looking at the actual number of rows in the plan.

    DECLARE @Data INT, @i INT = 1

    SELECT TOP (@i) @Data = Data
    FROM T
    WHERE RandKey < 100000000 AND
        Flags & 0x1 = 0x1 AND
        Flags & 0x2 = 0x2 AND
        Flags & 0x4 = 0x4 AND
        Flags & 0x8 = 0x8
    OPTION (MAXDOP 1, OPTIMIZE FOR (@i=10000))
