SQL Server parallelizes a nested loops join by distributing the outer rows (i.e., the rows from the first input) randomly among the nested loops threads. For example, if we have two threads running a nested loops join, we send about half of the rows to each thread. Each thread then runs the inner side (i.e., the second input) of the loop join for its set of rows as if it were running serially. That is, for each outer row assigned to it, the thread executes its inner input using that row as the source of any correlated parameters. In this way, the threads can run independently. SQL Server does not add exchanges to or parallelize the inner side of a nested loops join.
A simple example
Let’s consider a simple example. I am using a large table to induce the optimizer to choose a parallel plan. If you try these examples, it may take a few minutes to create these tables.
create table T1 (a int, b int, x char(200))
set nocount on
declare @i int
set @i = 0
while @i < 1000000
insert T1 values(@i, @i, @i)
set @i = @i + 1
select * into T2 from T1
select * into T3 from T1
create unique clustered index T2a on T2(a)
create unique clustered index T3a on T3(a)
select * from T1 join T2 on T1.b = T2.a where T1.a < 100
|100||2|||--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[b], [Expr1007]) OPTIMIZED)|
|100||2|||--Table Scan(OBJECT:([T1]), WHERE:([T1].[a]<(100)))|
|100||100|||--Clustered Index Seek(OBJECT:([T2].[T2a]), SEEK:([T2].[a]=[T1].[b]) ORDERED FORWARD)|
First note that there is only one exchange (i.e., parallelism operator) in this plan. Since this exchange is at the root of the query plan, all of the operators in this plan (the nested loops join, the table scan, and the clustered index seek) execute in each thread.
I deliberately did not create an index on T1. The lack of an index forces the table scan and the use of a residual predicate to evaluate “T1.a < 100”. Because there are one million rows in T1, the table scan is expensive and the optimizer chooses a parallel scan of T1.
The scan of T1 is not immediately below an exchange (i.e., a parallelism operator). In fact, it is on the outer side of the nested loops join and it is the nested loops join that is below the exchange. Nevertheless, because the scan is on the outer side of the join and because the join is below a start (i.e., a gather or redistribute) exchange, we use a parallel scan for T1.
If you recall my post from last week, a parallel scan assigns pages to threads dynamically. Thus, the scan distributes the T1 rows among the threads. It does not mater which rows it distributes to which threads.
Since I ran this query with degree of parallelism (DOP) two, we see that there are two executes each for the table scan and join (which are in the same thread). Moreover, the scan and join both return a total of 100 rows though we cannot tell from this output how many rows each thread returned. (We could determine this information using statistics XML output. I'll return to this question below.)
Next, the join executes its inner side (in this case the index seek on T2) for each of the 100 outer rows. Here is where things get a little tricky. Although each of the two threads contains an instance of the index seek and even though the index seek is below the join which is below the exchange, the index seek is on the inner side of the join and, thus, the seek does not use a parallel scan. Instead, the two seek instances execute independently of one another on two different outer rows and two different correlated parameters. Just as in a serial plan, we see 100 executes of the index seek: one for each row on the outer side of the join. No matter how complex the inner side of a nested loops join is, we always execute it as a serial plan just as in this simple example.
A more complex example
In the above example, SQL Server relies of the parallel scan to distribute rows uniformly among the threads. In some cases, this is not possible. In these cases, SQL Server may add a round robin exchange to distribute the rows. (A round robin exchange sends each subsequent packet of rows to the next consumer thread in a fixed sequence.) Here is one such example:
from (select top 100 * from T1 order by a) T1top
join T2 on T1top.b = T2.a
|--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[b], [Expr1007]) WITH UNORDERED PREFETCH)
|--Parallelism(Distribute Streams, RoundRobin Partitioning)
| |--Top(TOP EXPRESSION:((100)))
| |--Parallelism(Gather Streams, ORDER BY:([T1].[a] ASC))
| |--Sort(TOP 100, ORDER BY:([T1].[a] ASC))
| |--Table Scan(OBJECT:([T1]))
|--Clustered Index Seek(OBJECT:([T2].[T2a]), SEEK:([T2].[a]=[T1].[b]) ORDERED FORWARD)
The main difference between this plan and the plan from the original example is that this plan includes a top 100. The top 100 can only be correctly evaluated in a serial plan thread. (It cannot be split among multiple threads or we might end up with too few or too many rows.) Thus, we must have a stop (e.g., a distribute streams) exchange above the top and we cannot use the parallel scan to distribute the rows among the join threads. Instead we parallelize the join, by having this exchange use round robin partitioning distribute the rows among the join threads.
The parallel scan has one major advantage over the round robin exchange. A parallel scan automatically and dynamically balances the workload among the threads while a round robin exchange does not. As I demonstrated last week, if we have query where one thread is slower than the others, the parallel scan may help compensate.
Both the parallel scan and a round robin exchange may fail to keep all join threads busy if there are too many threads and too few pages and/or rows to be processed. Some threads may get no rows to process and end up idle. This problem can be more pronounced with a parallel scan since it doles out multiple pages at one time to each thread while the exchange distributes one packet (equivalent to one page) of rows at one time.
We can see this problem in the original (simple) example above by checking the statistics XML output:
<RelOp NodeId="1" PhysicalOp="Nested Loops" LogicalOp="Inner Join" ...>
<RunTimeCountersPerThread Thread="2" ActualRows="0" ... />
<RunTimeCountersPerThread Thread="1" ActualRows="100" ... />
<RunTimeCountersPerThread Thread="0" ActualRows="0" ... />
All of the join’s output rows where processed by thread 1. Why? The table scan has a residual predicate “T1.a < 100”. This predicate is true for the first 100 rows in the table and false for the remaining rows. The (three) pages containing the first 100 rows are all assigned to the first thread.
In this example, this is not a big problem since the inner side of the join is fairly cheap and contributes a small percentage of the overall cost of the query (with the table scan itself representing a much larger percentage of the cost). However, this problem could be more significant if the inner side of the query were more expensive. The problem is especially notable with partitioned tables. I will write about partitioned tables in a future post. In the meantime, this post from the SQL Server Development Customer Advisory Team blog illustrates the problem.