Parallelizing a query with multiple “from” clauses

Consider a simplified version of Luke Hoban’s LINQ ray tracer

var Xs = Enumerable.Range(1, screenWidth);

var Ys = Enumerable.Range(1, screenHeight);

 

var sequentialQuery =  

from x in Xs

from y in Ys

select new { X = x, Y = y, Color = TraceRay(x, y) };

 

If the screen width is much larger than the screen height, we would choose to parallelize the computation along the screen width. This is because PLINQ would be more beneficial when the overhead to parallelize a given of piece work into n “chunks” is much lower than the running the “chunk of work” itself. Since the screen width is larger, there is more potential for creating bigger chunks of work and in turn getting a better payoff for the overhead of parallelizing as compared to the height. To change sequentialQuery to use PLINQ, we need to modify it to   

var parallelAlongWidthQuery =

from x in Xs.AsParallel() //Parallel along screen width

from y in Ys

select new { X = x, Y = y, Color = TraceRay(x, y) };

Let us look at parallelAlongWidthQuery more closely. We can derive the exact methods that will be invoked for this query by referring to the MSDN documentation on C# 3.026.7.1 Query Expression Translation. Section 26.7.1.4 explains that the parallelAlongWidthQuery will be translated to something like –

(Xs.AsParallel()).SelectMany(

x => Ys,

(x, y, Color) => new {X = x, Y = y, Color = TraceRay(x, y)});

 

The AsParallel () call converts Xs to an IParallelEnumerable type. [IParallelEnumerable enables the “parallel” implementations of the standard query operators] So, this query will bind to the Parallel version of SelectMany. When we run the application, we will notice that the query executes across all available processors on the machine.

We have successfully parallelized the application!!

 

However, assume that our dataset was different to begin with and the screen height was much larger than the screen width. Using the same logic for parallelizing along the larger sized dataset, we would choose Ys this time. To achieve this, it might be tempting to change the query to

var parallelAlongHeightQuery =     

from x in Xs

from y in Ys.AsParallel()//Parallel along screen height??

select new { X = x, Y = y, Color = TraceRay(x, y) };

 

Let us apply the query translation to parallelAlongHeightQuery. Using the MSDN documentation on C# 3.0 rules, we can say that parallelAlongHeightQuery will be translated to something like –

(Xs).SelectMany(

x => Ys.AsParallel(),

(x, y, Color) => new {X = x, Y = y, Color = TraceRay(x, y)});

 

Xs is of Enumerable type and the compiler will bind to the LINQ version of SelectMany and not the expected parallel version! When we run the application, we will notice that the query does not execute across all available processors on the machine.

The 2 “from” clauses are not symmetric as they seem!

A good way to force parallelization across the screen height is to specify it as the first data source –

var parallelAlongHeightQuery =     

from y in Ys.AsParallel()//Parallel along screen height

from x in Xs

select new { X = x, Y = y, Color = TraceRay(x, y) };

This will translate to something like –

(Ys.AsParallel()).SelectMany(

x => Xs,

(x, y, Color) => new {X = x, Y = y, Color = TraceRay(x, y)});

Since Ys is now an IParallelEnumerable, this query will bind to the “parallel” version of SelectMany. When executed, we will notice that the query now runs across all available processors.

One more point to note is that even though SelectMany accepts mutliple datasources, PLINQ will partition work across the first datasource only. This is because it assumes that partioning across just one of the dataset will give enough parallelization and speedup.