If you look in the PLINQ samples in the December 2007 CTP, you’ll see a parallel implementation of Luke Hoban’s LINQ ray tracer. The sample parallelizes the ray tracer by changing very few lines of code. Luke’s original query started as follows:
from y in Enumerable.Range(0, screenHeight)
For our sample, we’ve changed that to:
from y in ParallelEnumerable.Range(0, screenHeight)
The result is that on a multi-core machine, multiple threads participate in the rendering of the ray traced image, thanks to the PLINQ implementation of the .NET Standard Query Operators. However, many of you probably noticed (or notice now that you’re reading this post) that this approach to parallelizing a query differs from the one we typically tout when discussing PLINQ, that of using the AsParallel extension method. In fact, this query could have been parallelized just as easily as follows:
from y in Enumerable.Range(0, screenHeight).AsParallel()
This will also result in multiple threads participating in the rendering on a multi-core machine. So, is there a difference between the two? There is. And in fact, for the ray tracer, the AsParallel approach (the one not taken by the samples) could result in faster rendering speeds, depending on the number of cores involved. (Try it for yourself and see if my assertion holds.)
Why do I believe the AsParallel approach may be faster? Other than the answer “I tried it on my dual-core,” the key is the kind of data partitioning these two approaches will use and the applicability of those partitioning algorithms to this particular problem domain. (Note that on a 16 core I tried it on, there was negligable difference.)
By default for a typical IEnumerable<T> when doing a query involving projections, filters, and the like, PLINQ will opt for chunk partitioning. This means that multiple threads involved in the operation will pull chunks of data from the enumerable. So, for example, thread 1 may grab the next eight items from the enumerable, thread 2 will then grab the next eight, and so forth. Since, by default, we don’t know anything about the thread-safety characteristics of the enumerable, we have to assume that it is not safe to access concurrently (e.g. MoveNext and Current must not be called concurrently from multiple threads), and thus access to the enumerable must be protected by a lock. This is the behavior you get when you use Enumerable.Range(…).AsParallel(); Enumerable.Range returns an enumerable foreign to PLINQ, and PLINQ uses chunk partitioning to distribute the data from this enumerable between all threads involved in the computation. In terms of the ray tracer, based on its implementation this means that each thread will grab a set of lines to render, render them, and then go back to grab the next available set of lines, render those, and so forth. (Note that PLINQ is a bit more sophisticated in terms of how much data it decides to retrieve from the enumerable on each trip back to it, but that’s largely irrelevant to this discussion.)
Now consider ParallelEnumerable.Range. PLINQ knows about the enumerable returned from ParallelEnumerable.Range because it recognizes the concrete type of the enumerable returned… it knows it represents a range, and it knows the bounds of that range. PLINQ has special logic it can use when it knows the bounds of a supplied enumerable and can index into it: range partitioning. Consider the kinds of optimizations PLINQ could make if it detected that the IEnumerable<T> provided to it was actually an IList<T>; it knows the size of the IList<T>, and it can index into the IList<T>. Thus, rather than using a lock to grab chunks from the IEnumerable using its MoveNext and Current members, it can simply divide up the list into logical ranges and hand one of those ranges off to each participating thread. The thread can then happily process that range without needing to rely on MoveNext and Current; this can be faster, as it avoids those interface calls, but also because it avoids continually locking and unlocking the underlying enumerable. PLINQ can do the same thing with ParallelEnumerable.Range. The difference, then, between the AsParallel and the ParallelEnumerable approaches is that the former uses chunk partitioning whereas the latter uses range partitioning like that just described.
You might notice some discrepancies in what I just wrote. On the one hand, I said that range partitioning is an optimization. On the other hand, I said that for the ray tracer, chunk partitioning may be faster. What gives?
In its current implementation, range partitioning assumes that the work required to process each element will be largely homogenous and will take roughly the same time. It splits the input data source into pieces based on the size of the input, not based on any real-time evaluation of how long previous elements from that data source have taken. In this sense, it’s a static process. Contrast this with chunk partitioning, which is much more dynamic in nature. If one set of elements takes much longer to process than another set, the thread processing the other set can go back to the enumerable to get more. Thus, it’s possible for one thread to process many more elements than another thread, which isn’t the norm for range partitioning. Now, many problem domains lend themselves well to range partitioning. Ray tracing, unfortunately, is not one of them.
Ray tracing works by, in effect, firing rays from a viewing location towards each pixel in the image. It checks for intersections between that ray and objects in the scene, finding the first object the ray hits. The color for the pixel through which the ray was fired will be affected by the color/texture of that intersected object. However, as in the real-world, rays in a ray tracer can reflect and refract based on the objects intersected. Thus, the process recurs and new rays are directed from the intersected object at appropriate angles. The colors of those rays are mixed with the color of the original intersected object to produce the actual color value for the pixel in question. I say all this to show that the amount of work that’s needed for each pixel depends partly on how many objects are intersected while processing that pixel. If you look at the top of the image rendered by the PLINQ ray tracer, you’ll see that there are no objects up there. Rays fired at those pixels don’t hit any objects, and thus are processed much quicker than pixels in the middle or at the bottom of the image, where many objects are touched as rays reflect all over the place. The result of all this is that range partitioning is particularly ineffective for ray tracing, because portions of the image will take much longer to render than other portions of the image.
Try running the ray tracer with the PLINQ code included with the CTP samples. On a dual-core machine, you’ll see that one thread starts rendering at the top of the image, while a second thread starts rendering in the middle of the image. This is range partitioning at work. The range (the horizontal scan lines of the image) have been split in half, with the top half processed by one thread and the bottom half processed by another. You’ll notice that the top half is completed much faster than the bottom half; this causes that first thread to sit idle while the second thread struggles through the remainder of the image. Now change the code to use the AsParallel version, and thus chunk partitioning. Run the parallel implementation again, and you should notice the image is rendered largely from top to bottom. You can actually see threads grabbing chunks of horizontal scan lines, rendering them, and then going back for more. The effect of this is that one thread doesn’t sit idly by for too long while another thread finishes its range; rather, as soon as one thread is done processing its chunk, it goes back to the enumerable to grab another. The only time a thread will sit idle is if there is no more work left to process from the enumerable.
Keep in mind that everything I’ve just described is in reference to the December 2007 CTP, and details will likely change as we improve the system. We have a lot of ideas for how to address these kinds of issues, so expect to see better and better performance from PLINQ as we move forward.