Query Data with Parallel LINQ


This post shows a simple way to write code that takes advantage of multiple processors. You will see that LINQ queries can let you sidestep the difficult tasks normally involved in writing multi-threaded code. To get started, all you need is a little basic knowledge of how to write simple LINQ queries.

The code shown in this post uses a pre-release version of PLINQ called the Microsoft Parallel Extensions to .NET Framework 3.5. When PLINQ finally ships, it will run only on .NET 4.0 or later. The version I'm using that runs on top of 3.5 is for evaluation purposes only. There will never be a shipping version that runs on .NET 3.5.

This LINQ provider is being created at Microsoft by the Parallel Computing team; it is not the work of the C# team that created LINQ to Objects and LINQ to SQL. Here is the website for the Parallel Computing team:

http://msdn.microsoft.com/en-us/concurrency/

At the time of this writing, these extensions were available only in pre-release form. You could download them either as Visual Studio 2008 compatible extensions to .NET 3.5, or as part of the pre-release version of Visual Studio 2010. Since the download sites might change over the coming months, I suggest that you find these resources by going to the Parallel Computing site, or to the Visual Studio site:

http://msdn.microsoft.com/en-us/vs2008

Parallel LINQ, or PLINQ, is only a small part of the Parallel Extensions to the .NET Framework. It is, however, an important part. Since it is a simple and natural extension of the LINQ syntax, I think developers familiar with that technology will find it easy to use.

Consider this code:

var list = Enumerable.Range(1, 10000);

var q = from x in list.AsParallel()
        where x < 3300
        select x;

foreach (var x in q)
{
    Console.WriteLine(x);
}

These lines look nearly identical to the code you have seen in many simple LINQ samples. The only significant difference is the call to AsParallel at the end of the first line. Though we have often used type inference to hide the return type of a LINQ query, I'm going to pause and take a second look at this instance. Rather than returning IEnumerable<T>, this version of PLINQ returns IParallelEnumerable<int>:

IParallelEnumerable<int> q = from x in list.AsParallel() etc….

In the near future, PLINQ queries of this type will probably return ParallelQuery<int>. Because this product is still evolving, it might be simplest to use var, at least during the pre-release phase, and let the compiler choose the type. That way you save typing, avoid problems with anonymous types, and need not concern yourself with changes in the API as the product develops. Using var for the return type of a LINQ query is almost always appropriate; only special circumstances call for anything else.

Here are the results from this first PLINQ query:

2
1
3
4
6
512
5
7
513
8
12
514
9
13
515
10
14
516
11
15
517
16
72
518
17

The numbers shown here appear in a scrambled order because they are being returned from different threads. It is important to remember that the sequence of values returned by a PLINQ query is not guaranteed to arrive in any particular order. If ordering is important in your code, you can add a call to AsOrdered to the query after the call to AsParallel. Alternatively, you could add an orderby clause to establish the desired ordering. Otherwise, you should assume that the ordering of a PLINQ query's results is unspecified.
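As a sketch of the ordered alternative, here is the first query rewritten with AsOrdered, as it appears in recent pre-release builds (the exact API may change before release):

```csharp
// AsOrdered asks PLINQ to preserve the source order of the results.
// It adds bookkeeping overhead, so request it only when order matters.
var ordered = from x in list.AsParallel().AsOrdered()
              where x < 3300
              select x;

foreach (var x in ordered)
{
    Console.WriteLine(x); // 1, 2, 3, ... in their original order
}
```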

Now that you understand the basics of Parallel LINQ, let’s move on to look at a more interesting example. Improved performance is the main reason to write code that can run in parallel. The program shown in this post uses a timer to demonstrate how PLINQ can improve performance in a program.

Performance improvements become more evident when our code has access to more processors. The code I show here runs faster on a two-processor machine, but it really starts to come into its own on a four-processor machine, and moving up to still more processors yields further gains. Here, for instance, are results showing an improvement of 1.44 times when using two processors, and almost two times when using four processors:

2 Processors = 1.44 x improvement:
Linear: 00:00:13.15
Parallel: 00:00:09.10

4 Processors = 1.96 x improvement:
Linear: 00:00:15.00
Parallel: 00:00:07.68

These tests were run against pre-release software, so these numbers are almost certain to change before release, and of course different machines will yield different results. Furthermore, the degree of improvement you see is likely to vary with the type of algorithm you run, the number of cores on your machine, the architecture of the machine, how many caches there are and how they’re laid out, and so on. Though it is rare, some queries show superlinear performance enhancements; in other words, there is a greater than 4x speedup on a 4-core box. An improvement of two times, such as the one shown, or even a threefold improvement, is common.

This sample program is called FakeWeatherData, and it is available for download from the LINQ Farm on Code Gallery. It features a simple LINQ to XML query run against a file with 10,000 records in it. The data I'm querying is not real, but consists of random dates and temperatures generated by a simple algorithm included in the FakeWeatherData program.

The XML file is structured like this:

<?xml version="1.0" encoding="utf-8" ?>
<Samples>
  <Sample>
    <Year>1973</Year>
    <Month>May</Month>
    <Day>15</Day>
    <Temperature>10</Temperature>
  </Sample>
  <Sample>
    <Year>1970</Year>
    <Month>Feb</Month>
    <Day>10</Day>
    <Temperature>14</Temperature>
  </Sample>
  <Sample>
    <Year>1970</Year>
    <Month>Jan</Month>
    <Day>15</Day>
    <Temperature>11</Temperature>
  </Sample>
  ... Many Sample elements omitted here
</Samples>

There is also a simple C# class used by the program to encapsulate the data from the XML file:

class WeatherData
{
    public string Year { get; set; }
    public string Month { get; set; }
    public string Day { get; set; }
    public string Temperature { get; set; }
}

The parallel version of the query in the program looks like this:

for (int i = 0; i < NUM_REPS; i++)
{
    var list = (from x in doc.Root.Elements("Sample").AsParallel()
                where x.Element("Year").Value == "1973" &&
                   x.Element("Month").Value == "Apr" &&
                   x.Element("Day").Value == "15"
                select new WeatherData
                {
                    Day = x.Element("Day").Value,
                    Month = x.Element("Month").Value,
                    Temperature = x.Element("Temperature").Value,
                    Year = x.Element("Year").Value
                }).ToList();

}

Accompanying this code is a similar LINQ query that does not use PLINQ:

for (int i = 0; i < NUM_REPS; i++)
{
    var list = (from x in doc.Root.Elements("Sample")
                where x.Element("Year").Value == "1973" && 
                   x.Element("Month").Value == "Apr" && 
                   x.Element("Day").Value == "15"
                select new WeatherData
                {
                    Day = x.Element("Day").Value,
                    Month = x.Element("Month").Value,
                    Temperature = x.Element("Temperature").Value,
                    Year = x.Element("Year").Value
                }).ToList();

}

The program queries the data in the XML file first using the parallel code, then using standard LINQ. By comparing the time it takes each block of code to execute, you can get a sense of the relative improvement available through PLINQ. I'll show you how to make such comparisons in just a moment, and I will also mention some tools that will become available to help profile code of this type.

You can see that the PLINQ query contains a call to AsParallel, while the other query does not. Other than that the two queries are identical. The fact that the two queries look so much alike points to a primary strength of PLINQ: very little specialized knowledge is necessary in order to begin using it. This does not mean that the subject is trivial, but only that the barrier to entry is low. This is not the case with most concurrent programming models.

LINQ queries are designed to be read-only, working with immutable data. This is a good model for parallelism, because it makes it unlikely that data will mutate and set up the potential for a race condition. Note, however, that PLINQ does nothing to prevent mutation; it is simply that LINQ's design makes it unlikely.
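To see the kind of trouble PLINQ will not catch for you, consider this hypothetical query, which mutates a variable shared across threads. This is a sketch of an anti-pattern, not code you should write:

```csharp
int total = 0;

// The selector below writes to total from multiple threads at once.
// The read-modify-write in total += x is not atomic, so updates can
// be lost, and PLINQ will run the query without complaint.
var sums = (from x in Enumerable.Range(1, 10000).AsParallel()
            select total += x).ToList();

// On a multi-core machine, total may or may not equal 50005000.
```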

Note also that the declarative LINQ programming style ensures that developers specify what they want done, rather than how it should be done. This leaves PLINQ free to ensure that concurrent LINQ queries run in the safest manner possible. If LINQ had been defined more strictly, such that it had to process each element in a certain order, then the PLINQ team would have had a much more difficult task.

The code in both these queries pulls out only the records from the XML file whose date is set to April 15, 1973. Because of deferred execution, the query would do nothing if I did not call ToList(). As a result, I added that call and converted the result into a List<WeatherData>. Though hardly earthshaking in import, this call ensures that the code actually does something, and thus gives PLINQ scope to take advantage of the multiple processors on your system.
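If the role of deferred execution here is unclear, this sketch separates the two steps, using the same doc variable as the samples above:

```csharp
// Defining the query does no work at all; no elements are read
// and no worker threads are started.
var q = from x in doc.Root.Elements("Sample").AsParallel()
        where x.Element("Year").Value == "1973"
        select x;

// Only when the query is enumerated -- here, by ToList -- does
// PLINQ read the elements and fan the work out across threads.
var results = q.ToList();
```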

Simple timers are created to measure the difference between the standard LINQ query and the PLINQ query. I've also borrowed a method used in many of the Parallel LINQ team's samples for displaying the time elapsed during a test run:

private static void RunTest()
{
    XDocument doc = XDocument.Load("XMLFile1.xml");

    Stopwatch sw = new Stopwatch();

    sw.Start();
    LinqOrdinarie(doc);
    sw.Stop();
    ShowElapsedTime("Linear", sw.Elapsed);

    sw.Reset();

    sw.Start();
    ParallelLinq(doc);
    sw.Stop();
    ShowElapsedTime("Parallel", sw.Elapsed);
}

private static TimeSpan ShowElapsedTime(string caption, TimeSpan ts)
{
    string elapsedTime = String.Format("{0}: {1:00}:{2:00}:{3:00}.{4:00}",
        caption, ts.Hours, ts.Minutes, ts.Seconds,
        ts.Milliseconds / 10);
    Console.WriteLine(elapsedTime);
    return ts;
}

At least with the pre-release version of PLINQ that I've played with, I've found it very useful to set up timers to confirm that PLINQ is actually able to speed up an operation. My record at guessing which code will benefit from running in parallel is not good, so I find it worthwhile to confirm the effectiveness of the code by explicitly measuring it. You can either use the simple Stopwatch class from the System.Diagnostics namespace, as shown here, or use a profiler. Note that a thread-aware profiler might ship with some versions of Visual Studio 2010.

I've found that the advantages of concurrent LINQ become more obvious the longer the timed operation lasts. As a result, I've placed the query inside a loop and added a variable to the program called NUM_REPS. By setting NUM_REPS to a large number, say 500, you can clearly see the benefits of running LINQ queries in parallel on multiple processors. Note that the first time PLINQ is used, its assembly must be loaded, the relevant types must be JIT-compiled, new threads must be spun up, and so on. As a result, many developers see improved performance only after they get past this initial warm-up time.
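One simple way to keep that warm-up cost out of a measurement is to run the parallel query once, untimed, before starting the stopwatch. This is only a sketch; ParallelLinq and ShowElapsedTime are the methods from the sample above:

```csharp
ParallelLinq(doc); // untimed warm-up: loads the PLINQ assembly,
                   // JIT-compiles the code, and spins up threads

Stopwatch sw = Stopwatch.StartNew(); // measure steady-state cost only
ParallelLinq(doc);
sw.Stop();
ShowElapsedTime("Parallel (warm)", sw.Elapsed);
```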

Though it is very easy to get started with PLINQ, there are still complexities inherent in the subject that you need to consider. For instance, PLINQ will sometimes develop a different partitioning scheme for your data depending on whether you are working with an Enumerable or an Array. To learn more about this subject, see the following post from the Parallel Programming team:

http://blogs.msdn.com/pfxteam/archive/2007/12/02/6558579.aspx

The simple PLINQ examples shown in this post should help you get started with this powerful and interesting technology. Parallel LINQ is still in its infancy, but it already provides a means of greatly simplifying tasks that are not normally easy to perform.

Comments (23)
  2. Adam says:

    I hope that the C# team will consider building this in behind the scenes.  It seems to me that adding the parallel functionality to IEnumerable, ToList etc would not only make upgrades easier but further simplify the transition for developers.

  3. James Wilkinson says:

    I second Adam’s comment – why not make it the default?

    If your machine only has one processor, it’ll be a linear operation, but on multi-core or multi-cpu, you’ll get the benefits without having to write code to specifically target it.

    Are there any gotchas that means it couldn’t be the default? (other than the ParallelQuery<> return type..)

  4. Anders Borum says:

    Hi Charlie,

    an interesting article (as always). At the PDC 2008, I had a nice talk with the PM of the PLINQ team (who gave a really great presentation on the technology – of which most had already been made presented before, but still, it was nice to get the full overview and thoughts).

    When we look at the typical examples today for parallel LINQ, most use fairly long working sets (including those presented in this post). The common scenarios I face in the ECM industry(enterprise content management), the working sets are comprised of fewer items – however, having each item require more complex processing (thus, still a canditate for parallelization).

    I think it’s important to communicate that parallelization may be of great use, even though you’re facing smaller working sets of, say, 16 or 32 items. I overhead a few conversations at the PDC where other developers were unsure whether PLINQ would make a difference.

    So, I guess the best answer is: "measure measure measure" (stopwatch approach or profiling – of which the latter usually yields interesting results because the hotspots are not always where you’d think).

    With all the cores available, it would be really nice to see the PLINQ library have a lightning fast start-up, so that smaller and smaller working sets (with low processing requirements) can be run in parallel.

    Perhaps a future CLR (version 5.0) would be able to make assumptions like this and parallelize when it makes sense. This would remove the requirement of explicit using the .AsParallel() etc. operators (although I personally find it important that this part of the code is explicit).

  6. Juliano says:

    I hope this in behind

    I can’t believe that’s don’t work by default, have no sense.

  7. Marcel Popescu says:

    Any plans of adding stuff from the CCR (http://msdn.microsoft.com/en-us/library/bb905450.aspx) in this library?

  8. This statement is incorrect:

    Rather than returning IEnumerable<T>, this version of PLINQ returns IParallelEnumerable<int>:

    IParallelEnumerable<T> ^is^ an IEnumerable<T> as it derives from IEnumerable<T>.

  9. James Miles says:

    Hi Guys,

    Regarding;

    >“I hope that the C# team will consider building this in behind the scenes.  It seems to me that adding the parallel functionality to IEnumerable, ToList etc would not only make upgrades easier but further simplify the transition for developers.”

    >“I second Adam’s comment – why not make it the default?”

    There are so many exciting technologies targeting the parallel computing problem space it is not yet clear what the best approach or solution is going to be. Anders has said as much in a number of interviews/posts.

    PLINQ may turn out to be the solutions to all parallel programming problems, or they might just replace the CLR with a PCLR! 😉

    Cheers,

    James

  10. DanielMoth says:

    RE: "turn on by default" comment:

    If your query mutates shared state (which is easily done) then PLINQ by default would yield incorrect results. Same goes for ordering expectations (if you have any). So, right now you need to opt-in for good reasons – only you know if the semantics will be preserved while gaining the speedup…

    RE: CCR

    We are looking at bringing a similar programming model into a future version of .NET, but that will not happen in v4.

    Cheers

    Daniel

  11. ccalvert says:

    Thanks everyone for the great comments here.

    James is correct. Anders H. has indeed said that the team is looking at all of these technologies, and will probably incorporate some of them into C# proper, once there has been a chance to see what solutions work best in which circumstances. That might take time.

    Daniel Moth also raises good points about the issues surrounding mutable and immutable code and ordering expectations. It is not always possible to reliably detect these issues at compile time within a reasonable period of time, and that sets some restraints on our (and your) ability to detect which code is a good candidate for parallelism.

    The comments by Anders Borum about the importance of measuring your code to see if it is yielding the results you expect are very relevant. The startup time issue, as is often the case with .NET code, is one that needs to be watched. The team, of course, is looking at this issue carefully.

    Finally, remember that PLINQ sets the entry bar to writing concurrent code very low. This is just an introductory article to help you get started. As the comments shown here demonstrate, however, this is a complex subject that deserves careful study. How does André Gide’s old saying go: "Please don’t understand me too quickly…." Use this article to help you get started, but continue to explore this subject as it develops over the coming months.

  12. Macke says:

    Its pretty obvious why this isnt made default, really. For example.

    int i = 0;

    foreach(int x in list)

    x = i++;

    this simple code would screw up pretty much since the elements doesnt come in a given order.

  13. Hakke says:

    I am highly anticipating this. I can see this really improving our ability to handling multi-threaded collections. It is not like it is such a hard task without it but having much leaner code is always a good thing.


  15. Erik Cox says:

    It’s nice to see some examples of LINQ and PLINQ queries….I mean this was a very interesting reading for me…I mean I knew very little about LINQ..but after reading your post I got a lot clearer picture about what LINQ really is..

    Erik Cox

    http://www.notionsolutions.com

  16. Alexander says:

    This looks the same as I thought of and build somewhat more than 4 years ago,.. off course not with LINQ. Nice to know that my ideas are shared by Microsoft 🙂

  17. RussellH says:

    I have to confess that I don’t get how PLINQ could lead to much anything but chaos if not used very carefully.

    The trouble, I as see it, is that PLINQ appears to assume that the process running PLINQ is the only one on the box.  It tries to monopolize all the processors on the box.  Is that true?  That would be bad enough on a client application, but it would be disastrous on a sever application such as an ASP.NET or SQL Server application.

    My experience with modern desktop systems is that a user may have two or or even three monitors hooked up and may frequently have more than one application doing work at once.  At any given time there may be two or three runnable threads that can be scheduled on a processor, so it scales fairly well.  With PLINQ you have the possibility that each application will try to start as many threads as there are processors, and cause all kinds of overhead with task switching and scheduling.  Performance is likely to be worse if the technology is not applied carefully.

    In the case of sever code, I just can’t see how PLINQ makes any sense at all unless the number of users of the application or web service is very small (like one user at a time).  IIS and ASP.NET are already written to schedule multiple threads (in thread pools) to do processing in a way that uses available resources as fully as possible, while still attempting to be "fair" to any given client.  The same goes for SQL Server or any multiple user database system.  As someone who has written a fair amount of server code, I can say that doing this kind of thing right is hard.  If you throw in a process that wants to start creating its own threads (PLINQ), you are likely to kill performance and scalability.

    Maybe I don’t understand enough about PLINQ yet, but it seems an absolute requirement that PLINQ be very aware of the environment the process is running it.  It will need to know how many other processes are currently running on a system.  It will need to know if it running on a server or client process.

    The technology looks interesting, but I think Microsoft needs to be very careful about putting the barrier of entry too low.

  18. Jason says:

    My thoughts:

    Due to the aforementioned issues (i.e. parallelism changing order of output), it should be necessary to add explicitly.  

    However, it certainly doesn’t preclude a CodeRush or similar IDE enhancement from automatically listing use of LINQ (with reminders if order is missing) and making a conversion to .NET40 with parallelism much easier (i.e. similar to refactoring).

    [re: RusselH]

    I think you are overconcerned about the addition of an a low barrier to add parallelism to applications.  Many applications have more than one thread and any available processor can handle a thread (i.e. the worries you have presented already exist).  All PLINQ does is allow .NET to do opportunistic multi-threading (albeit explicitly I assume).  The workload on a system doesn’t change because you add the possibility of doing the work more efficiently.  Sure the workload will be spread over more processors at once, but other processes are already doing this by their existence (i.e. having more processes than processors already creates the scenario you are worried about, notwithstanding many of them being multithreaded)

    As far as having a lot of threads affecting user experience on the client, this issue already exists, and is already handled by:

    1.  Tuning already in place for client and server machines; i.e. dynamic thread priority giving foreground windows applications higher priority on client machines.  

    2.  Thread priority in general.  

    That said, the multi-processor environment along with many applications using more parallelism could increase the need for adding better UI for tuning in the highly parallelized computing world (i.e. slider controls on Task Manager for adjusting priority; or an admustment on the window ctrl-menu to adjust it’s priority; or new system wide options for allowing QoS metrics for processors).  

    I think PLINQ is the wrong place to address concerns which are more global to multi-processor / multi-threaded in general.  However, should advanced tuning metrics make their way into the OS in general, then PLINQ should gain/provide exposure to them as well.

    My 2 cents.

    Jason L

  19. ccalvert says:

    A couple quick clarifications:

    * When you call AsParallel(), you can pass in the number of processors that you want to use. That is, you need not use all the processors on a particular system.

    * Secondly, this is but one of several parallel programming initiatives that Microsoft is working on at this time. I think PLINQ is great because it is so easy to use, but it can be thought of as one in a series of tools that Microsoft is developing in the parallel programming arena. I say this not because I’m backing off of my support for PLINQ, which I think is great, but just to be sure that everyone understands that this is not the company’s only effort in this area. To learn more, visit the concurrency site, found as the first link near the top of this post.


  21. Jason says:

    @Nicholas Paldino:

    I don’t see how the statement mentioning PLINQ returns back a different interface is false. Look at the type of the implicit var: what is it? Answer: IParallelEnumerable<T>. The statement was not about polymorphism or contravariance in .Net. The statement didn’t say you couldn’t pass IParallelEnumerable<T> to other APIs expecting IEnumerable<T>.

    On the contrary, it’s very important that you know the type is now IParallelEnumerable<T> because if you do pass that type to other APIs or you reuse that type in another query, do know that PLINQ WILL BE USED IN THOSE QUERIES.

    C# MVP? really?

  22. Mikael says:

    @Nicholas Paldino

    Incorrect? In no way. Even though IParallelEnumerable<T> inherits from IEnumerable<T> (and, in essense, is IEnumerable<T>) there is nothing incorrect in pointing out that it actually returns IParallelEnumerable<T> and not _just_ IEnumerable<T>.

    What is interesting here is the most derived interface, because the extension methods used by the compiler are hooked into that.

    I would hope that someone who puts ".NET/C# MVP" after his name would understand this. As a simpler example; someone ordering a Coors Light would not be served in exactly the same way as someone ordering just a beer, even though a Coors Light is a beer (though I can see an argument being made against that).

  23. Grant Rettke says:

    Great post!

    Thanks Charlie.

Comments are closed.
