CLR 4.0 ThreadPool Improvements: Part 1


This is the first in a series of posts about the improvements we are making to the CLR thread pool for CLR 4.0 (which will ship with Visual Studio 2010).  This post will cover changes to the queuing infrastructure in the thread pool, which aim to enable high-performance fine-grained parallelism via the Task Parallel Library.  Future posts will cover the “thread injection” algorithm, and any other topics that readers would like to see.

Please note that all the usual caveats apply here: I’m discussing pre-release software, and all details are subject to change before final release.  In fact, one goal of this post is to solicit feedback, so we can know what changes we need to make before we ship.

A thread pool basically has two functions:  It maintains a queue (or queues) of work to be done, and a collection of threads which execute work from the queue(s).  So designing a thread pool really comes down to a) finding ways to enqueue and dequeue work items very quickly (to keep the overhead of using the thread pool to a minimum) and b) developing an algorithm for choosing an optimal number of threads to service the queues.  This post will cover part (a).

In all prior releases of the CLR, the thread pool exposed a single way to queue work: ThreadPool.QueueUserWorkItem (which I will abbreviate as QUWI from here on out).  There are a couple of overloads of this method, and also a version called UnsafeQueueUserWorkItem, but these all amount to basically the same thing: you give us a delegate, and we stick it on a queue for later execution.
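
For readers who haven’t used it, here is a minimal sketch of that existing pattern (the names and the Sleep-based wait are purely illustrative – QUWI gives you no handle to the queued work):

    using System;
    using System.Threading;

    class QuwiExample
    {
        static void Main()
        {
            // Hand the pool a delegate; it goes onto the queue for later execution.
            ThreadPool.QueueUserWorkItem(
                state => Console.WriteLine("Running on a pool thread: {0}", state),
                42);

            // QUWI returns immediately and tells us nothing about the work item,
            // so this sketch just sleeps; a real program needs its own signaling.
            Thread.Sleep(1000);
        }
    }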

(Really there are even more ways to queue work.  As mentioned in my previous post we really have two pools of threads – the “worker threads” and the “I/O threads.”  Work is queued to the I/O threads in response to the completion of asynchronous I/O, or manually via ThreadPool.UnsafeQueueNativeOverlapped.  We currently do not plan any significant changes to the I/O pool for CLR 4.0, as our focus for this release is on enabling fine-grained computational parallelism.  For the remainder of this post, we will only discuss the mechanisms behind the “worker threads.”)

QUWI conveys basically zero information about each work item, aside from the fact that it exists.  This places some important constraints on the execution of these items.  For example, the thread pool does not know whether individual work items are related or not, so it has to assume they are all completely independent, which means we cannot reorder work to optimize its execution: independent work items typically must be executed in FIFO order to ensure fairness.  (Imagine each work item represents a request from a user – you would not want to keep earlier requests waiting while later requests are processed, as this would result in unacceptably long latency for the users who made their requests first.)

This means that we are basically forced to use a single FIFO queue for all work queued via QUWI.  In prior versions of the CLR, this queue was a simple linked list, protected by a Monitor lock.  This incurs some overhead: we must allocate nodes for the list (and pay the cost of the GC having to traverse the list each time a GC occurs), and we must pay the cost of acquiring that lock every time we enqueue or dequeue a work item. 

(Another aside: please do not take the above to mean that we make any guarantees about strict FIFO ordering of work item execution.  In fact, we violate this “rule” already: since .NET 3.5, the CLR thread pool has maintained separate FIFO queues for each AppDomain in the process, and an additional independent FIFO queue for “native” work items such as those queued by a host (ASP.net being the prime user of this feature).  We round-robin between these work queues, allowing each to execute work for some time before moving on to the next.

This strategy is motivated by performance concerns, as it greatly reduces the number of transitions between AppDomains, which are fairly expensive.  But it is still designed to maintain fairness, the one concern we can never completely abandon for QUWI.  We may deviate further from the strict FIFO model in future releases, but we are unlikely to ever make QUWI truly unfair – even though, as we will see, giving up fairness is crucial to achieving good performance for fine-grained workloads.)

This was fine for the kinds of workloads for which the thread pool was originally designed. These are relatively “coarse” workloads, where each work item represents a fairly large amount of work.  The canonical example is an ASP.net web application, where each work item represents the generation of an entire web page.  In such workloads, the work itself takes long enough that the overhead of allocating queue nodes and acquiring locks is barely noticeable.

However, in the new world of machines with rapidly increasing core counts, there is increased interest in more “fine grained” work.  Where before the job of the thread pool was to take a large number of independent, coarse-grained tasks and funnel them onto a few threads, we are increasingly being asked to execute many very small tasks representing tiny pieces of some larger operation.

Anyone who has tried executing such a workload on the existing CLR thread pool has probably found that it’s not a simple matter of calling QUWI for each piece of the calculation; with such tiny work items, the overhead of enqueuing and dequeuing the work can be much greater than the work itself, resulting in slower execution than if we had just run the work on a single thread to begin with!  It is possible to make this work, by “batching” work into a smaller number of calls to QUWI.  There are many strategies for this, all of which are fairly complex in the general case.  We would like to make this easy, but the current QUWI is insufficient for this goal.
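
As a rough sketch of that batching idea (the batch size, the TinyPieceOfWork method, and the completion counting here are illustrative assumptions, not a recommended pattern):

    using System;
    using System.Threading;

    class BatchingSketch
    {
        const int ItemCount = 1000000;
        const int BatchSize = 10000;

        static void Main()
        {
            int pendingBatches = (ItemCount + BatchSize - 1) / BatchSize;
            using (var allDone = new ManualResetEvent(false))
            {
                for (int start = 0; start < ItemCount; start += BatchSize)
                {
                    int s = start;   // capture this batch's start index
                    ThreadPool.QueueUserWorkItem(_ =>
                    {
                        // One queued work item runs a whole batch of tiny work items,
                        // amortizing the enqueue/dequeue overhead across the batch.
                        int end = Math.Min(s + BatchSize, ItemCount);
                        for (int i = s; i < end; i++)
                            TinyPieceOfWork(i);

                        if (Interlocked.Decrement(ref pendingBatches) == 0)
                            allDone.Set();
                    });
                }
                allDone.WaitOne();
            }
        }

        static void TinyPieceOfWork(int i) { /* a few arithmetic operations, say */ }
    }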

We can improve this situation in a couple of ways:  we can implement a more efficient FIFO queue, and we can enhance the API to allow the user to give us more information, allowing us to turn to even more efficient queuing strategies.  For CLR 4.0, we are doing both of these.

Faster FIFO

Recall that the overhead of the existing FIFO queue comes from the expense of allocating and traversing the data structure, and the cost of acquiring the lock on each enqueue and dequeue operation.  For 4.0, we are switching to a lock-free data structure with much lower synchronization overhead.  More importantly, this new queue is much friendlier to the GC; we still need to allocate a new object for each call to QUWI, but these objects are smaller, and are tracked in large “chunks” which are much easier for the GC to traverse than the simple linked list used previously.  This new queue is virtually identical to System.Collections.Concurrent.ConcurrentQueue<T>, which is also new in 4.0.
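
ConcurrentQueue<T> exposes the same kind of lock-free FIFO directly; a small usage sketch (the producer/consumer arrangement is just for illustration):

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    class ConcurrentQueueSketch
    {
        static void Main()
        {
            var queue = new ConcurrentQueue<int>();

            // Several threads can enqueue concurrently without a single shared Monitor lock.
            for (int t = 0; t < 4; t++)
            {
                int offset = t * 1000;
                new Thread(() =>
                {
                    for (int i = 0; i < 1000; i++)
                        queue.Enqueue(offset + i);
                }).Start();
            }

            // TryDequeue never blocks; it simply reports whether an item was available.
            int item, dequeued = 0;
            while (dequeued < 4000)
            {
                if (queue.TryDequeue(out item))
                    dequeued++;
            }

            Console.WriteLine("Dequeued {0} items", dequeued);
        }
    }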

Improving the performance of QUWI is nice, as it benefits existing applications which use the thread pool without requiring any changes to the application code.  How much of a speedup you can expect will depend greatly on many factors, including your application’s workload and the details of the particular hardware on which it executes, but for fine-grained workloads on multi-core hardware the speedup should be significant.

However, we are still restricted in what we can do here – we still have very little information about the work we’re executing, and so we still need to use the same basic strategy to execute it.  We can trim overhead here and there, but QUWI will probably never be a great way to execute very fine-grained workloads.  We need a new API.

The Task Parallel Library

The Task Parallel Library (TPL) is a collection of new classes specifically designed to make it easier and more efficient to execute very fine-grained parallel workloads on modern hardware.  TPL has been available separately as a CTP for some time now, and was included in the Visual Studio 2010 CTP, but in those releases it was built on its own dedicated work scheduler.  For Beta 1 of CLR 4.0, the default scheduler for TPL will be the CLR thread pool, which allows TPL-style workloads to “play nice” with existing, QUWI-based code, and allows us to reuse much of the underlying technology in the thread pool – in particular, the thread-injection algorithm, which we will discuss in a future post.

I won’t discuss all of the details of the TPL API, which are better covered by its authors.  From the point of view of the performance of the thread pool, the important thing about TPL is that it is a much richer API than QUWI, giving the thread pool much more information about the work being executed.  In particular, the new Task type exposes the notion of parent/child relationships, giving us some idea of the structure of the overall computation being performed by the individual work items.  Having this information opens up possibilities for much more efficient execution of these tasks.

Even without parent/child relationships, Task is a major improvement over QUWI.  QUWI returns nothing of use to the caller; it simply queues a delegate, and leaves it up to the implementation of that delegate to coordinate its activities with the rest of the application.  QUWI provides no means of waiting for the completion of the work item, for handling exceptions, or getting the result of a computation.  Task provides all of this in a very easy-to-use form, while adding very little overhead vs. QUWI. 
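
A quick, hedged sketch of what that looks like in practice (the computation is arbitrary):

    using System;
    using System.Threading.Tasks;

    class TaskBasicsSketch
    {
        static void Main()
        {
            // Unlike QUWI, this returns an object we can wait on and query.
            Task<long> sum = Task.Factory.StartNew(() =>
            {
                long total = 0;
                for (int i = 0; i < 1000000; i++) total += i;
                return total;
            });

            try
            {
                // Result waits for the task to complete, if it hasn't already.
                Console.WriteLine("Sum = {0}", sum.Result);
            }
            catch (AggregateException ae)
            {
                // Exceptions thrown inside the task surface here, at the wait.
                foreach (var e in ae.InnerExceptions)
                    Console.WriteLine(e.Message);
            }
        }
    }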

The fact that Task has a Wait method is not just a convenience; it eliminates one of the most common problems people face when using QUWI.  It is fairly common for one work item to need to wait for the execution of another work item to complete.  If the second work item has not yet begun executing, it will be sitting in the queue waiting for a worker thread to pick it up.  It is possible that there are no available worker threads – maybe they’re all waiting for other work items to complete!  This can cause deadlock in the worst case, and very slow execution in the best, as the thread pool may be slow to add more worker threads to pick up these work items.  Task.Wait, on the other hand, knows it’s waiting for another task, and is tightly integrated with the thread pool such that it is able to determine whether the task has started executing, and if not it executes it immediately, in-line on the current thread.  This greatly improves performance and eliminates the possibility of deadlock in this situation.
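
To make the contrast concrete, here is a rough sketch of the two patterns (DoInnerWork is a placeholder; the QUWI version is exactly the kind of code that can stall when the pool is saturated):

    using System.Threading;
    using System.Threading.Tasks;

    static class WaitingSketch
    {
        // QUWI style: block on an event until the other work item signals completion.
        // If every worker thread is doing this, nothing is left to run the inner item.
        static void QuwiStyleWait()
        {
            using (var done = new ManualResetEvent(false))
            {
                ThreadPool.QueueUserWorkItem(_ => { DoInnerWork(); done.Set(); });
                done.WaitOne();
            }
        }

        // Task style: Wait knows what it is waiting for, and can run the task inline
        // on the current thread if it has not started yet.
        static void TaskStyleWait()
        {
            Task inner = Task.Factory.StartNew(DoInnerWork);
            inner.Wait();
        }

        static void DoInnerWork() { /* ... */ }
    }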

For new code, Task is now the preferred way to queue work to the thread pool.

Top-level Tasks have no parent.  These are Tasks created by non-thread-pool threads, or with certain options specified at Task-creation time.  These tasks are queued to the same FIFO queue we use for QUWI, and thus benefit from the improvements we’ve made there – but they are also subject to the same limitations.  Tasks queued in this way are simply a better QUWI – but now the fun starts:  A parent task can create child tasks.  This happens whenever a Task creates another Task (unless it overrides this behavior).  These children are implicitly treated as sub-tasks of the larger task.  We assume that sub-tasks can be executed in any order – fairness is not necessary – because all that matters is that the overall operation be completed as fast as possible.  This lets us throw those FIFO restrictions out the window, and opens up the possibility for much more efficient work scheduling strategies.
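
Here is a hedged sketch of that parent/child shape (shown with TaskCreationOptions.AttachedToParent to make the relationship explicit in code; as described above, the pre-release bits treat tasks created inside another task as children by default):

    using System;
    using System.Threading.Tasks;

    class ParentChildSketch
    {
        static void Main()
        {
            Task parent = Task.Factory.StartNew(() =>
            {
                for (int i = 0; i < 4; i++)
                {
                    int n = i;   // capture the loop variable for the child
                    Task.Factory.StartNew(
                        () => Console.WriteLine("child {0}", n),
                        TaskCreationOptions.AttachedToParent);
                    // In the scheme described here, these children are eligible for the
                    // per-thread queues and can be scheduled without FIFO fairness.
                }
            });

            // Waiting on the parent implicitly waits for all of its attached children.
            parent.Wait();
        }
    }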

Work Stealing

Since a child task is just a piece of a larger task, we don’t need to worry about execution order.  We just need to execute these things quickly.  One well-known strategy for fast execution of unordered work items is “work stealing.”  Joe Duffy and Daniel Moth explain this very well; click on the links if you’re interested.

The most important aspect of work-stealing is that it enables very fast enqueue and dequeue in the typical case, often requiring no synchronization at all.  This virtually eliminates a large part of the overhead of QUWI, when working with child tasks.  We still do need to allocate memory for the Task itself, and for the work-stealing queue, but like the improvements to the FIFO queue these data structures have been optimized for good GC performance.  Parent tasks are fast; child tasks are much faster.
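
For intuition, here is a deliberately simplified sketch of the data structure behind work stealing.  The real implementation is lock-free on the owner’s push/pop path; this version just uses a lock so the shape of the algorithm is easy to see, and all names are illustrative:

    using System.Collections.Generic;

    class SimpleWorkStealingDeque<T>
    {
        private readonly LinkedList<T> _items = new LinkedList<T>();
        private readonly object _lock = new object();

        // Called only by the owning thread: push at the "bottom" of the deque.
        public void LocalPush(T item)
        {
            lock (_lock) { _items.AddLast(item); }
        }

        // Called only by the owning thread: pop the most recently pushed item
        // (LIFO order keeps recently touched data warm in the cache).
        public bool LocalPop(out T item)
        {
            lock (_lock)
            {
                if (_items.Count == 0) { item = default(T); return false; }
                item = _items.Last.Value;
                _items.RemoveLast();
                return true;
            }
        }

        // Called by other threads: steal the oldest item from the opposite end,
        // which minimizes contention with the owner.
        public bool TrySteal(out T item)
        {
            lock (_lock)
            {
                if (_items.Count == 0) { item = default(T); return false; }
                item = _items.First.Value;
                _items.RemoveFirst();
                return true;
            }
        }
    }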

There are still some limitations to how quickly tasks can be executed.  If all tasks are top-level (non-child) tasks, they are subject to the FIFO ordering constraints of QUWI (albeit with much richer functionality).  And even with work-stealing, we need to allocate and queue Task instances for every work item.  To get even better performance, we need even more information about the work.  Which brings us to…

Parallel.For and PLINQ

While not, strictly speaking, features of the CLR thread pool, the methods of the new Parallel class and PLINQ are critical new features of the public concurrency APIs in CLR 4.0.  In fine-grained parallel applications, it is very common to need to execute the same code, over and over, with different data inputs.  With QUWI or Task, this means allocating and queuing a separate work item for each input.  The thread pool infrastructure does not know that all of these work items do the same thing, so it has to queue, dequeue, and execute each of them individually, as if they were completely unrelated tasks.

Parallel.For, Parallel.ForEach, and PLINQ provide a better way.  These are essentially different ways of expressing the same thing: here is some code that needs to execute N times, as quickly as possible.

Just as with the parent/child relationships that Task provides, this extra information enables more aggressive optimization.  These “parallel loops” do not need to be broken down into separate work items for each loop iteration.  All that is needed is to break them into enough chunks (“partitions”) that they can be efficiently load-balanced across all available machine resources.  A typical scenario might be that 1,000,000 iterations need to be broken into, say, four work items. 
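
A small sketch of the two forms (the loop body is arbitrary; the framework, not this code, decides how the million iterations are partitioned):

    using System;
    using System.Linq;
    using System.Threading.Tasks;

    class ParallelLoopSketch
    {
        static void Main()
        {
            double[] results = new double[1000000];

            // Parallel.For: one call expresses "run this body for every index";
            // the range is partitioned into a handful of chunks, not a million tasks.
            Parallel.For(0, results.Length, i =>
            {
                results[i] = Math.Sqrt(i) * Math.Sin(i);
            });

            // PLINQ: the same "same code, many inputs" idea expressed as a query.
            double total = results.AsParallel().Sum();

            Console.WriteLine(total);
        }
    }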

There is, of course, some overhead introduced by the need to dynamically partition the data (done automatically by the framework).  But this pales in comparison to the savings of not having to allocate, queue, and dequeue millions (or more!) of work items.  In a test I just tried, for a particular workload on one of my machines, Parallel.For executed more than 300 times as fast as the equivalent naïve usage of QUWI.

Where do we go from here?

By now you’ve probably got the general theme: the more information we have about a workload, the faster we are likely to be able to execute it.  I expect this theme to continue in future releases.  The challenge is to find new ways to easily express (or automatically extract) useful information about parallel workloads, which can be used by the thread pool (or higher-level abstractions like PLINQ) to enable more optimizations.  I think this is going to be an interesting space to watch for quite some time.

In the meantime, please try out the new mechanisms we are providing in Visual Studio 2010 Beta 1.  Try them with the kinds of workloads you expect to use in production; one of the biggest challenges we face is that we can really only guess at what developers will do with this stuff in the real world, so feedback from our customers is extremely important to ensure that these new mechanisms will meet your needs in the final product.

Feel free to post any questions you may have in the comments for this post; I’ll try to answer what I can.  My next post will cover the thread pool’s “thread injection” algorithm, which is how we determine how many threads should be servicing the various queues I’ve discussed here.  If there’s something else you’d like me to cover, please let me know.

Comments (28)

  1. bmeardon says:

    Eric,

    This is really fantastic and useful information.  I did read your reply on the PFX forum in regards to using Parallel.For instead of repeated calls to QUWI, but I don’t think that will work.  Let me describe the situation a bit more…

    I’m trying to process a very high number of messages per second that arrive through a socket.  As each message is received, I need to invoke a callback (passing it the message).  The callback needs to be called in the exact order the messages are received.  I currently accomplish this by using a FifoExecutor class (Stephen Toub’s implementation), which queues the work items up, grabs a thread from the pool (using UQUWI) and sits in a loop dequeuing each work item, executing the callback, and then repeats until the queue is empty.  This is clearly quite fraught with a lot of enqueuing and dequeuing.  Do you have any suggestions on how I may be able to leverage the new CLR 4.0 ThreadPool and TPL in any way to improve this scenario’s performance?

    Thanks,

    Brandon

  2. Brandon,

    It definitely sounds like you may benefit from the improvements in QUWI.  It may well be worth trying the naive QUWI solution, which almost certainly will outperform previous CLR versions.

    Otherwise, given your strict FIFO ordering requirements, there may be little that can be done.  Optimization often requires reordering.  If you are able to deviate a little from the FIFO goal, you may be able to take advantage of the additional performance unlocked by child tasks, or maybe even Parallel.ForEach.  I, personally, would try QUWI first, given that this seems to require the least change in your existing code.  One of our goals for 4.0 is to provide significant performance improvements even without changes in existing application code.

    I am very interested in hearing about the results.  Once 4.0 Beta 1 has shipped, I hope you will have a chance to try it out.  Please let me know how this turns out.

  3. bmeardon says:

    Eric,

    I figured that would be the case.  Sounds like the queuing operations in the new ThreadPool will be ultra fast with the use of a lock-free queue data structure (presumably using Interlocked methods).  It’s also nice to hear that this lock-free queue will be publicly available through the TPL’s ConcurrentQueue<T> class.

    Given the information you provided about the huge performance increase attainable in the new TPL APIs by removing ordering requirements on the work items, we’ll probably look for some ways to take advantage of this by looking to parallelize our work as early as possible.

    One thing that never really got answered from my original post – Is there a way to control how Task objects are created using the new TaskFactory class?  If we decide to use the new TPL APIs, it would be nice to be able to return a Task from a pre-allocated pool of our own somehow and get rid of this allocation (I know we’ll still have the queue node allocation from the ThreadPool though).

    I’m looking forward to the 4.0 Beta 1 release.  I won’t waste any time in testing out our performance with the new ThreadPool, and will be sure to let you know how it goes.  Thanks again for your help, Eric.

    Regards,

    Brandon

  4. Jordan says:

    Eric – Can you give us a clue as to when VS 2010 Beta 1 will be available?

  5. Brandon: Task objects are not currently re-usable.  I believe this is partly due to the complications it would add to the programming model, but it also has important performance implications – though not what you might think!  References to Task objects must be stored in many different places throughout a Task’s lifetime, including the ThreadPool’s queues.  Many of these data structures are lock-free, as you know.  Lock-free manipulation of object references is subject to the "ABA" problem (http://en.wikipedia.org/wiki/ABA_problem).  This problem turns out to be a non-problem in a garbage-collected environment, so long as you never reuse objects – but if you do reuse them, then you need to deal with it.  There are techniques for dealing with this, but they inevitably require additional expensive synchronization (and other overhead).  So while we do incur some overhead for allocating Task objects, we avoid that synchronization overhead; in the end, it usually works out for the better to not reuse these objects.
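
    Here’s a rough illustration (not our actual code – just a classic Treiber-style stack sketch) of why the CAS-based pop is only safe when nodes are never reused:

        using System.Threading;

        class LockFreeStack<T>
        {
            private class Node { public T Value; public Node Next; }
            private Node _head;

            public void Push(T value)
            {
                var node = new Node { Value = value };   // a fresh node every time
                Node oldHead;
                do
                {
                    oldHead = _head;
                    node.Next = oldHead;
                } while (Interlocked.CompareExchange(ref _head, node, oldHead) != oldHead);
            }

            public bool TryPop(out T value)
            {
                Node oldHead;
                do
                {
                    oldHead = _head;
                    if (oldHead == null) { value = default(T); return false; }
                    // If popped nodes were recycled, _head could be popped, reused, and
                    // pushed again between the read above and the CAS below; the CAS would
                    // still succeed, but oldHead.Next would be stale -- the ABA problem.
                } while (Interlocked.CompareExchange(ref _head, oldHead.Next, oldHead) != oldHead);
                value = oldHead.Value;
                return true;
            }
        }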

    Jordan: I’m sorry, but I don’t have any information for you about the Beta 1 release date.  I certainly do wish I did.

  6. bmeardon says:

    Eric,

    Thanks again for the insightful information.  It’s also interesting to know that the ThreadPool actually uses Task objects now as its "work item".

    Thanks,

    Brandon

  7. Michael Cederberg says:

    When you cover the thread injection algorithm, can you please talk about how it handles scenarios with a high number of tasks scheduled concurrently?

    Our setup is that we have a service that continuously does calculations.  If I use the .NET 3.5 thread pool to schedule those calculations, then I very quickly end up with 1000 threads (there will continuously be around 1000 work items scheduled).  If I try to reduce the number of threads (using ThreadPool.SetMaxThreads) to roughly the number of cores in my system, then I run the risk of deadlocks.  To get around this, we currently use a custom scheduler that schedules tasks on IO completion port threads, but this is not ideal either.

  8. MDFD says:

    Eric,

    I have a few questions please:

    1. What’s the limit of Parallel.For loops? I mean, how many iterations can I loop simultaneously? Does it depend on the amount of RAM, CPU?

    2. If my issue is to run 1,000,000 threads in parallel, should I use Parallel.For? or TaskFactory would be better?

    Thanks a lot,

  12. MDFD:  Parallel.For is limited by the size of the index variable (i.e., an int can only count to 2^31), but otherwise is not limited (as far as I know) to some number of iterations.  If you have 1,000,000 pieces of work to do, it is likely that Parallel.For is what you want.  This will not create 1,000,000 threads, but rather will automatically partition the work into manageable chunks which are queued as just a few work items to the thread pool.

    There is also no reason why you cannot queue 1,000,000 Task objects to the thread pool, though it will use considerably more memory, and will have much higher CPU overhead, than Parallel.For – because we will need to allocate, queue, and dequeue each of those million Tasks.  But if you have a million completely different tasks (which seems unlikely) then Task is the way to go – Parallel.For is generally only useful if you want to run the same code against many individual pieces of data.

  13. miridfd@gmail.com says:

    Thanks for your reply!

    I ran the same code against only (!) 1000 pieces of data, but the CPU reached 100% usage and got stuck.  The whole process was blocked, and never terminated.  (I’ve made several tests, even with fewer loops – e.g. 500; sometimes it got stuck, sometimes not.)

    I must note that the computation in the code (for each thread) is not complicated at all.

    Computer’s configuration:

    OS Vista

    Processor: Intel(R) Pentium(R) D CPU 3.40GHz 3.39GHz

    Ram:         4.00 GB

    System type:32-bit OS

    What do you think is the reason? When I ran the code sequentially (by regular "for" statement), everything was all right.

    Thank you very much

  14. miridfd: I’d need to understand better what exactly you’re trying to do, before I could say anything about why it’s not behaving the way you expect.  The best thing would probably be to post a sample that exhibits this behavior on the TPL discussion forum:  http://social.msdn.microsoft.com/Forums/en-US/parallelextensions.

  16. bmeardon says:

    Eric,

    You mentioned that the new queue used by the ThreadPool is more "GC friendly" by allocating smaller nodes in blocks.  I was curious about this, as I’ve been experimenting a bit with my own lock-free stack (since it’s a bit easier than a queue).

    I’ve been trying to completely avoid allocating any new nodes on push operations, so I have my stack maintain a pool of pre-allocated nodes.  This pool is basically implemented as a separate stack of nodes with a separate top reference.  Each push operation gets a node from the pool and each pop operation returns a node.  The drawback is that I need to perform the same type of CAS operation to get/return nodes from/to the node pool to ensure that another thread hasn’t swooped in and changed my top node reference.  This results in an algorithm that under purely sequential conditions is almost twice as slow as allocating a new node for each push.  Under concurrent situations with high contention on the stack, it is also more likely to encounter situations where other threads have changed either the node top reference or the item top reference.  I imagine these are the situations you were referring to regarding the performance implications of reusing Task objects.

    I’d like to get rid of the node pooling in my implementation to avoid the double CAS looping, but I don’t want to just allocate a node object on the managed heap for each call to push.  Nodes allocated in this fashion that stay in the stack for any lengthy period of time stand a good chance of getting promoted up GC generations, causing large GC interference.  So I’m curious as to what you guys are doing to make your nodes in the ThreadPool queues so GC friendly, and whether the new TPL ConcurrentQueue/Stack will make use of this GC-friendly node allocation mechanism.

    Thanks again for all of your help.

    Regards,

    Brandon

  17. Brandon,

    The main thing we did to improve the GC-friendliness of the new queues was to reduce the size of the nodes.  They weren’t particularly large to begin with (just a few object references), but that meant that removing just a couple of those references could make a huge difference.  Smaller objects mean less GC pressure, so the GC doesn’t have to run as often.  This helps with the problem you mention: the less frequently the GC runs, the fewer objects need to be promoted.

    We also found that arrays of objects seem to be scanned much faster than linked lists.  This probably has a lot to do with data locality – a pass through a bunch of contiguous references is going to make much better use of the CPU’s cache than hopping all over the heap visiting lots of individual nodes.

    My advice for approaching this sort of optimization problem is to follow a methodology something like this:

    1) Implement something simple.  Simple things usually perform better, all else being equal.

    2) Construct some test cases you believe are representative of real-world usage.  Overly simplistic tests often lead you to optimize only for those cases, which can sometimes actually lead to worse performance in the field.

    3) Run your tests under a sampling profiler.  The VM functions involved in garbage collection are clearly named, so you can pretty easily tell which phase of the GC is taking the most time.  I was surprised to learn that in my tests we were spending much more time in the "mark" phase than I would have thought, which led me to consider locality a little more carefully.

    4) Make some changes, and try again.

    Anyway, that’s how I did the thread pool work.  Nearly every line of code is influenced in some way by profiler feedback.  You do have to be very careful, though, not to optimize for the wrong cases.  Good test cases are key here, but it’s also important not to take the profiler output too literally.  Rather than saying "hey, method Foo takes 50% of the time, I should optimize the heck out of that," instead use the profiler to update your mental model of the problem and then reconsider the design from that point of view.  Often Foo took more time because of a bunch of little things happening elsewhere in the system, and all the hacking on Foo in the world won’t help.

  18. bmeardon says:

    Eric,

    Thanks for the feedback.  Since my last post, I tried a new design that seems to exhibit superior performance to my pooling approach.  It also seems to be very similar to what you’ve implemented in the ThreadPool queue.  I have basically taken the approach you’ve outlined and thought I’d present my design here in hopes that you may be able to point out some pitfalls you ran into (since the designs seem so similar).

    1) The first thing I do is allocate an array of references to my node class objects (which consist only of the item they contain and a reference to the next node).  I then create a new node object for each slot in the array.

    2) On a push operation, I first "allocate" a node from my array, which I accomplish by simply returning the node at the next index in the array (there’s a CAS on the next node index).  I then set my top node to the node allocated from the array using another CAS loop.

    3) In the event that I run out of nodes during a push operation (current index is past the length of the array), I allocate a new array and set of nodes.

    4) The pop operation is just a very simple return of the top node and a CAS loop to set it to the next node.  The node is not returned to any sort of a pool, but will remain referenced in the node array until the node array is discarded.

    5) When CAS operations fail on any of my operations, I fall back to a spin wait that increases in iterations to a max exponentially.

    I did a lot of performance testing on this version vs. using the BCL Stack<T> class wrapped in lock statements.  I set up scenarios on boxes with 2, 4, and 8 CPUs.  Basically, I would have N (one half the CPUs on the box) producer threads pushing items onto the stack and N (the other half of the CPUs) consumer threads popping items out.  I was surprised by just how much my setting of the number of spin iterations per failed CAS impacted the performance.  If I had this set too low, the BCL locked Stack<T> implementation would outperform my fancy lock-free implementation.  In the end, I found that a max spin iteration of around 20K per CPU worked best and gave me roughly 2-3 times the throughput of the BCL locking implementation.  My spin wait actually starts at a number of around 4K for a 4-CPU box and works its way up to 80K under very rare circumstances.

    One area I think there may be some opportunity for improvement is how I allocate the array of nodes and initialize it with new nodes.  For my performance tests, I allocated enough nodes up front so that no push operations would need to allocate another block.  When running the test with 30 million items, this allocation takes quite a bit of time.

    Anyway, let me know what you think.  Thanks for your help.

    Regards,

    Brandon

  24. Gal Ratner says:

    This is great stuff! Can't wait till it comes out with VS 2010.

  25. Tal says:

    Hi Eric!

    I have started to use the Microsoft Parallel Extensions Jun08 CTP and I have a few comments:

    To queue work to the thread pool (or even to create a Task), the state passed to the delegate must be Object (for Task it's Action<object>) rather than a generic "T" for a custom type.

    The answer seems simple – just box and unbox the state.  But that boxing and unboxing costs a lot of performance.

    Why can't there be functions like ThreadPool.QueueUserWorkItem<T>(WaitCallback callBack, T state) and Task.Create<T>(Action<T> action, T state)?

    And why can't you pass them a delegate that takes no argument?  I know it's possible to do that now, but as you know, that costs performance too…

    One more thing that also bothers me:

    Why is it impossible to create a Task and only start it later?

    Apart from this, I want to say that you've done a good job, and I can't wait for CLR 4.0!

  26. Tal says:

    I read more about the thread pool (and the comments here), and you might want to think about my suggestion:

    You told MDFD about the large amount of memory used when queuing a lot of tasks to the thread pool (say 1,000,000).

    I know you told him to use Parallel.For instead (and I agree), but think about how much memory is wasted.

    There are MTA threads and STA threads, and look at the difference in memory:

    MTA – uses a lot of memory (because there are many of them…)

    STA – uses a little memory (because it's the only one – the main thread)

    The thread pool gathers a lot of tasks (which means a lot of memory), usually from the STA main thread, but it doesn't really matter whether it's STA or MTA.

    The point is that you might think about adding an overload of QUWI that blocks the current thread until there is space in the thread pool, and only then lets the current thread go on.

    That way the thread pool wouldn't be overloaded with tasks, because queuing could block the current thread!

    Think how much memory is wasted in the 1,000,000-task case!

    Another thing that is bugging me about Parallel.For is that it waits until all the tasks are done.

    Why???

    I think the answer is that on a quad-core processor it waits because we want 4 slots, one for each core (length/processors iterations per core).

    But if the task we wait for is heavy, it's a waste of time!

    I suggest that it start immediately when there is space for a task.  When another slot becomes available, divide the For-loop task into 2 tasks, and so on.

    An example process on a quad-core:

    1. The user asks to run a For loop (1,000 iterations).

    2. The program waits until there is free space for a task.

    3. Another task finishes, so the For loop begins with i=0 and a max of 1,000.

    4. Another task finishes, and the for loop is currently at 600 (unstarted), so the program divides the remaining 400 in a ratio of 1:3 – the max of the current loop drops to 700, and we create a task that does 700 to 1,000.

    5. Another task finishes (no matter whether it's a loop task or a simple task), so the 700-to-1,000 loop is divided in a ratio of 1:2.

    6. The same thing, at a ratio of 1:1.

    7. One of the loop tasks finishes, so we keep dividing the same way.

    How simple is this!

    Isn't it?

    I hope my point was clear and that I haven't misunderstood anything.

  27. Kapil says:

    I'm trying to process a very high number of messages per second that arrive through a socket.  As each message is received, I need to process it.  The processing can be done in any order, irrespective of the order in which messages are received.  We are currently using the system.collections.generic.queue class, and we are creating threads using the system.threading.thread class.  The threads run continuously, and we use thread.suspend when the queue is empty and thread.resume when a new item is inserted into the queue.  Processing time is very important in our application.  We are currently using CLR 3.5; please tell us how to improve the performance.
