We’re currently working on the Beta of .NET 4.0 (no dates to announce) and there are lots o’ new stuff for the Parallel Extensions. We hope you’re as excited about it as we are. Given that we have so much coming down the pipes, we decided to roll out posts on what’s coming in digestible chunks. What’s better to start with than the low-level infrastructure? In addition to the coordination data structures, you can expect posts on changes in the Task Parallel Library (TPL) as well as Parallel LINQ (PLINQ) and a few miscellaneous topics.
Since our last CTP in September ‘08, we’ve added a few new types to add to your parallel-programming arsenal.
In short, ConcurrentBag<T> is a thread-safe, unordered collection of objects. Sounds a bit like a set, but unlike a set, ConcurrentBag<T> allows duplicates. So now you’re thinking, what’s so interesting about an unordered collection that allows duplicates? Truth be told, in a single-thread world, not very much. In a multi-threaded world, however, removing ordering restrictions allows us to do some pretty cool optimizations that makes for a really scalable and efficient type in certain situations.
You can think of ConcurrentBag<T> as a thread-safe linked list of thread-safe double-ended queues (deques). Each thread that touches the bag has it’s own deque to which it adds and removes items from the top. When any thread’s deque is empty, it pulls from the bottom of another thread’s non-empty deque. You can see why this is scalable and efficient: in many cases, there is no contention at all. Contention is only an issue when a thread does not have any items and is forced to “steal” items from another thread. It’s worth noting that the add-to-the-top-steal-from-the-bottom approach plays nice with the cache: threads get to work on items in nice contiguous chunks. In fact, this type of work-stealing queue works so well that it’s principally how the Thread Pool in .NET 4.0 load-balances work.
Just to make it clear, ConcurrentBag<T> isn’t always the fastest choice. Consider the single producer/single consumer scenario: the producer would always steal from the consumer’s deque – ConcurrentQueue<T> would be much more efficient. Scenarios in which threads both produce and consume values will benefit from ConcurrentBag<T> the most.
ConcurrentLinkedList<T> and ConcurrentLinkedListNode<T>
ConcurrentLinkedList<T> is essentially a thread-safe version of LinkedList<T>. Of course there are some caveats, this is parallel programming we’re talking about. 🙂 Just inserting a value into a linked list becomes difficult in a multi-threaded world. Say you want to add items to a list in sorted order. With single-threaded linked lists, this is a simple as walking the list and finding the position where the left node is less than the value you want to insert and the right node is greater than or equal to that value. In a concurrent environment, between the time you’ve found the location and actually added the value, another value could have been inserted in that place or one of the neighboring nodes removed. Thus, ConcurrentLinkedList<T> provides methods that support these tasks with atomic operations, like the TryInsertBetween method:
list.TryInsertBetween( (left, right) => ((left.Value < item) && (right.Value > item)), item);
TryInsertBetween will walk the list and insert the value only when the supplied predicate returns true. It does all this atomically. If it fails, it returns false without changing the list.
Partitioner<T> and OrderablePartitioner<T>
When TPL or PLINQ consume a data structure, they do their best to load balance the elements between threads. Of course, TPL and PLINQ don’t have detailed knowledge about the data sources, such as each structures internals, the values of the data, and the length of the data source – all of which are factors in execution time. The APIs simply see that a type implements IEnumerable<T> or IList<T> and start processing the elements based on enumeration or indexing. Often times, this means that significant performance gains are overlooked. Partitioners allow you to achieve these gains.
For example, say I’ve created a thread-safe collection that protects elements with striped-locking (where elements 0, 3, and 6 are protected by lock A, 1, 4, and 7 by lock B and 2, 5, and 8 protected by lock C). If we just use the standard enumerator and the pattern will look like: acquire lock A, acquire lock B, acquire lock C, acquire lock A, acquire lock B, and so on. This is hardly an efficient manner to partition out elements, especially if other threads are playing around with the data structure. It would be much more efficient if we acquired each lock just once. In a perfect world, if Parallel.ForEach decided that three threads wanted to enumerate this hypothetical data structure, we’d want to give all of the elements under lock A to one thread, the elements of lock B to another and the elements of lock C to the last, significantly reducing the amount of time spent acquiring locks and potentially waiting for locks to be given up.
We’ll be posting more detail on partitioners soon, but suffice it to say that Partitioner<T> allows you to create custom partitioning algorithms for any data source. OrderablePartitioner<T> does the same, but keeps account of each element’s original index so the data can be reordered when necessary. Each of these abstract classes supports both static partitioning as well as dynamic partitioning. In addition to these extension points, we’re providing a few out-of-the-box partitioners, but I’ll keep you in suspense – check back for more detailed posts on partitioners soon!
In addition to adding new types, we also decided to cut one out.
In short, WriteOnce<T> was never really more than a simplification on top of Interlocked.CompareExchange and that doesn’t provide enough value at this point to warrant a new type in the .NET Framework. Farewell!
There are lots of minor refactors and name changes coming in Beta 1 but there’s only one major change for the coordination data structures that we should call out: LazyInit<T>. I won’t list every name change an name space change but it’s important to see how LazyInit<T> has been refactored. Originally, LazyInit<T> was a struct which gave it all the nuances that come with a value-type. We’ve since refactored LazyInit<T> into four types:
- System.Lazy<T>: a class-based lazy initialization primitive that’s safe to pass around.
- System.LazyVariable<T>: a struct-based lazy initialization primitive that has the value-type semantics (namely that you run the risk of re-initialization if you copy it or pass it to a method), but remains light-weight for situations where memory footprint is important.
- System.Threading.LazyInitializer: a set of static methods where memory footprint is really an issue.
- System.Threading.ThreadLocal<T>: originally a mode on LazyInit<T>, this is now its own type.
In addition to the refactoring, Lazy<T> and LazyVariable<T> now have both thread-safe and non-thread-safe initialization modes for those that need the lazy initialization functionality without the overhead of synchronization.
In the Beta 1 release, you’ll find a few types for a unified cancellation model intended to be used throughout the .NET framework. Cancellation holds enough merit to warrant a dedicated post on it so we’ll be posting one soon. What’s exciting is that we’ve added cancellation support throughout TPL, PLINQ and the coordination data structures. For the data structures specifically, we’ve added cancellation overloads to any of our APIs that may block, including:
Again, there are too many minor changes to call out but just be aware that some types have moved to different namespaces and might have some APIs that were renamed.
There are two API additions worth calling out, however:
We realized that a lot of scenarios for SpinWait went something like this… Check a condition. If false, spin. Check a condition, if false, spin. Check a condition. If false spin… We decided to make it a tad easier by allowing you to pass a Func<Boolean> to SpinWait, causing it to spin until the condition was met or a timeout occurred.
ConcurrentStack<T>.PushRange and ConcurrentStack<T>.PopRange
We’re all about squeezing out that last drop of performance and where we can, we do. Take ConcurrentStack<T> for example, which works on a single CompareAndSwap (CAS) operation if there is no contention. Many scenarios involve pushing or popping a series of items on or off the stack which typically results in looping through a collection, calling Push() on each item. This of course, results in N CAS operations and potentially a lot more if the stack is being contended on by multiple threads. Since we only need to do one CAS operation to update ConcurrentStack<T> anyway, we might as well build up a stack segment first and then push the whole segment onto the stack with a single CAS – reducing the number of potential retry CAS operations. Enter PushRange and PopRange, which do exactly that.
Check back soon for what’s new in TPL and PLINQ!