Achieving Speedups with Small Parallel Loop Bodies

The Parallel class represents a significant advancement in parallelizing managed loops.  For many common scenarios, it just works, resulting in terrific speedups.  However, while ideally Parallel.For could be all things to all people, such things rarely work out, and we’ve had to prioritize certain scenarios over others.

One area Parallel.For may fall a bit short is in attempts to use it with very small loop bodies, such as:

int [] array = new int[100000000];
Parallel.For(0, array.Length, i=>
{
    array[i] = i*i*i;
});

Such an operation should be readily parallelizable; after all, every iteration is completely independent of every other iteration, and we’re dealing with a big data parallel problem.  The devil is in the details, however.  To handle typical scenarios where the time it takes to complete each iteration ends up being non-uniform, Parallel.For’s implementation takes a lot of care to load balance across all threads participating in the loop, and that load balancing comes at a small performance cost.  This cost is typically trumped by the performance benefits that come from doing the load balancing; however, when the body of the loop is as tiny as it is in the example above, even small overheads add up.  Another overhead that also contributes is the delegate invocation required to invoke the loop body.  It can be easy to forget when looking at a Parallel.For call that Parallel.For is really just a method, accepting as a parameter a delegate to be invoked for every iteration.  That invocation isn’t free, and in the above case may even be more expensive than the body of the loop itself.

Fear not, however, as there exist ways to still achieve good speedups on such cases.  One way is based on creating larger chunks of work for Parallel to operate on: as the chunk size increases, the overhead costs start to pale in comparison, and speedups are realized. 

Consider a new ForRange method you could implement:

public static ParallelLoopResult ForRange(
    int fromInclusive, int toExclusive, Action<int, int> body);

Unlike For, which invokes the body once per iteration, ForRange will invoke the body with a start and end of a range.  Thus, given an initial sequential loop like the following:

for(int i=0; i<N; i++)
{
    DoWork(i);
}

with For it would be parallelized by replacing the for loop with a Parallel.For:

Parallel.For(0, N, i=>
{
    DoWork(i);
});

and with ForRange, it would be parallelized by wrapping the for loop with a ForRange:

ForRange(0, N, (from,to) =>
{
    for(int i=from; i<to; i++)
    {
        DoWork(i);
    }
});

There are several ways we can now implement ForRange.  The first is simply by doing a little math.  We can calculate the boundaries of each range and use a Parallel.For to run the user-supplied body action for each range, e.g.

public static ParallelLoopResult ForRange(
    int fromInclusive, int toExclusive, Action<int, int> body)
{
    int numberOfRanges = Environment.ProcessorCount;
    int range = toExclusive – fromInclusive;
    int stride = range / numberOfRanges;
    if (range <= 0) numberOfRanges = 0;
    return Parallel.For(0, numberOfRanges, i => {
        int start = i * stride;
        int end = (i == numberOfRanges – 1) ? toExclusive : start + stride;
        body(start, end);
    });
}

Another way is actually by using Parallel.ForEach under the covers.  Rather than doing the math as was done above, we can write an iterator in C# that yields the ranges, and then Parallel.ForEach over those ranges, e.g.

public static ParallelLoopResult ForRange(
    int fromInclusive, int toExclusive, Action<int, int> body)
{
    int rangeSize = (toExclusive – fromInclusive) / Environment.ProcessorCount;
    if (rangeSize == 0) rangeSize = 1;
    return Parallel.ForEach(
        CreateRanges(fromInclusive, toExclusive, rangeSize), range =>
    {
        body(range.Item1, range.Item2);
    });
}

private static IEnumerable<Tuple<int,int>> CreateRanges(
    int fromInclusive, int toExclusive, int rangeSize)
{
    // Enumerate all of the ranges
    for (int i = fromInclusive; i < toExclusive; i += rangeSize)
    {
        int from = i, to = i + rangeSize;
        if (to > toExclusive) to = toExclusive;
        yield return Tuple.Create(from, to);
    }
}

(You can download an implementation of ForRange as part of the Beta 1 samples at http://code.msdn.microsoft.com/ParExtSamples.)

In general, we expect the design and implementation of Parallel.For will be right for the vast majority of scenarios.  However, solutions like those above can be used to accommodate cases that don’t quite fit the mold.