Basics Process for Parallelizing an Application

Many of our customers are asking us for guidance on what they can do to make use of multicore systems.  This is a good time to be thinking about this problem since single thread performance will remain relatively flat for the forseeable future.  As developers throw more features into applications, there's a potential for performance regressions.  So, the question here is really about the process to follow in order to parallelize applications.  Below, I will outline a process that I've been advocating internally in my parallel development class as a practical approach to the problem.  I can provide more details upon request.

Step 1: Identifying Your Goals 

Before we even start discussing a parallelization process, we need to be aware of a primary principle: parallelism is about performance, and the performance that you should care care about is a function of the specific application and usage scenarios.  So, the first step in any performance improvement effort needs to include the following:

  • Identifying the application phases and or usage scenarios that are important from a performance perspective.
  • For each such phase or scenario, decide on a performance goal.  This step is critical because it drives the "completion" criterion of any performance effort.  For some usage scenarios, especially for interactive applications, setting the goals may be relatively straight-forward.  For example, usability experts typically set a limit of 200ms on any operation that is expected to be instantaneous by users.  For other, more time-consuming scenarios, the challenge is often making perceived response time short, so hiding latency is often an important feature of a responsive application.

Step 2: Measuring Your Baseline Performance

Now that you've identified your scenarios, you have to evaluate your current state compared to your ultimate goals.  The important steps here are:

  • Creating a performance test harness for each scenario identified in step 1. 
  • Run these tests on every build to track regressions.
  • Identify scenarios that are missing their goals.
  • For each failing scenario, identify the components of your application that contribute the most to performance.  This is a step that requires two important pieces of work.  First, instrumenting the code to identify where time is spent so that improvements/regressions may be narrowed down easilty.  Second, using profiling tools.  For this last step, you need to be careful about your choice of tools.  I can offer more help here in another post if people are interested.

Step 3: Traditional Tuning Comes First

If a scenario fails, you should attempt traditional performance tuning before considering parallelism.  Why?  Unnecessarily introducing parallelism can become a burden on your development and test resources.  With parallelism, you start introducing nondeterminism, race conditions, synchronization overheads, etc. and these are issues that just increase development and testing overheads.  So, it's best to avoid them. 

The main culprits for single thread performance are I/O and cache behavior (memory latency really).  This is a large topic that we can also get into, but here are some key points that you should keep in mind:

  • Reduce unnecessary I/O.
  • Hide I/O latency when it cannot be avoided.
  • Exploit available bandwidth to reduce latency, but coealescing I/O requests if possible.
  • Improve the temporal and spatial locality of your code to reduce cache misses and associated memory latency.

Step 4: Identify Opportunities for Parallelism

If you're here, then you've attempted step 3 for failing scenarios and you have reached the conclusion that you cannot meet your performance goals unless you parallelize.  There are two classes of parallelism that you can exploit usually: hiding latency from your user (or overlapping work) or reducing latency of computation.  Notice that the first is often related to I/O, but can also be used to hide CPU-bound work.  The second is almost always a technique to speed-up CPU-bound work.  The main forms of parallelism that you're likely to encounter fall under one or more of these categories:

  • Task-Level Parallelism: Multiple time-consuming independent tasks on the critical path can be executed in parallel.  Note that the word "task" has been overloaded significantly in our industry.  Here, I mean the dictionary (not CS-specific) definition.
  • Pipelining: A task may be broken down into independent stages that can execute in parallel.  If many such tasks exist, this can significantly improve the throughput of task hadling at a (hopefully small) latency penalty due to handshaking at stage boundaries.
  • Data Parallelism: This is an area that I predict will see most use for parallelism in the near future.  This form of parallelism exists when the application performance effectively the same work on independent pieces of data.  This can be exploited by dividing the data/work across multiple CPUs as well as possibly using SIMD (Single Instruction Multiple Data) instruction sets (ala MMX/SSE).

Step 5: Decide on Phase(s) to Parallelize

The decision should be based on some estimates of potential improvement.  You can optimistically assume linear speed up (= the number of cores working on a problem) for now.  Estimate, using the performance instrumentation that you've been using so far, what fraction of the target scenario's performance will be improved, and plug-in this data into something like Amdahl's Law to get an estimate of potential speedup for the whole scenario.  This is a critical step to avoid investing time on work that will not achieve your goals.

Step 6: Implement the Parallel Algorithm

This is a large topic, but I offer some guidance here:

  • There are usually well known coding patterns for exploiting the main types of parallelism (e.g., thread pools for task-level parallelism, parallel for loops such as the ones provided by our Task Parallel Library and Parallel Pattern Library).  You should become familiar with the common tools at your disposal.
  • Reduce sharing (especially write sharing) of data across parallelism boundaries to avoid cache coherence overheads with their associated latencies.
  • This is related to the above, but reducing synchronization overheads is critical.  If you come across a solution that requires a lock in an inner-loop, that's probably a bad idea! 
  • Remember that the best sequential algorithm may not be the best parallel algorithm.  So, be open minded to potentially changing your basic implementation.

Step 7: Tune and Iterate

Repeat the measurement step and don't forget about tuning the cache behavior of your parallel implementation.  The tools that my team is shipping in Visual Studio 2010 will be very valuable for parallel tuning.  You'll have to reduce synchronization overheads for example and identify reasons for blocking UI threads, contributing to UI hangs etc. and our tools will be a great help.

 I think that I've included enough here to give you a feel for what the process might look like.  There are many points in this post that can be significantly elaborated on, but I'll leave the choice to you folks.  Let me know what your questions/concerns are and I'll try to answer them.

 Take care.