Overview of the Parallel Dwarfs project on Codeplex

Article
12/21/2009

The Parallel Motifs, or Parallel Dwarfs as they are sometimes called, are a collection of algorithm families that are important to parallel computing as they are known to be compute bound. More computational cycles – if applied judiciously – will result in faster execution on a given data set for many algorithm instances from each of these families.

Background & problem

The motifs were originally proposed by Phil Colella and were later expanded upon by David Patterson (see this paper in CACM). There are several obvious problems when trying to experiment with your latest great idea in advancing the state of the art for the motifs.

1. The learning curve for applying parallel programming in different languages, platforms, or paradigms can be high.

2. Mechanisms for understanding complex behavior can be difficult to use successfully.

3. Source code that implements various motif algorithms can be difficult to understand or proprietary.

4. Setup for comparison of runs can be difficult and error prone, particularly when using an HPC environment.

5. Acquiring data files to use as input for the given motifs can be difficult.

What we have done

The Parallel Dwarfs project (see https://paralleldwarfs.codeplex.com) seeks to ameliorate the above issues. We have extended the number of Motifs to 13. For each of the motifs we have selected a representative algorithm from the family and implemented it, with the exception of the N-Body motif. These are the Motifs and the representative algorithm as we have them implemented today:

Motif	Instance
Branch and Bound	Traveling Salesman
Combinational Logic	Counting 1s
Dense Matrix Algebra	Matrix Multiply
Dynamic Programming	0/1 Knapsack Problem
Finite State Machine	Pattern Matching on Strings
Graphical Models	Viterbi Algorithm for Hidden Markov Models
Graph Traversal	Depth First Search
Map reduce	Map reduce
Sparse Matrix Algebra	Matrix Vector Multiply
Spectral Methods	2D FFT
Structured Grid	Laplace’s equation with Dirichlet’s conditions on the square
Unstructured Grid	2D Iterative Triangular Grid

Problem 1: The learning curve for applying parallel programming in different languages, platforms, or paradigms can be high.

To address the first problem we have implemented each of the above Motifs serially in C++ and in C#. Since this is an exercise in parallelism we also implemented the Motifs for shared memory using OpenMP in C++ and TPL in C#. For distributed memory implementations we created MPI implementations in C++ and MPI .NET implementations in C#. Each motif has at least these six implementations that can be directly compared to view performance differences. Several of the motifs have hybrid OpenMP/MPI or TPL/MPI .NET implementations. Additionally, there are a hand full of F# implementations.

These implementations are not intended to represent “best practices” for concurrent programming or “canonical” coding style. Rather they are a starting place to learn about parallel technologies, compare and contrast different styles and technologies in a side by side manner, and to dig deep into performance optimization possibilities without having to fully digest “real world” code.

In all there are 109 motif implementations that can be used as a starting point to understand the parallel programming languages, platforms, and paradigms.

Problem 2: Mechanisms for understanding complex behavior can be difficult to use successfully.

There are many ways to try to understand what is going on inside your program. The dwarfs include an application we call “dwarfbench” to help ease the difficulty in executing these potentially unfamiliar applications for analysis. The following applications can be run within dwarfbench on a specified collection of dwarfs:

Tool	Use
Xperf	Use this tool to understand how dwarfs interact with other processes, disk, memory, and resources in Windows. This can be particularly useful in finding bottlenecks outside of a single process.
Jumpshot	Use this tool to understand the message passing behavior of an MPI program.
Vampir	Use this tool to understand the message passing behavior of an MPI program, as with Jumpshot, however this tool has many advanced features that may give more subtle insights into an MPI program’s behaviors.
Excel	Use this tool to compare dwarf runtimes against each other and to record the command lines necessary to replay the execution.

Problem 3: Source code that implements various motif algorithms can be difficult to understand or proprietary.

The motifs implemented as the parallel dwarfs project are completely open source, released under the MSPL license. They can be used for commercial or personal projects – however, be warned: they are not “best practice” code, nor were they ever intended to be. They are a starting point, a place where your investigations can begin (not end) into parallel and distributed computing.

The algorithms selected to represent each of the motifs is fairly simple to understand. Each kernel implementation uses roughly the same algorithm, making necessary adjustments for the concurrency platform in play. In this way it is easy to see how one who is familiar with OpenMP might implement a similar parallel loop using PPL. Also the source code is well commented and accompanied by a design document that describes the algorithm’s behavior.

Problem 4: Setup for comparison of runs can be difficult and error prone, particularly when using an HPC environment.

One problem I have faced in the past is the monotonous and error prone nature of HPC setup for analysis runs. When there are many permutations of kernels, input files, and parameters that can be tried, a script is a good first step to eliminate some of the tedium. We have taken it up another notch in dwarfbench as dwarfbench handles running the dwarfs for you, even on a Windows HPC Server 2008 cluster.

One way we envision using the parallel dwarfs is by doing A/B comparison of two programs. Suppose you are reading through one of the kernels and you find yourself wondering how making a given change will affect the overall runtime of the dwarf. Simply copy the project and rename it to suit your needs (now there is a naming convention: dwarf.<name>.<platform>.<parallel tech>.exe). When you have made your change and built your project, dwarfbench will discover your new dwarf and make it available for all of the same analysis that can be run on existing kernels.

Problem 5: Acquiring data files to use as input for the given motifs can be difficult.

Since data sets can be private, proprietary, or government secrets (depending on the application), we created a simple random data set generator for each motif. Using a random data set can be bad because it does not necessarily represent real world data. However, given the need we could not see any better alternatives. Send me an email at robert.palmer@microsoft.com if you would like to contribute and have an idea about how we could create more realistic data sets.

Wrapping up

There is a ton of information and detail that I’m leaving out of this post – if you liked this post then that’s the bad news. The good news is that I will be writing future posts on this topic, delving into the deeper usage scenarios around the dwarfs on my blog at https://blogs.msdn.com/robertpalmer.

Robert Palmer - Parallel Computing Platform