New book: Parallel Programming with Microsoft Visual Studio 2010 Step by Step

We're happy to announce the arrival of Parallel Programming with Microsoft Visual Studio 2010 Step by Step (ISBN 9780735640603; 256 pages) by Donis Marshall.

As chip speed flattens out and multi-core chips become ubiquitous, developers can no longer rely on clock rate advances to improve the speed of software. Instead, developers need to focus on how to identify the independent portions of their code, and program them to run simultaneously, in parallel, so they can take advantage of the new multi-core architecture. Not only does parallel programming require a change in code, it also requires a change in the way developers think about code.

Fortunately, Visual Studio and the .NET Framework, with the Task Parallel Library, make entering the world of parallel programming easier than it has ever been before. But Donis explains this far better. Here's the complete first chapter.

Chapter 1

Introduction to Parallel Programming

After completing this chapter, you will be able to

· Explain parallel programming goals, various hardware architectures, and basic concepts of concurrent and parallel programming.

· Define the relationship between parallelism and performance.

· Calculate speedup with Amdahl’s Law.

· Calculate speedup with Gustafson’s Law.

· Recognize and apply parallel development design patterns.

Parallel programming will change the computing universe for personal computers. That is a grandiose statement! However, it reflects the potential impact as parallel computing moves from the halls of academia, science labs, and larger systems to your desktop. The goal of parallel programming is to improve performance by optimizing the use of the available processor cores through parallel execution. This goal becomes increasingly important as the trend of constantly increasing processor speed slows.

Moore’s Law predicted the doubling of transistor capacity per square inch of integrated circuit every two years. Gordon Moore made this proclamation in the mid-1960s and predicted that the trend would continue for at least ten years, but Moore’s Law has actually held true for nearly fifty years. Moore’s prediction is often interpreted to mean that processor speed would double every couple of years. However, cracks are appearing in that interpretation: clock speeds have largely plateaued even as transistor counts continue to grow. Developers now have to find other means of satisfying customer demands for quicker applications, additional features, and greater scope. Parallel programming is one solution. In this way, the performance gains associated with Moore’s Law can continue into the indefinite future.

Microsoft recognizes the vital role of parallel programming for the future. That is the reason parallel programming was promoted from an extension to a core component of the common language runtime (CLR). New features have been added to the Microsoft .NET Framework 4 and Microsoft Visual Studio 2010 in support of parallel programming. This is in recognition of the fact that parallel programming is quickly becoming mainstream technology with the increased availability of multicore processors in personal computers.

Parallel code is undoubtedly more complicated than the sequential version of the same application, whether you are parallelizing existing code or developing a new application. New debugging windows were added to Visual Studio 2010 specifically to help maintain and debug parallel applications. Both the Parallel Tasks and Parallel Stacks windows help you interpret an application in the context of parallel execution and tasks. For performance tuning, the Visual Studio Profiler and Concurrency Visualizer work together to analyze a parallel application and present graphs and reports that help developers isolate potential problems.

Parallel programming is a broad technology domain. Some software engineers have spent their careers researching and implementing parallel code. Fortunately, the .NET Framework 4 abstracts away much of this internal detail, allowing you to focus on writing a parallel application for a business or personal requirement. However, it can be quite helpful to understand the goals, constraints, and underlying motivations of parallel programming.

Article I. Multicore Computing

In the past, software developers benefitted from the continual performance gains of new hardware in single-core computers. If your application was slow, just wait—it would soon run faster because of advances in hardware performance. Your application simply rode the wave of better performance. However, you can no longer depend on consistent hardware advancements to assure better-performing applications!

As performance improvement in new generations of processor cores has slowed, you now benefit from the availability of multicore architecture. This allows developers to continue to realize increases in performance and to harness that speed in their applications. However, it does require somewhat of a paradigm shift in programming, which is the purpose of this book.

At the moment, dual-core and quad-core machines are the de facto standard. In North America and other regions, you probably cannot (and would not want to) purchase a single-core desktop computer at a local computer store today.

Single-core computers have constraints that prevent the continuation of the performance gains that were possible in the past. The primary constraint is the correlation of processor speed and heat. As processor speed increases, heat increases disproportionally. This places a practical threshold on processor speed. Solutions have not been found to significantly increase computing power without the heat penalty. Multicore architecture is an alternative, where multiple processor cores share a chip die. The additional cores provide more computing power without the heat problem. In a parallel application, you can leverage the multicore architecture for potential performance gains without a corresponding heat penalty.

Multicore personal computers have changed the computing landscape. Until recently, the single-core computer was the most prevalent architecture for personal computers, but that is changing rapidly; the shift represents nothing short of the next evolutionary step in computer architecture for personal computers. The combination of multicore architecture and parallel programming will propagate Moore’s Law into the foreseeable future.

With the advent of techniques such as Hyper-Threading Technology from Intel, each physical core becomes two or potentially more virtual cores. For example, a machine with four physical cores appears to have eight logical cores. The distinction between physical and logical cores is transparent to developers and users. In the next ten years, you can expect the number of both physical and virtual processor cores in a standard personal computer to increase significantly.

Section 1.01 Multiple Instruction Streams/Multiple Data Streams

In 1966, Michael Flynn proposed a taxonomy to describe the relationship between concurrent instruction and data streams for various hardware architectures. This taxonomy, which became known as Flynn’s taxonomy, has these categories:

· SISD (Single Instruction Stream/Single Data Stream) This model has a single instruction stream and data stream and describes the architecture of a computer with a single-core processor.

· SIMD (Single Instruction Stream/Multiple Data Streams) This model has a single instruction stream and multiple data streams. The model applies the instruction stream to each of the data streams. Instances of the same instruction stream can run in parallel on multiple processor cores, servicing different data streams. For example, SIMD is helpful when applying the same algorithm to multiple input values.

· MISD (Multiple Instruction Streams/Single Data Stream) This model has multiple instruction streams and a single data stream and can apply multiple parallel operations to a single data source. For example, this model could be used for running various decryption routines on a single data source.

· MIMD (Multiple Instruction Streams/Multiple Data Streams) This model has both multiple instruction streams and multiple data streams. On a multicore computer, each instruction stream runs on a separate processor with independent data. This is the current model for multicore personal computers.

The MIMD model can be refined further as either Multiple Program/Multiple Data (MPMD) or Single Program/Multiple Data (SPMD). Within the MPMD subcategory, a different process executes independently on each processor. For SPMD, the process is decomposed into separate tasks, each of which represents a different location in the program. The tasks execute on separate processor cores. This is the prevailing architecture for multicore personal computers today.

The following table plots Flynn’s taxonomy.

Flynn’s Taxonomy

                               Single Data Stream      Multiple Data Streams
Single Instruction Stream      SISD                    SIMD
Multiple Instruction Streams   MISD                    MIMD

Additional information about Flynn’s taxonomy is available at Wikipedia: https://en.wikipedia.org/wiki/Flynn's_taxonomy.

Section 1.02 Multithreading

Threads represent actions in your program. A process itself does nothing; instead, it hosts the resources consumed by the running application, such as the heap and the stack. A thread is one possible path of execution in the application. Threads can perform independent tasks or cooperate on an operation with related tasks.
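To make this concrete, here is a minimal C# sketch (not from the book) that starts two threads inside one process; the messages and structure are purely illustrative.

```csharp
using System;
using System.Threading;

class ThreadBasics
{
    static void Main()
    {
        // Each thread is an independent path of execution within the same process.
        Thread worker1 = new Thread(() => Console.WriteLine("Worker 1 running"));
        Thread worker2 = new Thread(() => Console.WriteLine("Worker 2 running"));

        worker1.Start();
        worker2.Start();

        // Wait for both threads to finish before the process exits.
        worker1.Join();
        worker2.Join();
    }
}
```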

Parallel applications are also concurrent. However, not all concurrent applications are parallel. Concurrent applications can run on a single core, whereas parallel execution requires multiple cores. The distinction comes down to interleaving. When multiple threads run concurrently on a single-processor computer, the Windows operating system interleaves the threads in a round-robin fashion, based on thread priority and other factors. In this manner, the processor is shared between several threads. You can consider this logical parallelism. With physical parallelism, work is decomposed into tasks and executed in parallel on separate processor cores.

A thread is preempted when it is interrupted so that another thread can run. At that time, the running thread yields execution to the next thread. In this manner, threads are interleaved on a single processor. When a thread is preempted, the operating system preserves the state of the running thread and loads the state of the next thread, which is then able to execute. Exchanging running threads on a processor triggers a context switch and a transition between kernel and user mode. Context switches are expensive, so reducing the number of context switches is important to improving performance.

Threads are preempted for several reasons:

· A higher priority thread needs to run.

· Execution time exceeds a quantum.

· An input-output request is received.

· The thread voluntarily yields the processor.

· The thread is blocked on a synchronization object.

Even on a single-processor machine, there are advantages to concurrent execution:

· Multitasking

· A responsive user interface

· Asynchronous input-output

· Improved graphics rendering

Parallel execution requires multiple cores so that threads can execute in parallel without interleaving. Ideally, you want to have one thread for each available processor. However, that is not always possible. Oversubscription occurs when the number of threads exceeds the number of available processors. When this happens, interleaving occurs for the threads sharing a processor. Conversely, undersubscription occurs when there are fewer threads than available processors. When this happens, you have idle processors and less-than-optimum CPU utilization. Of course, the goal is maximum CPU utilization while balancing the potential performance degradation of oversubscription or undersubscription.
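As a rough sketch of balancing subscription, the following code (not from the book) reads Environment.ProcessorCount and caps the degree of parallelism of a TPL loop at that value; the loop body is a stand-in for real work.

```csharp
using System;
using System.Threading.Tasks;

class SubscriptionSketch
{
    static void Main()
    {
        // Environment.ProcessorCount reports the number of logical cores.
        int cores = Environment.ProcessorCount;
        Console.WriteLine("Logical cores: " + cores);

        var options = new ParallelOptions
        {
            // Capping the degree of parallelism at the core count helps avoid
            // oversubscription; left unset, the scheduler decides for you.
            MaxDegreeOfParallelism = cores
        };

        Parallel.For(0, 100, options, i =>
        {
            double result = Math.Sqrt(i);   // simulated work item
        });
    }
}
```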

As mentioned earlier, context switches adversely affect performance. However, some context switches are more expensive than others; one of the more expensive ones is a cross-core context switch. A thread can run on a dedicated processor or across processors. Threads serviced by a single processor have processor affinity, which is more efficient. Preempting and scheduling a thread on another processor core causes cache misses, access to local memory as the result of cache misses, and excess context switches. In aggregate, this is called a cross-core context switch.

Section 1.03 Synchronization

Multithreading involves more than creating multiple threads. The steps required to start a thread are relatively simple. Managing those threads for a thread-safe application is more of a challenge. Synchronization is the most common tool used to create a thread-safe environment. Even single-threaded applications use synchronization on occasion. For example, a single-threaded application might synchronize on kernel-mode resources, which are shareable across processes. However, synchronization is more common in multithreaded applications where both kernel-mode and user-mode resources might experience contention. Shared data is a second reason for contention between multiple threads and the requirement for synchronization.

Most synchronization is accomplished with synchronization objects. There are dedicated synchronization objects, such as mutexes, semaphores, and events. General-purpose objects that are also available for synchronization include processes, threads, and registry keys. For example, you can synchronize on whether a thread has finished executing. Most synchronization objects are kernel objects, and their use requires a context switch. Lightweight synchronization objects, such as critical sections, are user-mode objects that avoid expensive context switches. In the .NET Framework, the lock statement and the Monitor type are wrappers for native critical sections.
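As a minimal sketch of these wrappers, the following hypothetical Account class guards its balance with the lock statement and, equivalently, with Monitor called directly.

```csharp
using System.Threading;

class Account
{
    private readonly object _sync = new object();
    private decimal _balance;

    public void Deposit(decimal amount)
    {
        // The lock statement expands to Monitor.Enter/Monitor.Exit,
        // a wrapper over a lightweight user-mode critical section.
        lock (_sync)
        {
            _balance += amount;
        }
    }

    public void Withdraw(decimal amount)
    {
        // The same protection written out with Monitor directly.
        Monitor.Enter(_sync);
        try
        {
            _balance -= amount;
        }
        finally
        {
            Monitor.Exit(_sync);
        }
    }
}
```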

Contention occurs when a thread cannot obtain a synchronization object or access shared data for some period of time. The thread typically blocks until the entity is available. When the contention is short, the overhead of blocking and the related context switch is relatively costly; if short contention is the typical pattern, that overhead can become nontrivial. In this scenario, an alternative to blocking is spinning. Applications have the option to spin in user mode, consuming CPU cycles but avoiding a kernel-mode switch. After a short while, the thread can reattempt to acquire the shared resource. If the contention is short, you might acquire the resource on this second attempt and avoid blocking and a related context switch. Spinning for synchronization is considered lightweight synchronization, and Microsoft has added types such as the SpinWait structure to the .NET Framework for this purpose. For example, spinning constructs are used in many of the concurrent collections in the System.Collections.Concurrent namespace to create thread-safe and lock-free collections.
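The following sketch (assuming the .NET Framework 4) combines one of those concurrent collections with SpinWait.SpinUntil as a lightweight alternative to blocking; the producer workload is illustrative only.

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class SpinningSketch
{
    static void Main()
    {
        // ConcurrentQueue<T> is one of the thread-safe, largely lock-free
        // collections in System.Collections.Concurrent.
        var queue = new ConcurrentQueue<int>();

        Task producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < 1000; i++)
            {
                queue.Enqueue(i);
            }
        });

        // SpinWait.SpinUntil spins briefly in user mode before yielding,
        // avoiding an immediate kernel-mode transition for a short wait.
        SpinWait.SpinUntil(() => !queue.IsEmpty);

        producer.Wait();
    }
}
```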

Most parallel applications rely on some degree of synchronization. Developers often consider synchronization a necessary evil. Overuse of synchronization is unfortunate, because most parallel programs perform best when running in parallel with no impediments. Serializing a parallel application through synchronization is contrary to the overall goal. In fact, the potential speed improvement of a parallel application is limited by the proportion of the application that runs sequentially. For example, when 40 percent of an application executes sequentially, only the remaining 60 percent can benefit from parallel execution, which caps the theoretical speedup at 2.5 regardless of the number of cores. Most parallel applications start with minimal synchronization. However, synchronization too often becomes the preferred resolution to any problem. In this way, synchronization spreads quickly, like moss on a tree. In extreme circumstances, the result is a complex sequential application that for some reason has multiple threads. In your own programs, make an effort to keep parallel applications parallel.

Article II. Speedup

Speedup is the expected performance benefit from running an application on a multicore versus a single-core machine. When speedup is measured, single-core machine performance is the baseline. For example, assume that the duration of an application on a single-core machine is six hours. The duration is reduced to three hours when the application runs on a quad-core machine. The speedup is 2 (6/3); in other words, the application is twice as fast.

You might expect that an application running on a single-core machine would run twice as quickly on a dual-core machine, and that a quad-core machine would run the application four times as fast. But that’s not exactly correct. With some notable exceptions, such as superlinear speedup, linear speedup is not achievable even when the entire application runs in parallel, because there is always some overhead from parallelizing an application, such as scheduling threads onto separate processors.

Here are some of the limitations to linear speedup of parallel code:

· Serial code

· Overhead from parallelization

· Synchronization

· Sequential input/output

Predicting speedup is important in designing, benchmarking, and testing your parallel application. Fortunately, there are formulas for calculating speedup. One such formula is Amdahl’s Law. Gene Amdahl created Amdahl’s Law in 1967 to calculate maximum speedup for parallel applications.

Section 2.01 Amdahl’s Law

Amdahl’s Law calculates the speedup of parallel code based on three variables:

· Duration of running the application on a single-core machine

· The percentage of the application that is parallel

· The number of processor cores

Here is the formula, which returns the ratio of single-core versus multicore performance.

Speedup = 1 / (1 – P + P/N)

This formula uses the duration of the application on a single-core machine as the benchmark. The numerator of the equation represents that base duration, which is always one. The dynamic portion of the calculation is in the denominator. The variable P is the percent of the application that runs in parallel, and N is the number of processor cores.

As an example scenario, suppose you have an application that is 75 percent parallel and runs on a machine with three processor cores. The first step in calculating Amdahl’s Law is shown below. In the formula, P is .75 (the parallel portion) and N is 3 (the number of cores).

Speedup = 1 / ((1 – .75) + (.75/3))

You can reduce that as follows.

Speedup = 1 / (.25 + .25)

The final result is a speedup of two. Your application will run twice as fast on a three-processor-core machine.
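If you want to verify the arithmetic yourself, a small sketch like the following (not from the book) evaluates Amdahl's formula for the scenario above.

```csharp
using System;

class AmdahlSketch
{
    // Amdahl's Law: speedup = 1 / ((1 - P) + P / N),
    // where P is the parallel portion and N is the number of cores.
    static double AmdahlSpeedup(double parallelPortion, int cores)
    {
        return 1.0 / ((1.0 - parallelPortion) + parallelPortion / cores);
    }

    static void Main()
    {
        // The scenario from the text: 75 percent parallel on three cores.
        Console.WriteLine(AmdahlSpeedup(0.75, 3));   // prints 2
    }
}
```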

 

Speedup = 2


Visualizing speedup can help you interpret the meaning of Amdahl’s Law. In the following diagram, the evaluation of speedup is presented as a graph. Duration is represented as units of equal length. On a single-core machine, application duration is four units. One of those units contains code that must execute sequentially. This means that 75 percent of the application can run in parallel. Again, in this scenario, there are three available processor cores. Therefore, the three parallel units can be run in parallel and coalesced into a single unit of duration. As a result, both the sequential and parallel portions of the application require one unit of duration. So you have a total of two units of duration—down from the original four—which is a speedup of two. Therefore, your application runs twice as fast. This confirms the previous calculation that used Amdahl’s Law.

[Figure: Four units of duration on a single core coalesce to two units of duration on three cores, a speedup of two.]

You can find additional information on Amdahl’s Law at Wikipedia: https://en.wikipedia.org/wiki/Amdahl%27s_Law.

Section 2.02 Gustafson’s Law

John Gustafson and Edward Barsis introduced Gustafson’s Law in 1988 as a competing principle to Amdahl’s Law. As demonstrated, Amdahl’s Law predicts performance as processors are added to the computing environment. This is called the speedup, and it represents the performance dividend. In the real world, that performance dividend is sometimes repurposed. Money and computing power share a common attribute: both tend to expand to consume the available resources. For example, suppose an application completes a particular operation in a fixed duration. The performance dividend could be used to complete the work more quickly, but it is just as likely that the performance dividend is simply used to complete more work within the same fixed duration. When this occurs, the performance dividend is not passed along to the user. However, the application accomplishes more work or offers additional features. In this way, you still receive a significant benefit from a parallel application running in a multicore environment.

Amdahl’s Law does not take these real-world considerations into account. Instead, it assumes a fixed relationship between the parallel and serial portions of the application. For example, you might have an application that is split evenly into a sequential portion and a parallel portion. Amdahl’s Law maintains these proportions as additional processors are added, so the serial and parallel portions each remain half of the program. But in the real world, as computing power increases, more work gets completed, so the relative duration of the sequential portion is reduced. In addition, Amdahl’s Law does not account for the overhead required to schedule, manage, and execute parallel tasks. Gustafson’s Law takes both of these additional factors into account.

Here is the formula to calculate speedup by using Gustafson’s Law.

Speedup = (S + N(1 – S)) / (S + (1 – S)) – On

In the above formula, S is the percentage of the serial code in the application. N is the number of processor cores, and On is the overhead from parallelization.
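As a quick sanity check of the formula, the following sketch (not from the book) evaluates it for hypothetical values of S, N, and On.

```csharp
using System;

class GustafsonSketch
{
    // Gustafson's Law as given above: speedup = (S + N(1 - S)) / (S + (1 - S)) - On,
    // where S is the serial percentage, N the core count, and On the parallel overhead.
    static double GustafsonSpeedup(double serialPortion, int cores, double overhead)
    {
        double numerator = serialPortion + cores * (1.0 - serialPortion);
        double denominator = serialPortion + (1.0 - serialPortion);   // always 1
        return numerator / denominator - overhead;
    }

    static void Main()
    {
        // Hypothetical values: 25 percent serial, four cores, 0.5 units of overhead.
        Console.WriteLine(GustafsonSpeedup(0.25, 4, 0.5));   // 3.25 - 0.5 = 2.75
    }
}
```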

Article III. Software Patterns

Parallel programming is not a new concept; it has been around for some time, although mostly on large or distributed systems. Parallel computing has more recently been available on personal computers with Threading Building Blocks (TBB), OpenMP, and other parallel solutions. So although parallel computing might seem new, the concepts are actually mature. For example, design patterns have been developed to help programmers design, architect, and implement a robust, correct, and scalable parallel application. The book Patterns for Parallel Programming by Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill (Addison-Wesley Professional, 2004) provides a comprehensive study on parallel patterns along with a detailed explanation of the available design patterns and best practices for parallel programming. Another book, Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures by Colin Campbell et al. (Microsoft Press, 2010) is an important resource for patterns and best practices that target the .NET Framework and TPL.

Developers on average do not write much unique code; most code concepts have been written somewhere before. Software pattern engineers research this universal code base to isolate standard patterns and solutions for common problems in a domain space. You can use these patterns as the building blocks that form the foundation of a stable application. Around these core components you add the unique code for your application, as illustrated in the following diagram. This approach not only results in a stable application but is also a highly efficient way to develop an application.

[Diagram: Standard design patterns form the core building blocks of the application, surrounded by unique application code.]

Design patterns should be an integral part of the software development life cycle of every application. These patterns require thorough knowledge of your problem domain. All object-oriented programs model a problem domain, and parallel applications are no exception. As applications become more complex, knowing the problem domain increases in importance.

Patterns for Parallel Programming defines four phases of parallel development:

· Finding Concurrency

· Algorithm Structures

· Support Structures

· Implementation Mechanisms

The first two phases are design and analysis, which include tasks such as finding exploitable concurrency. These phases are the precursors to actually writing code. Later, you map the analysis onto code by using the Support Structures and Implementation Mechanisms phases. The Implementation Mechanisms design phase is not reviewed in this chapter. You can consider the TPL as a generic implementation of this pattern; it maps parallel programming onto the .NET Framework.

I urge you to explore parallel design patterns so you can benefit from the hard work of other parallel application developers.

Section 3.01 The Finding Concurrency Pattern

The first phase is the most important. In this phase, you identify exploitable concurrency. The challenge involves more than identifying opportunities for concurrency, because not every potential concurrency is worth pursuing. The goal is to isolate opportunities of concurrency that are worth exploiting.

The Finding Concurrency pattern begins with a review of the problem domain. Your goal is to isolate tasks that are good candidates for parallel programming—or conversely, exclude those that are not good candidates. You must weigh the benefit of exposing specific operations as parallel tasks versus the cost. For example, the performance gain for parallelizing a for loop with a short operation might not offset the scheduling overhead and the cost of running the task.

When searching for potential parallel tasks, review extended blocks of compute-bound code first. This is where you will typically find the most intense processing, and therefore also the greatest potential benefit from parallel execution.

Next, you decompose exploitable concurrency into parallel tasks. You can decompose operations on either the code or data axis (Task Decomposition and Data Decomposition, respectively). The typical approach is to decompose operations into several units. It’s easier to load balance a large number of discrete tasks than a handful of longer tasks. In addition, tasks of relatively equal length are easier to load balance than tasks of widely disparate length.

(a) The Task Decomposition Pattern

In the Task Decomposition pattern, you decompose code into separate parallel tasks that run independently, with minimal or no dependencies. For example, functions are often excellent candidates for refactoring as parallel tasks. In object-oriented programming, functions should do one thing. However, this is not always the case. For longer functions, evaluate whether the function performs more than one task. If so, you might be able to decompose the function into multiple discrete tasks, improving the opportunities for parallelism.
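As a minimal illustration, the following sketch (not from the book) runs three hypothetical, independent operations as parallel tasks with Parallel.Invoke from the TPL.

```csharp
using System.Threading.Tasks;

class TaskDecompositionSketch
{
    // Hypothetical independent operations factored out of one long function.
    static void LoadCustomers() { /* ... */ }
    static void LoadOrders()    { /* ... */ }
    static void LoadInventory() { /* ... */ }

    static void Main()
    {
        // Each method does one thing and has no dependency on the others,
        // so the three calls can run as separate parallel tasks.
        Parallel.Invoke(LoadCustomers, LoadOrders, LoadInventory);
    }
}
```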

(b) The Data Decomposition Pattern

In the Data Decomposition pattern, you decompose data collections, such as lists, stacks, and queues, into partitions for parallel processing. Loops that iterate over data collections are the best locations for decomposing tasks by using the Data Decomposition pattern. Each task is identical but is assigned to a different portion of the data collection. If the tasks have short durations, you should consider grouping multiple tasks together so that they execute as a chunk on a thread, to improve overall efficiency.
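Here is a brief sketch (not from the book) of both ideas: a straightforward Parallel.ForEach over a collection, and a range partitioner that groups short iterations into chunks; the data and the squaring operation are illustrative.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class DataDecompositionSketch
{
    static void Main()
    {
        var values = new List<int> { 1, 2, 3, 4, 5, 6, 7, 8 };

        // Each iteration applies the same operation to a different element.
        Parallel.ForEach(values, value =>
        {
            Console.WriteLine(value * value);
        });

        // For very short iterations, a range partitioner chunks the work so
        // that several iterations execute together on one thread.
        Parallel.ForEach(Partitioner.Create(0, values.Count), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                Console.WriteLine(values[i] * values[i]);
            }
        });
    }
}
```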

(c) The Group Tasks Pattern

After completing the Task and Data Decomposition patterns, you will have a basket of tasks. The next two patterns identify relationships between these tasks. The Group Tasks pattern groups related tasks, whereas the Order Tasks pattern imposes an order on the execution of tasks.

You should consider grouping tasks in the following circumstances:

· Group tasks together that must start at the same time. The Order Tasks pattern can then refer to this group to set that constraint.

· Group tasks that contribute to the same calculation (reduction).

· Group tasks that share the same operation, such as a loop operation.

· Group tasks that share a common resource, where simultaneous access is not thread safe.

The most important reason to create task groups is to place constraints on the entire group rather than on individual tasks.

(d) The Order Tasks Pattern

The Order Tasks pattern is the second pattern that sets dependencies based on task relationships. This pattern identifies dependencies that place constraints on the order (the sequence) of task execution. In this pattern, you often reference groups of tasks defined in the Group Task pattern. For example, you might reference a group of tasks that must start together.

Do not overuse this pattern. Ordering implies synchronization at best, and sequential execution at worst.

Some examples of order dependencies are:

· Start dependency This is when tasks must start at the same time. Here the constraint is the start time.

· Predecessor dependency This occurs when one task must start prior to another task.

· Successor dependency This happens when a task is a continuation of another task.

· Data dependency This is when a task cannot start until certain information is available.

(e) The Data Sharing Pattern

Parallel tasks may access shared data, which can create a dependency between tasks. Proper management is essential for correctness and to avoid problems such as race conditions and data corruption. The Data Sharing pattern describes various methods for managing shared data. The goals are to ensure that tasks adhering to this pattern are thread safe and that the application remains scalable.

When possible, tasks should consume thread-local data. Thread-local data is private to the task and not accessible from other tasks. Because of this isolation, thread-local data is exempt from most data-sharing constraints. However, tasks that use thread-local data might require shared data for consolidation, accumulation, or other types of reduction. Reduction is the consolidation of partial results from separate parallel operations into a single value. When the reduction is performed, access to the shared data must be coordinated through some mechanism, such as thread synchronization. This is explained in more detail later in this book.

Sharing data is expensive. Proposed solutions for safely accessing shared data typically involve some sort of synchronization. The best solution for sharing data is not to share data at all. One way to approach that ideal is to copy the shared data to a thread-local variable. You can then access the data privately during a parallel operation. After the operation is complete, you can perform a replacement or some type of merge with the original shared data to minimize synchronization.

The type of data access can affect the level of synchronization. Common data access types are summarized here:

· Read-only This is preferred and frequently requires no synchronization.

· Write-only You must have a policy to handle contention on the write. Alternatively, you can protect the data with exclusive locks. However, this can be expensive. An example of write-only is initializing a data collection from independent values.

· Read-write The key word here is the write. Copy the data to a thread-local variable. Perform the update operation. Write results to the shared data, which might require some level of synchronization. If more reads are expected than writes, you might want to implement a more optimistic data sharing model—for example, spinning instead of locking.

· Reduction The shared data is an accumulator. Copy the shared data to a thread-local variable. You can then perform an operation to generate a partial result. A reduction task is responsible for applying partial results to some sort of accumulator. Synchronization is limited to the reduction method, which is more efficient. This approach can be used to calculate sums, averages, counts, maximum values, minimum values, and more (see the sketch after this list).
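Here is the sketch mentioned above: a Parallel.For overload from the TPL that gives each task a thread-local subtotal and limits synchronization to the final reduction step; the data is illustrative.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ReductionSketch
{
    static void Main()
    {
        int[] data = new int[1000];
        for (int i = 0; i < data.Length; i++) data[i] = i;

        long total = 0;

        // Each task accumulates into a private, thread-local subtotal, so the
        // shared accumulator is touched only in the final reduction step.
        Parallel.For(0, data.Length,
            () => 0L,                                   // thread-local initial value
            (i, loopState, subtotal) => subtotal + data[i],
            subtotal => Interlocked.Add(ref total, subtotal));

        Console.WriteLine(total);   // 499500
    }
}
```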

Section 3.02 The Algorithm Structure Pattern

The result of the Finding Concurrency phase is a list of tasks, dependencies, and constraints for a parallel application. The phase also involves grouping related tasks and setting criteria for ordering tasks. In the Algorithm Structure phase, you select the algorithms you will use to execute the tasks. These are the algorithms that you will eventually implement for the program domain.

The algorithms included in the Algorithm Structure pattern adhere to four principles. These algorithms must:

· Make effective use of processors.

· Be transparent and maintainable by others.

· Be agnostic to hardware, operating system, and environment.

· Be efficient and scalable.

As mentioned, algorithms are implementation-agnostic. You might have constructs and features in your environment that help with parallel development and performance. The Implementation Mechanisms phase describes how to implement parallel patterns in your specific environment.

The Algorithm Structure pattern introduces several patterns based on algorithms:

· Task Parallelism Pattern Arrange tasks to run efficiently as independent parallel operations. Actually, having slightly more tasks than processor cores is preferable—especially for input/output bound tasks. Input/output bound tasks might become blocked during input/output operations. When this occurs, extra tasks might be needed to keep additional processor cores busy.

· Divide and Conquer Pattern Decompose a serial operation into parallel subtasks, each of which returns a partial solution. These partial solutions are then reintegrated to calculate a complete solution. Synchronization is required during the reintegration but not during the entire operation.

· Geometric Decomposition Pattern Reduce a data collection into chunks that are assigned the same parallel operation. Larger chunks can be harder to load balance, whereas smaller chunks are better for load balancing but are less efficient relative to parallelization overhead.

· Recursive Data Pattern Perform parallel operations on recursive data structures, such as trees and linked lists.

· Pipeline Pattern Apply a sequence of parallel operations to a shared collection or independent data. The operations are ordered to form a pipeline of tasks that are applied to a data source. Each task in the pipeline represents a phase. You should have enough phases to keep each processor busy. At the start and end of pipeline operations, the pipeline might not be full. Otherwise, the pipeline is full of tasks and maximizes processor utilization (see the sketch after this list).
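Here is the pipeline sketch mentioned above, built on BlockingCollection<T> from System.Collections.Concurrent; the three phases and their operations are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PipelineSketch
{
    static void Main()
    {
        var stage1Output = new BlockingCollection<int>();
        var stage2Output = new BlockingCollection<int>();

        // Phase 1: produce raw values.
        Task produce = Task.Factory.StartNew(() =>
        {
            for (int i = 1; i <= 10; i++) stage1Output.Add(i);
            stage1Output.CompleteAdding();
        });

        // Phase 2: transform values as they arrive.
        Task transform = Task.Factory.StartNew(() =>
        {
            foreach (int value in stage1Output.GetConsumingEnumerable())
                stage2Output.Add(value * value);
            stage2Output.CompleteAdding();
        });

        // Phase 3: consume the transformed values.
        Task consume = Task.Factory.StartNew(() =>
        {
            foreach (int value in stage2Output.GetConsumingEnumerable())
                Console.WriteLine(value);
        });

        Task.WaitAll(produce, transform, consume);
    }
}
```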

Section 3.03 The Supporting Structures Pattern

The Supporting Structures pattern describes several ways to organize the implementation of parallel code. Fortunately, several of these patterns are already implemented in the TPL as part of the .NET Framework. For example, the .NET Framework 4 thread pool is one implementation of the Master/Worker pattern.

There are four Supporting Structures patterns:

· SPMD (Single Program/Multiple Data) A single parallel operation is applied to multiple data sequences. In a parallel program, the processor cores often execute the same task on a collection of data.

· Master/Worker The process (master) sets up a pool of executable units (workers), such as threads, that execute concurrently. There is also a collection of tasks whose execution is pending. Tasks are scheduled to run in parallel on available workers. In this manner, the workload can be balanced across multiple processors. The .NET Framework 4 thread pool provides an implementation of this pattern.

· Loop Parallelism Iterations of a sequential loop are converted into separate parallel operations. Resolving dependencies between loop iterations is one of the challenges. Such dependencies were perhaps inconsequential in a sequential application but are problematic in a parallel version. The .NET Framework 4 provides various solutions for loop parallelism, including Parallel.For, Parallel.ForEach, and PLINQ (Parallel Language Integrated Query).

· Fork/Join Work is decomposed into separate tasks that complete some portion of the work. A unit of execution, such as a thread, spawns the separate tasks and then waits for them to complete. This is the pattern for the Parallel.Invoke method in the TPL.

The Supporting Structures phase also involves patterns for sharing data between multiple parallel tasks: the Shared Data, Shared Queue, and Distributed Array patterns. These are already implemented in the .NET Framework, available as collections in the System.Collections.Concurrent namespace.

Article IV. Summary

Parallel programming techniques allow software applications to benefit from the rapid shift from single-core to multicore computers. Multicore computers will continue the growth in computing power as promised in Moore’s Law; however, the price for that continued growth is that developers have to be prepared to benefit from the shift in hardware architecture by learning parallel programming patterns and techniques.

In the .NET Framework 4, Microsoft has elevated parallel programming to a core technology with the introduction of the Task Parallel Library (TPL). Previously, parallel programming was an extension of the .NET Framework. Parallel programming has also been added to LINQ as Parallel LINQ (PLINQ).

The goal of parallel programming is to load balance the workload across all the available processor cores for maximum CPU utilization. Applications should scale as the number of processors increases in a multicore environment. Scaling will be less than linear relative to the number of cores, however, because other factors can affect performance.

Multiple threads can run concurrently on the same processor core. When this occurs, the threads alternately use the processor by interleaving. There are benefits to concurrent execution, such as more responsive user interfaces, but interleaving is not parallel execution. On a multicore machine, the threads can truly execute in parallel—with each thread running on a separate processor. This is both concurrent and parallel execution. When oversubscription occurs, interleaving can occur even in a multicore environment.

You can coordinate the actions of multiple threads with thread synchronization, for example, to access shared data safely. Some synchronization objects are lighter than others, so when possible, use critical sections for lightweight synchronization instead of semaphores or mutexes. Critical sections are especially helpful when the duration of contention is expected to be short. Another alternative is spinning, in the hope of avoiding a synchronization lock.

Speedup predicts the performance increase from running an application in a multicore environment. Amdahl’s Law calculates speedup based on the number of processors and the percentage of the application that runs in parallel. Gustafson’s Law calculates a more realistic speedup, one that accounts for the performance dividend being spent on additional work and for the overhead of parallelization.

Parallel computing is a mature concept with best practices and design patterns. The most important phase is Finding Concurrency. In this phase, you identify exploitable concurrency—the code most likely to benefit from parallelization. You can decompose your application into parallel tasks by using Task Decomposition and Data Decomposition patterns. Associations and dependencies between tasks are isolated in the Group Tasks and Order Tasks patterns. You can map tasks onto generic algorithms for concurrent execution in the Algorithm Structure pattern. The last phase, Implementation Mechanisms, is implemented in the TPL. In the next chapter, you will begin your exploration of the TPL with task parallelism.

Article V. Quick Reference

To                                                        Do This
Implement parallel programming in the .NET Framework 4    Leverage the TPL found in the System.Threading.Tasks namespace
Use LINQ with parallel programming                        Use PLINQ
Calculate basic speedup                                   Apply Amdahl’s Law
Find potential concurrency in a problem domain            Apply the Finding Concurrency pattern
Resolve dependencies between parallel tasks               Use the Data Sharing pattern
Unroll sequential loops into parallel tasks               Use the Loop Parallelism pattern