Are we taking advantage of Parallelism?

Recently, a colleague of mine, Mark Friedman, posted a blog titled “Parallel Scalability Isn’t Child’s Play,” in which he reviewed the merits of Amdahl’s Law vs. Gunther’s Law for determining the practical limits of parallelization. I would not argue with the premise of Mark’s blog that parallelism is not child’s play. However, I do have alternate views on the use of Amdahl’s Law and Gunther’s Law, which I posted as comments on his blog. I think those views and comments warrant another blog post to explain fully.

Speaking of child’s play: my 10-year-old son recently made a two-part movie titled “the Way” and “the Way Back,” complete with a full storyline, multiple soundtracks and narration. He put these movies together with only the help of his eight-year-old sister, using sample movie clips and stock photographs he found on his computer’s hard drive. He asked me for help getting his two masterpieces onto a DVD capable of playing on the average home DVD player. He also asked about the length of a typical movie playing in theaters around the U.S. (approximately 2 hours) and how much these movies cost at the theater (approximately $12 for adults and $8 for children, not counting the popcorn). Based on my answers, he determined that he would charge 25 cents for people to watch his movies, because he wanted everyone to attend. I wanted to ask him how much he would charge someone who decided to watch only one of the two parts. However, I didn’t, because I did not want to lose a price-haggling war with a 10-year-old. Besides, it would be terrible if you could not find your way back.

In any case, his movies were quite impressive. The most technologically savvy thing I did as a 10-year-old was to build a telephone line out of tomato soup cans and a string. Movie making was out of reach for me, but now it is child’s play.

Today, parallelism is not child’s play. However, I hold out hope that in the future the typical computer program will be written with parallelism in mind. Will parallelism ever become child’s play the way movie making is today?

Parallelism exists everywhere: at the instruction, memory, loop and task levels, among others. It has also been with us for quite some time. For the past several decades, hardware engineers have quietly been busy solving problems in parallel to improve processor and system-level performance. However, for the past four or more years, hardware designers have run into the twin brick walls of memory speed and power. These walls have forced CPU architects and hardware designers to go multi-core in a major way. The doubling of CPU frequency every 18 months, which held for decades, is no longer practical and has come to an abrupt end. Although hardware performance continues to improve, as my colleagues and I pointed out in our blog “Investigating a Pleasant Surprise,” the pace of CPU frequency increases has slowed considerably. Instead, hardware designers have been doubling the number of cores available on a single CPU socket every couple of years.

To keep getting the level of performance improvement that was previously possible, software engineers now need to step up to the plate and write software in a parallel and scalable fashion. They need tools and frameworks that allow them to think about their problems, identify opportunities for parallelism, and analyze their solutions correctly and efficiently.

I am a big fan of Amdahl’s Law as an analysis framework. However, I do not subscribe to the narrow view that Amdahl’s Law applies only to parallelism, as most people who write about it seem to imply. I prefer the broader treatment of the Law by Hennessy and Patterson in their famous book “Computer Architecture: A Quantitative Approach,” where Amdahl’s Law is used to estimate the opportunities offered by competing designs. Amdahl’s Law is very powerful for showing the areas that will likely yield the most fruitful performance gains. In my performance design, tuning and optimization work, I use Amdahl’s Law to prioritize the areas of opportunity on which to focus my efforts.
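As a small illustration of that prioritization, here is a minimal sketch in Python (the candidate optimizations, their time fractions and their speedup factors are made-up numbers, not measurements) that applies Amdahl’s Law to rank competing opportunities by the overall speedup each would deliver:

```python
# Amdahl's Law: overall speedup when a fraction f of total execution time
# is improved by a factor s, while the remaining (1 - f) is left untouched.
def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# Hypothetical profile of an application. Each candidate optimization is
# (name, fraction of total time it covers, speedup we believe we can achieve).
# These numbers are illustrative, not measurements.
candidates = [
    ("parallelize image filter loop", 0.40, 8.0),
    ("cache database lookups",        0.25, 3.0),
    ("tune I/O buffering",            0.10, 10.0),
]

# Rank the opportunities by the overall speedup each would deliver.
for name, f, s in sorted(candidates, key=lambda c: amdahl_speedup(c[1], c[2]),
                         reverse=True):
    print(f"{name}: overall speedup = {amdahl_speedup(f, s):.2f}x")
```

Whichever candidate tops that ranking is where I would spend the tuning effort first.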

Amdahl’s Law is not the limit to either absolute performance or parallelism, as many authors seem to suggest. Gunther’s and Gustafson’s Laws are helpful for putting Amdahl’s Law in perspective. However, like Amdahl’s Law, they are not fundamental limits. Using these three laws to estimate the level of parallelism that is possible is deeply flawed. Specifically, treating them as fundamental limits can obscure the level of parallelism and performance inherent in typical computing problems. These laws gloss over a number of important points and practical aspects of obtaining parallelism in general-purpose computing, including that:

1. Many user tasks are non-monolithic and can be solved in a distributed fashion. Background tasks (e.g., virus scans) that often block single-processor execution can now be run in a way that improves the user experience. The key is to identify and remove unnecessary dependencies so that these tasks can proceed in parallel with other tasks on a multi-core computer (a small sketch of this idea appears after this list).

2. Some algorithms that have inefficient sequential solutions surprisingly have efficient parallel solutions. This fact should be comforting to fans of algorithms. For example, many applications require matrix multiplication, which turns out to be easily parallelizable (see the sketch after this list). Although the best known sequential algorithm for matrix multiplication has a time complexity of O(n^2.376), a straightforward parallel solution has an asymptotic time complexity of O(log n) using n^2.376 processors. In other words, we can readily find a parallel solution for matrix multiplication that improves its runtime as more and more processor cores become available. Of course, you might have difficulty conceiving of n^2.376 processors in a system, as a colleague mentioned recently. However, this is just another way of saying that matrix multiplication will benefit from more and more processors.

3. Some poor sequential algorithms can easily be parallelized to execute in less time than the best sequential solutions. We also know that some algorithms achieve the best asymptotic time complexity by introducing data dependencies that make parallelization difficult, and that the best asymptotic time complexity does not necessarily translate to the best runtime in real life. Hence, at some point, the benefit of the simpler parallelization of a poor sequential algorithm with few data dependencies can outweigh the benefit of a more efficient sequential counterpart burdened with data dependencies. So, when considering parallel solutions, it is not always necessary to start with the sequential solution that has the best time complexity [also, see the comment about Fortune and Wyllie below].

4. The real-world performance of applications is not determined exclusively by the asymptotic time complexity of their algorithms. Because of the growing gap between CPU and memory speed, memory accesses increasingly dominate the performance of applications running on modern CPUs. Although the gap can be mitigated with large caches, every cache miss still takes hundreds of CPU cycles to complete. Even a modest overlap among these memory accesses (memory-level parallelism) can improve application performance in noticeable ways.
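To make the first point above concrete, here is a minimal sketch in Python (the “scan” and the foreground workload are hypothetical stand-ins for a real virus scan and real user work) that dispatches a background task to a separate worker process so the foreground work no longer waits behind it:

```python
import concurrent.futures
import time

# Hypothetical stand-ins: a CPU-bound "scan" and a foreground user task.
def background_scan(n):
    return sum(i * i for i in range(n))            # pretend virus scan

def foreground_task():
    return [x ** 0.5 for x in range(1_000_000)]    # pretend user work

if __name__ == "__main__":
    start = time.perf_counter()
    # Dispatch the scan to a separate worker process; the foreground task
    # proceeds immediately instead of waiting behind it.
    with concurrent.futures.ProcessPoolExecutor() as pool:
        scan_future = pool.submit(background_scan, 5_000_000)
        foreground_result = foreground_task()
        scan_result = scan_future.result()
    print(f"scan checksum {scan_result}, foreground items {len(foreground_result)}, "
          f"elapsed {time.perf_counter() - start:.2f}s")
```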
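And for the second point, the sketch below (again Python; the row-wise partitioning, the worker count and the 200x200 size are just illustrative choices) shows how naturally matrix multiplication decomposes: each row of the product can be computed independently, so adding cores shortens the wall-clock time without changing the arithmetic.

```python
import concurrent.futures
import random

def multiply_row(args):
    # Compute one row of the product C = A x B. Rows are independent of one
    # another, which is what makes the problem easy to parallelize.
    row, b = args
    cols = len(b[0])
    return [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]

def parallel_matmul(a, b, workers=4):
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(multiply_row, ((row, b) for row in a)))

if __name__ == "__main__":
    n = 200  # illustrative size
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    c = parallel_matmul(a, b)
    print(len(c), len(c[0]))
```

A real implementation would avoid shipping the whole B matrix to every worker; the point here is only that the rows of the result are independent.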

Over the years, there have been efforts to classify computationally intractable problems. Many decision problems (i.e., Yes/No questions) and their optimization counterparts have been categorized into the NP-Complete and NP-Hard sets, respectively. The Travelling Salesman (TSP), Bin-Packing and 3-Dimensional Matching problems are three famous examples of NP-Complete problems. In a similar fashion, problems that are difficult to parallelize have been categorized into the P-Complete set, the set of problems widely believed to be inherently sequential. As you might imagine, sorting is not P-Complete. Likewise, matrix multiplication is not in the P-Complete set. Processor scheduling can be done in O(log n) time units using n processors, so it is not P-Complete either. In an ultimate twist of irony, many NP-Hard problems have heuristic solutions that can be executed in parallel to approximate the exact solutions. Hence, the natural inclination to think that NP-Complete problems cannot be parallelized is not borne out in practice.
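As one concrete example of that last observation, here is a sketch in Python (the random city coordinates and the nearest-neighbor heuristic are my own illustrative choices, not a reference to any particular solver) that runs a TSP heuristic from many different starting cities in parallel and keeps the best tour found; each run is independent, so the runs parallelize trivially:

```python
import concurrent.futures
import math
import random

# A made-up TSP instance: 200 random cities. Seeding keeps it reproducible
# (and identical in every worker process).
random.seed(1)
CITIES = [(random.random(), random.random()) for _ in range(200)]

def tour_length(tour):
    return sum(math.dist(CITIES[tour[i]], CITIES[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def nearest_neighbor_tour(start):
    # Greedy heuristic: repeatedly hop to the closest unvisited city.
    unvisited = set(range(len(CITIES))) - {start}
    tour = [start]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda c: math.dist(CITIES[last], CITIES[c]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

if __name__ == "__main__":
    # Each starting city is an independent run, so the runs parallelize trivially.
    with concurrent.futures.ProcessPoolExecutor() as pool:
        best = min(pool.map(nearest_neighbor_tour, range(len(CITIES))),
                   key=tour_length)
    print(f"best tour length found: {tour_length(best):.3f}")
```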

As it turns out, the real limit to parallelism seems to be defined not by Amdahl’s Law, Gunther’s Law, Gustafson’s Law or NP-Completeness, but by the P-Complete set. It appears that parallelizable problems are related to the asymptotic space complexity of their sequential solutions. According to Fortune and Wyllie’s Parallel Processing Thesis, any problem that can be solved with poly-logarithmic space complexity can be parallelized efficiently. Because of the time-space trade-offs in algorithms, this implies that the sequential algorithm achieving this space complexity is not necessarily the algorithm with the best asymptotic time complexity.

In any case, because one can evaluate problems at multiple levels beyond algorithms (e.g., at the instruction, memory and data access, loop and task levels), the set of problems that can be parallelized appears to be quite large. The question is how to identify and take advantage of the parallelization opportunities that may be inherently available, and how to do so in an efficient and scalable manner. How can we parallelize loops? How do we overlap high-latency activities, such as accesses to physical memory or I/O, to amortize their cost? How do we minimize synchronization? How do we partition tasks to eliminate bottlenecks on the critical path? How do we dispatch work efficiently to improve system utilization, throughput and latency? Which areas of our application can benefit from which efforts? These are some of the questions that lead to scalable designs.
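As a small illustration of the first of those questions, the sketch below (Python again; the chunk size and the loop body are arbitrary stand-ins) parallelizes a loop whose iterations are independent by partitioning the iteration space into chunks and handing each chunk to a worker:

```python
import concurrent.futures

def work(x):
    # Stand-in for an independent loop body with no cross-iteration dependencies.
    return x * x + 1

def process_chunk(chunk):
    return [work(x) for x in chunk]

def parallel_loop(data, chunk_size=10_000):
    # Partition the iteration space into chunks; chunking amortizes the cost
    # of dispatching work to the pool.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    results = []
    with concurrent.futures.ProcessPoolExecutor() as pool:
        for partial in pool.map(process_chunk, chunks):
            results.extend(partial)
    return results

if __name__ == "__main__":
    out = parallel_loop(list(range(100_000)))
    print(len(out), out[:3])
```

The chunk size is the knob that trades dispatch overhead against load balance.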

Today, the tools to identify parallelism and scalability opportunities are very limited. Programming languages that allow programmers to express parallelism in a natural way are lacking. The tools to analyze and troubleshoot parallel implementations are limited as well. Debugging parallel implementations is particularly hard. However, I suspect that with some industry focus and incremental progress, we could make parallelism accessible to the average programmer. Still, we are many years away.

What are some of the fundamental limits preventing such tools from being built? As Mark said in his blog, achieving improved scalability using parallel programming techniques is certainly very challenging. But can parallel programming be made less challenging, with intuitive tools that expose parallel solutions in a natural way and allow programmers to exploit them? Can programming languages and tools improve to the point where a typical 10-year-old will be able to write a parallel program as easily as he or she can put together a multi-track movie today?

Sunny Egbo