Parallel Scalability Isn’t Child’s Play


In a recent blog entry, Dr. Neil Gunther, a colleague from the Computer Measurement Group (CMG), warned about unrealistic expectations being raised with regard to the performance of parallel programs on current multi-core hardware. Neil’s blog entry highlighted a dismal parallel programming experience publicized in a recent press release from the Sandia Labs in Albuquerque, New Mexico. Sandia Labs is a research facility operated by the U.S. Department of Energy.


According to the press release, scientists at Sandia Labs


“simulated key algorithms for deriving knowledge from large data sets. The simulations show a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added.” They concluded that this retrograde speed-up was due to deficiencies in “memory bandwidth as well as contention between processors over the memory bus available to each processor.”


Holy cow. The Lab’s scientists, who are heavily invested in parallel programming on supercomputers, simulated running programs encapsulating “key algorithms for deriving knowledge from large data sets” on sixteen cores and got no better performance than running the same programs on two cores.


Please note that these are simulated performance results, because 16-core machines of the type being simulated don’t currently exist. Indeed, I would not expect machines of that type ever to be built. Which leads me to wonder what the point of this Sandia Labs exercise was.


Of course, for developers experienced in parallel programming, this result isn’t in itself all that surprising. Quite frequently, experienced developers find that their multi-threaded applications do not scale well with the capabilities of massively parallel hardware. This was apparently the case for the applications the Sandia Labs folks simulated. So what? Should we just give up in our quest for parallel program scalability?


Before drilling into Dr. Gunther’s specific interest in this disclosure, it is worth looking into the Sandia Labs finding in a bit more detail. For instance, did anyone, besides me, wonder what applications were being simulated?


In theory, “deriving knowledge from large data sets” is a category of computing program that readily lends itself to a solution using an SIMD (Single Instruction, Multiple Data) approach. The canonical example of an SIMD approach to “deriving knowledge from large data sets” is a database search conducted in parallel, where the data set of interest is partitioned across n processing units and their locally attached disks. For example, when the Thinking Machines CM-1 supercomputer publicly debuted in the mid-80s, the company demonstrated its capabilities using a parallel search of a database that was partitioned across all 64K nodes of the machine, which implemented the Connection Machine architecture originally designed by MIT whiz kid Danny Hillis. Parallel search executed across a partitioned data set should scale linearly, or close enough for government work (pun intended).


Whenever a problem lends itself to an SIMD approach (also known as “divide and conquer”), achieving linear scalability requires first partitioning the data being accessed and then processing those partitions in parallel. I am sure the point of the Sandia Labs press release was not to disparage the SIMD approach to parallel processing; after all, that is a tried-and-true technique that they have used with great success over the years.
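
To make the partition-then-process pattern concrete, here is a minimal sketch of a partitioned parallel search, written in Java purely for illustration (the class and method names are my own, and a real search of a large data set would scan partitioned disks rather than an in-memory array). Each worker thread scans only its own slice of the data, and the per-partition hits are combined at the end; when the data, and the memory and disk paths feeding it, are partitioned along with the work, the scan speeds up roughly in proportion to the number of partitions.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class PartitionedSearch {
        // Scan 'data' for 'target' using one worker thread per partition.
        public static List<Integer> search(int[] data, int target, int partitions)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(partitions);
            List<Future<List<Integer>>> futures = new ArrayList<>();
            int chunk = (data.length + partitions - 1) / partitions;

            for (int p = 0; p < partitions; p++) {
                final int start = p * chunk;
                final int end = Math.min(start + chunk, data.length);
                // Each task touches only its own partition; no shared, mutable state.
                futures.add(pool.submit(() -> {
                    List<Integer> hits = new ArrayList<>();
                    for (int i = start; i < end; i++) {
                        if (data[i] == target) hits.add(i);
                    }
                    return hits;
                }));
            }

            List<Integer> results = new ArrayList<>();
            for (Future<List<Integer>> f : futures) {
                results.addAll(f.get());   // combine the per-partition results
            }
            pool.shutdown();
            return results;
        }
    }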


On the contrary, it appears to be a critique of an approach to building parallel processing hardware where you would increase the number of processing cores on the chip (just because you can with the most current semiconductor fabrication technology) without scaling the memory bandwidth proportionally. Since that is not what is happening hardware-wise, it strikes me that this implied criticism of the multi-core hardware strategy Intel and AMD are pursuing is slaying a non-existent dragon. Both Intel and AMD recognize that memory bus bandwidth is a significant potential bottleneck in their multi-core products, and, as a result, both manufacturers are attempting to scale memory bandwidth in proportion to the amount of processing power they deliver on a chip.


So then what is all the fuss about? The Sandia Labs “news” starts to look like something the blogosphere is latching onto on an otherwise slow day for tech news, raising an alarm and potentially misleading naïve readers about the conventional wisdom in multiprocessor chip architecture, as if anyone were actually trying to build multi-core microprocessors that way.


Building a better multicore processor.


The point of the Sandia Labs press release publicizing these simulation results appears to be to suggest what they consider better approaches to packaging multi-core processors on a single socket. They released the following chart that makes this point (reproduced here in Figure 1):


[Figure 1 chart: Sandia Labs multicore simulation results]

Figure 1. Sandia Labs simulation showing performance of their application vs. the number of processors.


Exactly what the Sandia Labs folks are reporting here is a little sketchy. Presumably, the simulations are based on observing the behavior of some of their key programs whose performance they were able to measure running on “conventional” multi-core processors, perhaps something like the quad-core machine I recently installed for my desktop, which uses a memory bus with bandwidth in the range of 10 GB/sec. The press release seems to imply that the Sandia Labs baseline measurements were taken on current quad-core machines from Intel like mine, not the newer Nehalem processors where the memory architecture has been re-worked extensively. How useful or meaningful the published results are may turn on this crucial point.


The Sandia Labs simulation then extrapolates out to 16 cores per socket (and beyond), simulating the manufacturer adding more cores to the die while apparently leaving the memory architecture fundamentally unchanged. The Sandia Labs chart in Figure 1 is labeled to indicate that the memory bandwidth was held constant at 10 GB/sec.
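
To see why a curve like the one in Figure 1 falls out of that assumption, consider a back-of-the-envelope model. This is my own simplification, not the Sandia simulation, and the per-core bandwidth demand is a made-up number: throughput grows with the core count only until the cores' aggregate demand saturates the fixed 10 GB/sec bus, and past that point each additional core adds contention without adding capacity.

    public class BandwidthBoundSpeedup {
        public static void main(String[] args) {
            double busBandwidth = 10.0;  // GB/sec, held constant as in the Sandia chart
            double perCoreDemand = 2.5;  // GB/sec each core needs (a hypothetical figure)
            double saturation = busBandwidth / perCoreDemand;  // cores that fully use the bus

            for (int cores = 1; cores <= 32; cores *= 2) {
                // Speedup is capped by whichever runs out first: cores or memory bandwidth.
                double speedup = Math.min(cores, saturation);
                if (cores > saturation) {
                    // Crude contention penalty: the further past saturation, the worse it gets.
                    speedup = speedup / (1.0 + 0.05 * (cores - saturation));
                }
                System.out.printf("%2d cores -> modeled speedup %.2f%n", cores, speedup);
            }
        }
    }

With these made-up numbers, the modeled speedup peaks at four cores and declines from there, which is essentially the shape of the Sandia Labs chart: the chart is largely a consequence of the fixed-bandwidth assumption.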


This is more than a little suspicious. Hardware manufacturers like Intel and AMD understand clearly that the memory bus has to scale with the number of processors. The AMD HyperTransport bus architecture is quite explicit about this, and the latest spec for HyperTransport version 3.1 has an aggregate bandwidth in excess of 50 GB/sec.


Meanwhile, the memory bus capacity on the latest Nehalem-class processors from Intel has been boosted significantly. And it is precisely when you cannot scale the memory bus with processor capacity that machines with NUMA architectures become more attractive. AMD processors use HyperTransport links in a ring topology, which implicitly leads to NUMA characteristics.


In Intel’s approach to NUMA scalability, some small number of processors share a common memory bus, forming a node. Current Nehalem machines (also known as the Core i7 architecture) have four cores sharing the Front-side memory bus (FSB). The physical layout of this chip is shown in the photograph in Figure 2: four cores connected to DDR3 DRAM through an integrated memory controller. I wasn’t able to find a speed rating for the FSB in the Nehalem on Intel’s web site or elsewhere, other than ballpark estimates that put its speed in the range of 30-40 GB/sec. The QuickConnect technology links used to connect memory controllers support 25 GB/sec transfers, which is probably a safe lower bound on the capacity of the FSB.


[Figure 2 photo: Core i7 4-way multiprocessor]


Figure 2. Aerial photograph showing the layout of the 4-way Core i7 (Nehalem) microprocessor chip.


The PIM architecture, whose scalability curves are close to ideal for the Sandia Labs workloads, is, probably not coincidentally, a processor architecture championed at Sandia Labs. The idea behind PIM (Processor In Memory) machines is that the processor (or processors) is embedded into the memory chip itself, which is a pretty interesting approach to solving the “memory wall” problem that limits performance in today’s dominant computer architectures. Instead of loading up the microprocessor socket with more and more cores, which is the professed hardware roadmap at Intel and AMD, integrating memory into the socket is an intriguing alternative. Such machines, if anyone were to build them, would obviously have NUMA performance characteristics.


The debate is a bit academic for my taste, however, until these PIM machines are a reality. For PIM machines to ever get traction, either the microprocessor manufacturers would have to start building DRAM chips or the DRAM manufacturers would have to start building microprocessors. The way the semiconductor fabrication business is stratified today, that does not appear to be very likely in the near future.


So, in the end, the point of the Sandia Labs press release appears to be to publicize the multiprocessor hardware direction espoused mainly by Sandia Labs’ own researchers. Frankly, there have been lots and lots of different architectural approaches to parallel processing over the years, and it doesn’t look like any one approach is optimal for all computing situations. Pick a different parallel programming workload to simulate in Figure 1 and you would likely get an entirely different ranking of the approaches.


Still, the Sandia Labs simulation data are interesting mainly for what they say about how difficult it is going to be for developers to write parallel programs that scale well on multi-core machines. No, achieving parallel scalability isn’t child’s play for hardware manufacturers. Nor is it for software developers attempting to take advantage of parallel processing hardware, which is the subject I will start to drill into next time.


Continue to Part 2…..

Comments (9)

  1. Sunny says:

    There is no FSB in Nehalem processor based platforms. Nehalem uses Intel QuickPath Interconnect technology to connect multiple processors and the I/O hub. See http://www.intel.com/technology/quickpath/index.htm for more information.

  2. Antonio Drusin says:

    I think there is at least one (semi) PIM machine out there: the Cell processor by IBM/Toshiba/Sony. It has 256KB of RAM attached to each processing unit and uses explicit DMA to push/pull info from external memory. It seems to perform well.

  3. Part 1 of this series of blog entries discussed results from simulating the performance of a massively

  4. Jibey Jacob says:

    Don’t most Web servers theoretically host massively parallel apps? Each HTTP request is processed by spinning off a thread, and HTTP is ideal for this since it’s stateless. So, according to this article, IIS wouldn’t perform exponentially well going from two to four cores and higher:

    http://blogs.msdn.com/ddperf/archive/2009/03/16/parallel-scalability-isn-t-child-s-play.aspx

  5. Jibey Jacob says:

    Allow me to refine what I’ve said. Since this article is suggesting that the problem is with multicore hardware, all claims that have been made about all these Web servers being massively scalable are false.

  6. Excellent question/comment.

    I wasn’t particularly headed in that direction originally. But I do like the direction, and I promise I will address it in more detail at some point down the road before I get to the end of this topic. Web-site scalability isn’t always child’s play either. Reviewing published TPC benchmark results is very instructive in this regard; see http://www.tpc.org for details. Note how the price per request tends to increase as you run the benchmarks on machines with more and more processors.

    As you suggest, the characteristics of most Web-based applications do, in fact, lend themselves to parallel processing. The stateless nature of the HTTP protocol certainly promotes this. HTTP requests can normally be processed in parallel by asynchronous agents, which in ASP.NET are dispatched and processed by individual worker threads from a thread pool. Transaction processing workloads, in general, share these characteristics. As you continue through the blog entries in this series, you will see that I do start to drill in on the performance characteristics of the .NET thread pool.
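
    (To make that dispatch model concrete, here is a minimal sketch, written in Java rather than ASP.NET and with names of my own invention: each incoming request is handed to a worker thread from a shared pool and handled independently of every other request, which is what makes this kind of workload naturally parallel.)

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        // Minimal sketch: stateless requests dispatched to a shared worker thread pool.
        public class RequestDispatcher {
            private final ExecutorService workers = Executors.newFixedThreadPool(8);

            public void onRequest(String request) {
                // Each request is handled on its own worker thread; no state is shared.
                workers.submit(() -> handle(request));
            }

            private void handle(String request) {
                System.out.println("handled " + request + " on "
                        + Thread.currentThread().getName());
            }
        }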

    In the context of this set of postings, however, I have been avoiding transaction processing workloads generally. I am focusing mainly on the parallel programming initiatives here that are designed to give developers better programming abstractions and models for developing concurrent programs. HTTP request processing already has an inherent concurrent programming model that you are correct in thinking software developers have had no difficulty in adopting.

    Some approaches to parallel programming that I regard very highly try to apply a transaction processing pattern to other types of computing problems. This can be traced at least as far back as Tony Hoare’s "communicating sequential processes" paradigm. A recent example is Rick Molloy’s MSDN article entitled "Solving The Dining Philosophers Problem With Asynchronous Agents" available at http://msdn.microsoft.com/en-us/magazine/dd882512.aspx.

    I am a big proponent of this approach. For example, there was a Microsoft Research project called Singularity (which you can read about here: http://msdn.microsoft.com/en-us/magazine/cc163603.aspx) that relies on message passing between asynchronous agents as the sole method of interprocess communication in the OS. That is a great idea, I believe.
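
    (Here is a minimal sketch of that message-passing style, in Java and using a plain blocking queue as the channel rather than Singularity's actual channel mechanism: the two agents share no data and interact only by exchanging messages.)

        import java.util.concurrent.ArrayBlockingQueue;
        import java.util.concurrent.BlockingQueue;

        // Two agents that communicate only by passing messages over a bounded queue.
        public class MessagePassingAgents {
            public static void main(String[] args) throws InterruptedException {
                BlockingQueue<String> channel = new ArrayBlockingQueue<>(16);

                Thread producer = new Thread(() -> {
                    try {
                        for (int i = 0; i < 5; i++) channel.put("work item " + i);
                        channel.put("DONE");  // sentinel message ends the conversation
                    } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                });

                Thread consumer = new Thread(() -> {
                    try {
                        String msg;
                        while (!(msg = channel.take()).equals("DONE")) {
                            System.out.println("processed: " + msg);
                        }
                    } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                });

                producer.start();
                consumer.start();
                producer.join();
                consumer.join();
            }
        }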

    There is one caveat to be aware of, though, when considering ASP.NET workloads. For a number of reasons, not all ASP.NET request processing conforms to the pure, connectionless and stateless model for HTTP requests. Most transaction processing web sites maintain at least some session-oriented state in order to improve the user experience as users navigate the site. (A good example from commercial transaction processing web sites is associating each user request with a specific, retained shopping cart.)

    Moreover, to improve performance, many ASP.NET applications rely on caching, where frequently re-used state is retained across requests. (A decent, although unfortunately not particularly current, overview of the ASP.NET caching mechanisms is here: http://msdn.microsoft.com/en-us/library/aa478965.aspx#aspnet-cachingtechniquesbestpract_topic4.) In ASP.NET, View state, Session state, the Page cache, and the Application cache are all mechanisms for retaining state across requests that larger web sites rely on to improve performance and scalability. All these caching mechanisms have significant performance and scalability considerations. ASP.NET also supports standard state-retention mechanisms that are incorporated into the HTTP protocol, such as cookies and query strings. IIS itself supplies additional caching mechanisms. And, finally, there is a Microsoft distributed caching technology known as Velocity (see http://msdn.microsoft.com/en-us/library/cc645013.aspx and http://code.msdn.microsoft.com/velocity for details) that can be used across clustered ASP.NET web applications. Maintaining cache coherence in the ASP.NET Application-level cache, for example, has many of the same issues around parallel programming and coherence-related delays that I am highlighting here.
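
    (As a rough illustration of why caching complicates the picture, here is a minimal sketch in Java, not the actual ASP.NET Cache API: once frequently re-used state is retained across requests in a structure that every worker thread consults, that structure becomes shared state, and keeping it coherent and uncontended is exactly the kind of parallel programming problem I am highlighting in this series.)

        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        // Minimal sketch of an application-level cache shared by all request threads.
        public class SharedAppCache {
            private final Map<String, String> cache = new ConcurrentHashMap<>();

            // Every worker thread consults (and populates) the same map, so the cache
            // itself is shared state that must be kept coherent across requests.
            public String lookup(String key) {
                return cache.computeIfAbsent(key, k -> loadFromBackingStore(k));
            }

            private String loadFromBackingStore(String key) {
                // Stand-in for a database query or an expensive computation.
                return "value-for-" + key;
            }
        }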

    Thanks for the comment, and stay tuned for more on this topic.

    — Mark Friedman

  7. To your first point, transaction processing workloads (which include web-oriented request processing) are inherently parallel. (But they can still block; there are lots of gotchas.)

    To your second comment, not exactly the way I would phrase it, but I am cynical, too, about any general claim to massive scalability. In the immortal words of Ronald Reagan, "Trust, but verify."

    It is worth noting that shared-nothing clustered approaches to massively parallel transaction processing have achieved some impressive results, scaling nearly linearly up to 1,000 machines, for example. We are generally trying to achieve something similar today on more general purpose (i.e., commodity) hardware, which is a challenge. Scalability can be achieved, but it requires a long, tough push to get there on both the hardware and software sides. That holds for web applications, databases, HPC (e.g., OpenMP), etc. Minimizing shared state is crucial to achieving good scalability in all these application domains.

  8. Jibey Jacob says:

    ASP.NET adds a lot of functionality on top of HTTP to make Web programming easier, and it has been my experience that state management in ASP.NET does not scale very well. Cookies and query strings in HTTP do not in any way impose performance penalties like those imposed by Session State, View State, etc. in ASP.NET.

    While the new APIs introduced in Windows Server 2003 for applications to take advantage of NUMA architectures are useful, well-written multithreaded code that simply uses primitive thread synchronization mechanisms should also scale well on NUMA architectures, since the Windows scheduler is intelligent enough to give those apps the advantages offered by NUMA architectures.

  9. Jibey Jacob says:

    And there’s a real problem generally in any computing network that uses Ethernet that I haven’t seen addressed. Ethernet’s global hierarchical model and the use of Ethernet device drivers in the kernel with full privileges opens up the kernel to all kinds of attacks. Multithreaded programming would be impossible on such networks. I’ve sworn to never again work for any employer unless:

    1. Everything attached to the Internet around the planet is upgraded to IPv6. This implies use of InfiniBand at Layer 2, in lieu of Ethernet.

    2. The Internet is enhanced with the logic to biometrically authenticate human users and determine that the person being authenticated cannot be at more than one place at the same time. This requires the development of a new routing protocol that enhances IPv6 and uses GPS-based addressing in addition to the functionality in IPv6 addressing.

    3. WiMAX is rolled out around the planet so I can use a VoIP phone on this newly enhanced Internet wherever I go. My phone must be intelligent enough to tell me whether any phone number is on a PSTN, a cell network, or if it’s a VoIP number, and I wouldn’t want to deal with anyone on a PSTN or cell network.

    4. When I do start working again, I wouldn’t want any device drivers running in the Windows kernel other than signed drivers from Microsoft. All other drivers will have to be user-mode drivers signed by their respective developer companies and certified by Microsoft WHQL.

    5. All monies that are due to me must be paid to me ASAP.

    Without all of this in place, I just can’t be sure that I’m getting paid for my work.