Lessons from the test lab: investigating a pleasant surprise


This post describes our recent investigation into an interesting performance problem: benchmarks that we were surprised to find running significantly faster than we expected on new hardware. Along the way we discuss useful benchmarking tools, how to validate results, and why it pays to know exactly what hardware you’re running on.

This all started in our performance test lab. During the development of Visual Studio, each new build undergoes a suite of automated performance tests, running in a lab full of identical machines. These performance tests allow us to track Visual Studio’s performance over time, and detect performance regressions (when something gets unexpectedly worse). We recently added a batch of new machines to our lab, and that’s when the fun started.

Pop Quiz: How Much Faster?

Old machine: dual-core Intel Pentium D 830 processor, running at 3 GHz, with 1 GB of RAM.

New machine: quad-core Intel Xeon X5355 processor, running at 2.66 GHz, with 4 GB of RAM.

Given the differences in the two hardware configurations above, how much faster would you expect the new machine to be when running a Visual Studio performance test? Lower than, same as, twice, three times or four times the performance of the older machine?

One line of reasoning might look at the relative clock frequencies of the processors on the two machines. This might lead you to expect the newer processor cores to perform slower than the older cores, since their clock frequency is 11% lower. By this reasoning you might conclude that single-threaded applications would perform poorly on the new machine.

Another line of reasoning would factor in the number of cores in the two systems. Since the new machine has twice the number of cores, you might expect it to have about twice the performance on multi-threaded applications. (If you also accounted for the lower clock frequency, you’d end up with a figure of 1.78 times the performance of the old machine.)
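
The arithmetic behind these first two guesses can be sketched in a few lines of Python. This is only an illustration of the naive model, not part of our test harness: the function names are made up for the example, and the only inputs are the clock speeds and core counts quoted above.

```python
# Naive performance estimates from the pop quiz (illustrative only).
# Inputs are the clock speeds and core counts quoted in this post;
# the function names are hypothetical.

OLD_GHZ, OLD_CORES = 3.00, 2   # old machine
NEW_GHZ, NEW_CORES = 2.66, 4   # new machine


def clock_ratio(new_ghz: float, old_ghz: float) -> float:
    """First line of reasoning: performance tracks clock frequency."""
    return new_ghz / old_ghz


def clock_and_core_ratio(new_ghz, new_cores, old_ghz, old_cores):
    """Second line of reasoning: also scale linearly with core count."""
    return clock_ratio(new_ghz, old_ghz) * (new_cores / old_cores)


single = clock_ratio(NEW_GHZ, OLD_GHZ)
multi = clock_and_core_ratio(NEW_GHZ, NEW_CORES, OLD_GHZ, OLD_CORES)

print(f"single-threaded estimate: {single:.2f}x ({(1 - single) * 100:.0f}% slower)")
print(f"multi-threaded estimate:  {multi:.2f}x")
```

Doubling the rounded clock ratio of 0.89 gives the 1.78 figure quoted above; with unrounded inputs the estimate comes out a hair lower.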

A third approach might estimate the impact of RAM size. We’ve quadrupled the amount of RAM, so maybe any benchmarks that used to page to disk can now execute entirely in memory and hence will be orders of magnitude faster. (We’ll cheat here and tell you that our benchmarks are generally not memory constrained.)

So far, all these options seem plausible. What’s your guess?

What we naively expected to find lay somewhere between the first two lines of reasoning – that the new machines would be about one to two times as fast as the old machines, depending on the particular benchmark.

What we actually found is that many of our single-threaded CPU-bound benchmarks run about twice as fast on the new machine, while scalable multi-threaded benchmarks run up to four times as fast. This was a pleasant surprise, because it significantly reduces the overall time to run all the benchmarks. But it did leave us wondering why we were getting much greater speedups than our naive explanations would suggest. The rest of this post explores that question.

Using WinSAT and SPEC to Validate Benchmark Results

To make sure this wasn’t a fluke result, we used the Windows System Assessment Tool (winsat.exe). This is a built-in tool that can quickly give a representative view of a machine’s performance. It is multi-threaded, taking full advantage of all the cores on a machine. Here are the WinSAT CPU results:

Benchmark                   Old Machine   New Machine   Speedup
CPU – Compression (MB/s)    70.5          262.0         3.7
CPU – Encryption (MB/s)     52.3          139.3         2.7

We also wanted to validate our results against other real-world benchmarks. For this we turned to the SPEC website. SPEC produces a series of benchmark suites, plus a very formal process that ensures results are reproducible and can fairly be applied across different manufacturers. More importantly for our purposes, SPEC posts all reported benchmark results on its website. You won’t always be able to find your exact machine listed, but using the output of a tool like CPU-Z you can generally find results from a machine with the same CPU configuration and clock speed.

We used the "CINT2006" benchmarks – this is a widely-used benchmark suite concentrating on integer performance. We compared results for both CINT2006, which is a good test of single-threaded performance, and CINT2006 Rate, which tests the ability of a system to execute multiple copies of CINT2006, and is therefore a better test of multi-threaded performance. For two representative machines that are similar to our old and new hardware, here are the results:

Benchmark        Old Machine   New Machine   Speedup
CINT2006         9.85          15.5          1.6
CINT2006 Rate    18.0          44.4          2.5

The WinSAT and SPEC results confirm that the new machines are much faster than our naive expectations, even for benchmarks such as CINT2006 that cannot take advantage of the extra cores. So what were we missing?

Using CPU-Z to Examine Machine Configurations

To answer this, we need a deeper understanding of the configurations of the two systems.

Unfortunately, finding detailed configuration information isn’t always straightforward. For example, we know that level two (L2) cache size impacts performance, but Windows doesn’t report it, and it’s not easy to reboot into the BIOS to take a look at cache size when the machine is located in a remote test lab. This is where machine reporting tools like CPU-Z come in. You can run CPU-Z remotely on an unknown machine and get back a nicely formatted HTML report showing exactly what the hardware is. Here’s a deeper look at our old and new systems:

Feature                Old Machine                     New Machine
CPU name               Pentium D 830 (“Smithfield”)    Xeon X5355 (“Clovertown”)
CPU speed              3.00 GHz                        2.66 GHz
Number of cores        2                               4
L1 cache (per core)    16 KB                           32 KB
L2 cache (total)       2 MB                            8 MB
System RAM             1 GB DDR2                       4 GB DDR2

Using BCDEdit to Disable Cores

Now we can try to tease out the relative impacts of the many changes from the old configuration to the new configuration. The first and easiest step is to disable two out of four cores on a new machine, to enable a fairer "apples to apples" comparison of cores between old and new machines.

To do this we used the Windows BCDEdit tool, which replaces the old method of editing BOOT.INI by hand. Here we were particularly concerned with the order in which cores are disabled. This is important because the 8 MB of L2 cache in the Xeon “Clovertown” processors is divided: two of the four cores share 4 MB, and the other two cores share the other 4 MB. To keep our benchmark comparisons as fair as possible, we wanted to make sure that only one of the L2 caches was in use after disabling two cores. We used CPU-Z again after rebooting to confirm this.
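
For readers who want to try a similar experiment, here is a rough Python sketch of scripting the core-count change. The exact BCDEdit options aren't recorded in this post, so treat everything here as an assumption: the sketch uses BCDEdit's numproc boot option, which caps the number of processors Windows uses at the next boot, and the helper names are invented for the example. Note that numproc only limits the count and doesn't let you pick which cores stay enabled, so checking the cache topology with a tool like CPU-Z after rebooting is still necessary.

```python
import subprocess
import sys


def numproc_command(cores: int) -> list[str]:
    """Build a BCDEdit command that limits Windows to `cores` processors.

    Assumes the `numproc` boot option applied to the {current} boot entry;
    the change takes effect only after a reboot.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return ["bcdedit", "/set", "{current}", "numproc", str(cores)]


def restore_command() -> list[str]:
    """Build the command that removes the processor limit again."""
    return ["bcdedit", "/deletevalue", "{current}", "numproc"]


def run(command: list[str], dry_run: bool = True) -> None:
    """Print the command; execute it only on Windows when dry_run is False.

    Running BCDEdit for real needs an elevated (administrator) prompt.
    """
    print(" ".join(command))
    if not dry_run and sys.platform == "win32":
        subprocess.run(command, check=True)


run(numproc_command(2))   # limit to two cores at the next boot
run(restore_command())    # undo the limit
```

The dry-run default means the sketch only prints the commands unless you explicitly opt in, which seems a sensible precaution for anything that edits boot configuration.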

Now we were in a position to do a fairer “cores to cores” comparison between the old and new machines. Here’s a summary from WinSAT:

Benchmark                   Old Machine   New (2 cores)   Speedup
CPU – Compression (MB/s)    70.5          131.9           1.9
CPU – Encryption (MB/s)     52.3          69.7            1.3
Memory Bandwidth (MB/s)     4,041         3,360           0.8

Now we can really see the advantage of the latest processors – on a core-for-core basis, they are 1.3-1.9x faster on the CPU-intensive WinSAT benchmarks, despite having lower clock frequencies.
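
One way to read these numbers is to factor out the clock-speed difference and see how much work each core does per clock cycle. The sketch below does that arithmetic using only the WinSAT figures from the tables in this post. The dictionary layout and function name are made up for illustration, and "per-clock" here is simply throughput divided by frequency, not a measured instructions-per-cycle count.

```python
# Core-for-core comparison, using WinSAT throughput (MB/s) from this post.
OLD_GHZ, NEW_GHZ = 3.00, 2.66

winsat = {
    # benchmark: (old machine, new machine with 2 cores, new machine with 4 cores)
    "compression": (70.5, 131.9, 262.0),
    "encryption": (52.3, 69.7, 139.3),
}


def analyze(old, new_2core, new_4core):
    core_for_core = new_2core / old                 # same core count on both
    per_clock = core_for_core * OLD_GHZ / NEW_GHZ   # normalize by clock speed
    core_scaling = new_4core / new_2core            # gain from 2 -> 4 cores
    return core_for_core, per_clock, core_scaling


for name, figures in winsat.items():
    c4c, per_clock, scaling = analyze(*figures)
    print(f"{name:12s} core-for-core {c4c:.2f}x, "
          f"per-clock {per_clock:.2f}x, 2->4 cores {scaling:.2f}x")
```

On these figures the new cores do roughly 1.5 to 2.1 times as much work per clock, and going from two to four cores almost exactly doubles WinSAT throughput.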

Good, now on to the next… wait a second. Look at that memory bandwidth result. Our new machines have less memory bandwidth than the old machines? That doesn’t look right: although memory performance hasn’t been keeping pace with CPU speeds, it has been improving over time. Compared to a three-year-old machine, we’d expect these new machines to have slightly better memory bandwidth, and definitely not worse. What gives?

Memory Channels

A primary limit on memory bandwidth is the number of memory channels in use, and that turns out to be the problem here. Although the new machines have four memory channels and eight memory slots, only two of those slots are filled, because the vendor supplied us with two 2 GB memory modules per machine. This maximizes future expansion potential – we can take the machine up to 16 GB without throwing away any of our initial investment in memory. But in the meantime, using two memory slots limits us to two memory channels. If instead we had four 1 GB memory modules, we’d have four memory channels in use, improving memory interleaving from 2:1 to 4:1 and increasing memory bandwidth. To confirm this, we populated four memory slots on one of the new machines (going from 4 GB to 8 GB) and reran WinSAT:

Benchmark                  2 channels   4 channels   Speedup
Memory Bandwidth (MB/s)    3,360        4,134        1.2
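
To see why channel count matters, here is a deliberately simplified toy model of interleaving, in which consecutive cache lines rotate round-robin across the available channels. Real memory controllers use more elaborate address mappings, and the 64-byte line size and the function names are assumptions made for this sketch, so it illustrates the idea rather than describing our hardware.

```python
from collections import Counter

LINE_BYTES = 64  # assumed cache-line size for this toy model


def channel_for(address: int, channels: int) -> int:
    """Toy interleave: consecutive cache lines rotate across channels."""
    return (address // LINE_BYTES) % channels


def lines_per_channel(num_lines: int, channels: int) -> Counter:
    """Count how many of `num_lines` sequential cache lines hit each channel."""
    return Counter(channel_for(i * LINE_BYTES, channels) for i in range(num_lines))


for channels in (2, 4):
    load = lines_per_channel(1024, channels)
    print(f"{channels} channels: {max(load.values())} lines on the busiest channel")
```

In this idealized model the busiest channel handles half as many lines with four channels as with two, yet the measured gain above was only about 1.2x. Channel count is evidently just one of several limits on real memory bandwidth.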

Conclusions

It’s always possible to run more experiments to further isolate and explain benchmark results, but after a while you reach a point of diminishing returns. With the results we have so far, we can already draw some useful conclusions.

The first conclusion is that our naive explanations greatly underestimated just how much better the newer processors are at executing real benchmarks, despite their slower clock speeds. The results from WinSAT and SPEC clearly show this, with core-to-core performance that is 1.3-1.9x faster on the new machines, depending on the benchmark.

This is perhaps the most important lesson for developers to learn: clock speeds are no longer a good indicator of true performance. Although clock speeds have plateaued, processor designers continue to find ways to make each new generation significantly faster than the last. In our case, the old machines have Pentium D processors (“Smithfield”), while the new machines have Xeon 5-series processors (“Clovertown”). And while the newer processors have slightly slower clock speeds, their micro-architecture executes more instructions per clock cycle.

The second conclusion is that it’s very hard to perform fair comparisons. The two machines have several configuration differences, including clock frequency, number of cores, core micro-architecture, cache sizes, bus speed, memory size and speed, and so on. We showed an example of isolating the effect of just one of these differences, the number of cores, using the BCDEdit tool. Isolating the effect of every single difference would require much more effort.

Indeed, some of these differences are interrelated, and it is hard to change one without affecting another. For example, CPU architects make their micro-architecture design decisions based on cache sizes. Now imagine a hypothetical experiment that tried to isolate the effect of L2 cache size by giving each core just 1 MB of cache. This would be especially hard on the newer processors, which have been designed on the assumption that they have 2 MB of L2 cache per core[1]. In trying to perform a fairer comparison, we would have actually handicapped one system!

Our final conclusion is that it truly pays to benchmark and compare systems. In our case, the simplest possible benchmark (WinSAT) showed an unexpected memory bandwidth loss, which we then traced back to a machine mis-configuration. So that was the final pleasant surprise: if we hadn’t gotten curious about why the new machines were so much faster, we would never have found that they could be faster still!

David Berg
Sunny Egbo
Jonathan Hardwick
Peter Okonski


[1] Because two cores share a single 4 MB L2 cache on the Clovertown processors, the exact size of the cache that is used by each core is not fixed at 2 MB per core; the use will vary during program execution. Cache-hungry threads might get more of the cache, while less cache-hungry threads get less. Even when two cache-hungry threads run on the two cores, their memory hotspots are asynchronous; thus, the net effect is that each thread gets more of the cache when it needs it and less when it doesn’t.

Comments (8)

  1. Even though I’ve been doing general architecture work on Visual Studio for nearly a year now, my friends

  2. Antonio D says:

    To explain the memory bandwidth difference…

    The Pentium 4 (830D) was a more memory hungry architecture. Risking a gross oversimplification I would say that the 830 D aggressively transfers memory into its cache, even if it ends up not using it. So the bandwidth measured by a benchmark in a best case scenario could be much higher than the actual bandwidth that your software can use.

    About the amount of cache used by a thread.

    I was under the impression that Windows scheduled threads on different cores. i.e. thread x is not guaranteed and is not going to run always on the first core of the processor. So if you have two concurrent threads, they are probably running both on both (or all four) cores. In this case the amount of cache used by each thread would be undetermined (except by further analysis).

    My understanding is that the part where you say that thread gets more of the cache when it needs it is true regardless of the number of cores or whether these cores are sharing a cache or not. What am I missing?

  3. MarkBFriedman says:

    Antonio:

    It looks like we should have been a little clearer about what we meant when we used the word "thread." Sorry about that. (Reminds me of the famous words of a former US President and semiotician, "It depends on what the meaning of the word "is" is.")

    From the standpoint of the OS, the thread is the dispatchable unit. From the standpoint of the CPU, a thread is any set of executing instructions that aren’t executing an Idle loop. There are software and hardware guys collaborating on this post. (It may be unusual, but they do get along — most days.) And while we knew what we meant, it seems we used "thread" without clearly distinguishing the two meanings and contexts.

    Knowing the author and his tendencies, my guess is that the footnote about threads sharing the cache was written from the hardware perspective.

    On the new lab machines, there is a dedicated L1 cache for each processor core, and a shared L2 cache that each processor core can access. The L2 cache is dynamically allocated. If CPU A is idle and CPU B is cranking, CPU B is capable of allocating the entire L2 cache. (If you don’t expect the CPUs sharing the cache to all be cranking all of the time, this is probably a good approach.) I hope that clarifies the point.

    Of course, I am a software guy, so, from the Windows point of view, let me also try to clarify your thread dispatching question:

    It is true that "thread x is not guaranteed and is not going to run always on the first core of the processor." Having said that, however, the statement that follows isn’t entirely true: "So if you have two concurrent threads, they are probably running both on both (or all four) cores."

    Yes and no. On a symmetric multiprocessor (SMP), a thread by default tends to be a bit sticky to the processor it was last dispatched on. This is called "soft affinity" and is done to increase the probability that a cache warm start will occur. This stickiness is especially noticeable when the processors are lightly loaded. The stickiness is also prominent in the WinSAT benchmarks described here that run single threaded and were run in isolation.

    But, in general, you are correct and you often observe threads switching back and forth between available processors. Because thread scheduling is priority-based with preemptive scheduling, and User mode threads are typically subject to dynamic adjustments, once the machine is loaded, threads will usually wander (somewhat randomly) from CPU to CPU.

    The SMP soft affinity scheduling algorithm is roughly as follows: A waiting thread that transitions to the Ready state has an "ideal processor" where it will run if that processor is currently idle or running a lower priority thread. If the ideal (i.e., last) processor is busy or running a higher priority thread, but another processor is idle or running a lower priority thread, the ready thread will be scheduled there. This is the preemptive scheduling bit — the highest priority Ready threads are always dispatched.

    You will find more details in my Windows Server 2003 Resource Kit "Performance Guide" book: the priority scheme, hard processor affinity, etc. The ntttcp program discussed in my recent "Mainstream NUMA and the TCP/IP stack" post used hard processor affinity, for example, to ensure that all network processing was confined to a single CPU. Which was why I only showed what was happening on the one CPU. Hard affinity is the exception, though, not the rule.

    Once you move to a NUMA architecture — see my earlier blog posts on this subject, like it or not, NUMA is in your future if you are running server-class machines — the thread scheduling scheme gets node-oriented. (Physical memory allocations are also node-oriented on NUMA machines.) Once scheduled on a node, a thread is likely to continue to be scheduled to run on one of that node’s CPUs. (Subject to availability, similar to the SMP case.)

    This NUMA node-oriented soft affinity scheme works pretty well when the L2 cache is a resource that is shared by all the processor cores on the socket. In today’s multi-core machines, so long as the thread is re-dispatched to a CPU on the same socket (or node) where it last ran, the thread will likely benefit from an L2 cache warm start. But for an L1 cache warm start, the thread still has to be dispatched on its ideal (and still preferred) processor since that resource is dedicated.

    This description of the behavior of the Windows Scheduler is also worth my mentioning here because of its significance to my earlier "Mainstream NUMA and the TCP/IP stack" posting. In the next part of the "Mainstream NUMA" post, which I hope to have ready in another week or so, I will try to make this connection explicit.

    So, thanks for keeping us honest and stay tuned!

    — Mark

  4. Antonio D says:

    Thank you for the answer. I feel enlightened now!

    Especially about soft affinity. I kind of always worried about that.

  5. One of our main roles in DevDiv Performance Engineering is to help other teams with performance investigations

  6. Mohit Nanda says:

    Thanks Mark for the enlightening details about ‘Soft Affinity’, and the reference to your Win2003 Performance Engineering Handbook was also useful.

    Looking forward to the next part of the "Mainstream NUMA" post.

  7. Soma’s been talking about the upcoming Visual Studio 2010 release on his blog , which means I’m starting

  8. Recently, a colleague of mine, Mark Friedman, posted a blog titled “ Parallel Scalability Isn’t Child’s