Thoughts on Intel's recent hardware announcements

Intel briefed customers recently about the evolution of its processor architectures to support many-core processors. Highlights of the press briefing include the announcement of the quad-core Tukwila processor, which supports the IA-64 Itanium architecture, and a six-core x64-based processor called Dunnington that will be available later this year. The major focus of the announcement, though, was the new Nehalem architecture processors, which are scheduled for production by the end of this year.

Intel is executing on an aggressive, two-year product cycle, driven primarily by semiconductor fabrication improvements in the first year of the cycle that double the amount of circuitry it can fit on a chip. These are followed in the second year by architectural improvements designed to exploit the new chip density.

Last year, this involved bringing online a new 45-nanometer (nm) fabrication plant that utilizes a new technology, namely, hafnium-based high-k gate dielectrics and metal gates. This technology is designed to address a serious problem with the previous material used to insulate the tiny circuitry etched into the substrate. Historically, the material used for this insulation has been silicon dioxide. As the dimensions of the circuit have shrunk with each succeeding generation of semiconductor fabrication technology, so has the insulation layer. At the 90 nm point, the silicon dioxide insulator was just 5 atoms wide. Materials scientists felt it could not be shrunk any further without ceasing to function as an effective insulator. At that thickness, it also leaked significant power (and generated excess heat). It started to look like we had reached a physical limit on how much circuitry could be crammed onto a single chip.

Intel is championing the new materials and manufacturing processes as a breakthrough that will enable it to build increasingly dense chips. (If high-k gate dielectrics really are a breakthrough, the rest of the semiconductor industry can be expected to follow.) The Tukwila and Dunnington processors announced last week each contain close to 2 billion logic circuits. Intel expects the next turn of the two-year product cycle to hit like clockwork next year, when it moves to a next-generation 32-nm fab. This will once again double the number of circuits that can be packed on a chip, to roughly 4 billion: shrinking the linear feature size from 45 nm to 32 nm roughly halves the area each circuit occupies, since (45/32)² ≈ 2. Enter the new processor architecture, code-named Nehalem, designed to exploit the new fabrication density.

The Nehalem architecture is a modular design that can be produced today on the 45 nm process but can migrate next year to the 32 nm process. This year it will be built with 4 processor cores; next year, 8. Consistent with the imperative to conserve power, it does not look like Nehalem will increase the CPU clock speed; at least, there was nothing about a faster clock in the announcement materials.

The emphasis on conserving power is leading to an increased interest in using simultaneous multithreading (SMT) technology to boost processor performance. This is something Intel initially branded as Hyper-Threading (HT). It is nice to see Intel moving away from the original, marketing-oriented branding, which was confusing, to the generic and generally accepted terminology. The announcement hints at expanding the use of SMT in future Nehalem chips, which is something the original SMT research found to be very promising. We could get 8 processor cores on the 32 nm fab process next year, each supporting up to 4 logical processors, on a desktop machine. I don't know whether that will happen, because it is not clear that desktop machines need anything close to 32 logical processors. That is the challenge software development has to step up to, because right now, as I blogged about last time, the current generation of desktop software cannot effectively utilize all those processors.
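For readers who want to see how many logical processors SMT adds on their own machine, here is a minimal sketch using the Win32 GetLogicalProcessorInformation call. The API is real and documented; the little program around it is just my illustration, not anything from the Intel materials. Each RelationProcessorCore record describes one physical core, and the set bits in its processor mask are that core's logical processors.

```cpp
// Count physical cores vs. logical processors on Windows.
// Requires Windows XP SP3 / Server 2003 or later for GetLogicalProcessorInformation.
#include <windows.h>
#include <stdio.h>
#include <vector>

int main()
{
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);   // first call just reports the buffer size needed
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(&info[0], &len))
        return 1;

    int cores = 0, logical = 0;
    for (size_t i = 0; i < info.size(); ++i) {
        if (info[i].Relationship != RelationProcessorCore)
            continue;
        ++cores;                                  // one record per physical core
        for (ULONG_PTR mask = info[i].ProcessorMask; mask; mask >>= 1)
            logical += (int)(mask & 1);           // one set bit per logical processor
    }
    printf("%d physical cores, %d logical processors (SMT ratio %d:1)\n",
           cores, logical, cores ? logical / cores : 0);
    return 0;
}
```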

One of the performance issues that arose with HT in earlier Intel processors, especially with server workloads, was that the added logical processors tended to saturate the front-side bus (FSB) that connected the processors to the memory controller. (Here's an anecdotal example: https://www.cmg.org/measureit/issues/mit15/m_15_2.html.) The bus accesses are needed so that the snooping protocol can maintain cache coherence in a multiprocessor. The new QuickPath Interconnect provides a scalable alternative to the FSB that has both better latency and higher bandwidth. This should help bring SMT into the mainstream for server workloads. Architecturally, QuickPath looks very similar to AMD's HyperTransport (another "HT," which may explain why Intel has reverted to using the generic SMT terminology).
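If you want to see this kind of saturation for yourself, a crude probe like the sketch below will do it. This is my own illustration, not anything from Intel's materials, and the buffer sizes and thread counts are arbitrary choices: each thread streams through its own large buffer, and once the shared memory path saturates, adding threads stops increasing the aggregate bytes per second.

```cpp
// Crude memory-bandwidth probe. On an FSB-based system, aggregate throughput
// typically stops scaling well before the core count runs out.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

static const size_t kBufBytes = 64 * 1024 * 1024;   // much larger than any cache

// Touch one byte per 64-byte cache line; each touch pulls a full line from memory.
static void stream(const char* buf, size_t n, long long* out)
{
    long long sum = 0;
    for (size_t i = 0; i < n; i += 64)
        sum += buf[i];
    *out = sum;                                      // keep the loop from being optimized away
}

int main()
{
    const int kMaxThreads = 8;
    // Allocate and touch the buffers up front so the timed region measures only reads.
    std::vector<std::vector<char> > buffers(kMaxThreads, std::vector<char>(kBufBytes, 1));

    for (int nthreads = 1; nthreads <= kMaxThreads; nthreads *= 2) {
        std::vector<std::thread> workers;
        std::vector<long long> sums(nthreads);
        auto start = std::chrono::steady_clock::now();
        for (int t = 0; t < nthreads; ++t)
            workers.emplace_back(stream, buffers[t].data(), kBufBytes, &sums[t]);
        for (auto& w : workers)
            w.join();
        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        // Approximate bytes transferred: one full cache line per touched byte.
        double gbps = (double)nthreads * kBufBytes / secs / 1e9;
        std::printf("%d thread(s): ~%.1f GB/s aggregate\n", nthreads, gbps);
    }
    return 0;
}
```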

In the QuickPath architecture, each microprocessor contains a built-in memory controller designed to access its own dedicated local memory. But on machines configured with more than one microprocessor, QuickPath leads to a NUMA architecture. Within a microprocessor, all the processor cores and their logical processors can access local memory at a uniform speed using the integrated memory controller. Access to remote memory attached to a different microprocessor is slower: a program thread running on one microprocessor reaches that remote memory over the QuickPath Interconnect.
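On Windows, you can already see this topology from software. The following minimal sketch (my illustration, built on the documented Win32 NUMA calls) enumerates the NUMA nodes the OS reports and the logical processors belonging to each. A single-socket machine reports one node; a multi-socket QuickPath or Opteron box reports one node per physical processor package.

```cpp
// Enumerate the NUMA nodes Windows reports and the logical processors on each.
#include <windows.h>
#include <stdio.h>

int main()
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode))
        return 1;

    printf("NUMA nodes reported by the OS: %lu\n", highestNode + 1);
    for (ULONG node = 0; node <= highestNode; ++node) {
        ULONGLONG mask = 0;
        if (GetNumaNodeProcessorMask((UCHAR)node, &mask))
            printf("  node %lu: logical processor mask 0x%llx\n", node, mask);
    }
    return 0;
}
```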

These are the basic performance characteristics associated with NUMA (non-uniform memory access) machines. NUMA used to be confined mainly to esoteric high-end supercomputers, in large part due to the difficulty developers had programming for them. Nevertheless, NUMA is poised to become a mainstream architecture, which is another serious challenge for our software frameworks.

To run well, a multi-threaded program running on a NUMA machine needs to be aware of the machine environment and understand which memory references are local to the node and which are remote. A thread that starts on one NUMA node and then migrates to another pays a heavy price every time it has to fetch results from remote memory locations. The Windows OS is already NUMA-aware to a degree. Once dispatched, threads have node affinity, for example, and Windows memory management is also NUMA-aware: using per-node memory management data structures, the OS resists migrating threads to another node and otherwise tries to ensure that most memory accesses remain local. There are also a number of NUMA-oriented APIs that applications can use to keep their threads from migrating off-node and to direct memory allocations to a specific physical processing node.
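Here is a sketch of what using those APIs looks like in practice. It is my own example built on the documented Win32 calls available since Windows Vista/Server 2008, with error handling trimmed: it finds the node the current thread is running on, restricts the thread's affinity to that node's processors, and asks VirtualAllocExNuma to back an allocation with pages from the same node.

```cpp
// Pin the current thread to its NUMA node and allocate node-local memory.
#include <windows.h>
#include <stdio.h>

int main()
{
    // Find the NUMA node of the processor we are currently running on.
    DWORD cpu = GetCurrentProcessorNumber();
    UCHAR node = 0;
    GetNumaProcessorNode((UCHAR)cpu, &node);

    // Keep this thread from migrating off the node by restricting its
    // affinity to the node's logical processors. (On 32-bit Windows the
    // mask would need to fit in a DWORD_PTR; ignored here for brevity.)
    ULONGLONG nodeMask = 0;
    GetNumaNodeProcessorMask(node, &nodeMask);
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)nodeMask);

    // Ask the OS to back this allocation with physical pages on the same node.
    SIZE_T bytes = 64 * 1024 * 1024;
    void* local = VirtualAllocExNuma(GetCurrentProcess(), NULL, bytes,
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                     node);
    printf("thread pinned to node %u, local buffer at %p\n", node, local);

    if (local)
        VirtualFree(local, 0, MEM_RELEASE);
    return 0;
}
```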

In the meantime, it is an open question how NUMA-aware the application-level run-times and critical memory management functions like the GC in .NET need to be. Let's assume for a moment that 8 processor cores is more than enough for almost any desktop. That would mean dealing with the complexities of NUMA is confined, in the short term at least, to server applications. While that is occasion for a big sigh of relief, it is also a pointed reminder that facilities like the new Task Parallel Library in the .NET Framework will need to become NUMA-aware.

Finally, it is worth mentioning that there are a bunch of other goodies in the recent Intel announcement to further fuel the drive to many-core processors. These include making serializing instructions like XCHG run faster, which should be a boost to multi-threaded programs of all stripes. Intel is also adding a shared L3 cache to each chip; each processor core continues to have its own dedicated L1 and L2 caches.
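To see why faster serializing instructions matter so broadly, consider that even the most trivial lock leans on them. The sketch below is my illustration, not anything from the announcement: a bare-bones spinlock whose acquire path boils down to a LOCK XCHG via the Win32 InterlockedExchange intrinsic. Every lock acquisition in a heavily threaded program pays that instruction's cost, so shaving cycles off it helps just about everything.

```cpp
// Bare-bones spinlock built on InterlockedExchange (an atomic swap that
// compiles to a LOCK XCHG on x86/x64).
#include <windows.h>

static volatile LONG g_lock = 0;   // 0 = free, 1 = held

void SpinLockAcquire()
{
    // InterlockedExchange stores 1 and returns the previous value.
    // If the previous value was already 1, someone else holds the lock.
    while (InterlockedExchange(&g_lock, 1) != 0)
        YieldProcessor();          // emits PAUSE; be polite to the sibling SMT thread
}

void SpinLockRelease()
{
    // A plain store suffices to release on x86/x64; an interlocked store is
    // used here to keep the sketch portable across compilers.
    InterlockedExchange(&g_lock, 0);
}
```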