NUMA and you, perfect together (Part 1)

I know this is a slightly more esoteric topic, even for me, but I want to address cc:NUMA platforms and how they matter to Windows and Windows applications.  What is NUMA, you ask?  NUMA stands for Non-Uniform Memory Architecture.  (The cc: stands for Cache Coherent, by the way, because there is non-cache-coherent NUMA as well, but I won’t address that here since none of the platforms Windows supports are non-cache-coherent.)

To understand why NUMA exists, we need to look at Symmetric Multiprocessing (SMP).  SMP is built around a few core principles, and one is that every CPU in the system has an identical view of the system.  Memory, the I/O subsystem, and other CPUs can all be treated the same by software.  The problem comes when this assumption is no longer true.  As you scale up the size of a system, it becomes harder and harder to keep everything close together, literally.  The more switches and buses your data flows through, the longer it takes to arrive.

This means that in order to squeeze the maximum performance out of the system, it behooves the OS as well as the programmer to keep data as close as possible to the place where it’s needed.  By keeping track of which pages of memory and which CPUs have the best locality to each other, the OS can make better decisions when threads are scheduled and memory is allocated, and get that extra little bit out of the system.
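
To make the locality idea a bit more concrete, here is a toy sketch of the kind of decision described above.  It is not how Windows actually does it; the latency numbers and the names (LOCAL_NS, REMOTE_NS, access_cost, best_node) are made up purely for illustration.  It just picks the node where a thread would pay the least to reach the pages it already uses.

```python
# Toy model only: these latencies and names are hypothetical, not measured values.
LOCAL_NS = 100    # assumed cost of touching a page on the CPU's own node
REMOTE_NS = 150   # assumed cost of touching a page that lives on another node


def access_cost(cpu_node, page_node):
    """Cost of one access from a CPU on cpu_node to a page on page_node."""
    return LOCAL_NS if cpu_node == page_node else REMOTE_NS


def best_node(page_nodes):
    """Return the node whose CPUs pay the least to touch every page in page_nodes."""
    candidates = sorted(set(page_nodes))
    return min(
        candidates,
        key=lambda node: sum(access_cost(node, p) for p in page_nodes),
    )


# A thread whose working set is three pages on node 0 and one page on node 1.
pages = [0, 0, 0, 1]
node = best_node(pages)
total = sum(access_cost(node, p) for p in pages)
print(f"run the thread on node {node}, total cost {total} ns")
```

On a true SMP machine LOCAL_NS and REMOTE_NS would be equal and the choice would not matter; the moment they differ, where you run the thread and where you allocate its memory start to show up in the numbers.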

Until only a few years ago, this was exclusively the realm of large mainframe-style computers, not the PC world.  But with the introduction of the Unisys ES7000 in 2000, the PC suddenly had something to gain from being NUMA-aware.  Even then, this mostly concerned large scale-up server implementations, not the average user or programmer.  That is, until AMD announced its unique implementation of the new Opteron and Athlon64 processors.  Suddenly, any system with more than one of those CPUs could potentially benefit from NUMA optimizations.  I’ll go into why in the next entry.

Comments (6)

  1. Simon says:

    I need to thank whoever implemented NUMA in XP SP2 – I get over 10GB/s memory bandwidth on my dual Opteron!

    (SiSoft Sandra benchmark on my blog till it rolls off – but I’m on a pocket pc and can’t get the permalink – sorry!)

  2. I didn’t realize that had happened. I’ve been too busy playing with the amd64 port, which is awesome, by the way. Any idea if NUMA is in that build?

  3. Skywing says:

    AFAIK, the Win32 APIs to use NUMA smartly from within an app are only exposed on Windows Server 2003 or later.

  4. Simon says:

    Yes, Windows XP Service Pack 2 has NUMA support.

    I have it installed and running and you can’t tell me 10GB/s of memory bandwidth isn’t from NUMA – it’s basically using the memory closest to the CPU rather than having to pipe data through the cross-CPU HyperTransport link.

  5. amd64 == 3790 kernel, same as Server 2003. I bet the APIs are there too but I’m too lazy to look atm.