Introduction to NUMA

by Bob Golding

The introduction of NUMA (Non-Uniform Memory Access) required changes in memory management.  Since accessing memory that is not local to a node can result in the access going over an interconnect such as fibre, the memory manager tries to allocate memory locally to avoid the performance cost of remote access.  This is how the mechanism works in Windows Server 2008.

What is NUMA?

The NUMA architecture is basically a small number of processor groups, each group having its own memory and possibly its own I/O channels.  Each group of CPUs is called a ‘node’, and the processors in one node can access another node’s memory without having to worry about coherency.  Memory that is local to a node is called local or near memory; memory outside of a node is called foreign or far memory.  Local memory is on the same node as the group of processors, although we do support configurations where some memory nodes may not have any local CPUs.  Far memory is memory that is local to another node, and for a remote node to access it the request may have to go over an interconnect such as fibre.  Because that is more expensive, the OS tracks which node each page of physical memory resides on and uses this information to allocate memory optimally.
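This post is about what the kernel's memory manager does, but user-mode code can cooperate with the same scheme.  As a hedged sketch only, the Win32 NUMA APIs (GetNumaHighestNodeNumber, GetNumaProcessorNode, and VirtualAllocExNuma, available on Windows Vista and Windows Server 2008 and later) let a program discover the topology and ask for memory on the node it is currently running on; the buffer size and error handling below are purely illustrative:

#define _WIN32_WINNT 0x0600   /* the NUMA APIs used here require Vista / Server 2008 or later */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highestNode = 0;
    UCHAR node = 0;
    DWORD proc;
    PVOID buffer;
    SIZE_T size = 1 << 20;   /* 1 MB, illustrative only */

    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Highest NUMA node number: %lu\n", highestNode);

    /* Find the node of the processor this thread is currently running on. */
    proc = GetCurrentProcessorNumber();
    if (!GetNumaProcessorNode((UCHAR)proc, &node)) {
        printf("GetNumaProcessorNode failed: %lu\n", GetLastError());
        return 1;
    }
    printf("CPU %lu is on node %u\n", proc, (unsigned)node);

    /* Ask for committed memory, preferring pages from that node. */
    buffer = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, node);
    if (buffer == NULL) {
        printf("VirtualAllocExNuma failed: %lu\n", GetLastError());
        return 1;
    }

    /* ... use the buffer ... */
    VirtualFree(buffer, 0, MEM_RELEASE);
    return 0;
}

Note that VirtualAllocExNuma only expresses a preference; if the preferred node has no free pages, the allocation can still be satisfied from another node.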

How is memory tracked?

In versions prior to Windows XP, each memory page had its own ‘color’, or cache locality.  When a memory page was on the Free or Zeroed list, it was also on a color list.  This mechanism was enhanced so that the color now encodes the processor node number as well, and each node has a number of colors.  When the system is initialized, the memory manager calls a function named HalpNumaQueryPageToNode to get the node number for a physical address.

How is the memory organized?

There are two lists; one is the Zeroed Memory list and the other is the Free Memory list.  For example:

nt!MmFreePagesByColor = struct _MMCOLOR_TABLES *[2]

This location is an array of two pointers to the color lists.  The first address in the array is a pointer to the Zeroed color list; the other points to the Free color list.  Each entry looks like this:

nt!_MMCOLOR_TABLES

   +0x000 Flink : Uint8B <<-- Page #

   +0x008 Blink : Ptr64 Void <<-- PFN address

   +0x010 Count : Uint8B
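
Expressed as a C structure, the x64 layout shown by the debugger corresponds to something like the sketch below (the comments follow the debugger annotations above; this is an illustration, not the actual Windows source):

#include <stdint.h>

/* Sketch of one color table entry, matching the offsets shown by dt. */
typedef struct _MMCOLOR_TABLES {
    uint64_t Flink;   /* +0x000: page frame number of the first page on this color list  */
    void    *Blink;   /* +0x008: pointer into the PFN database (the "PFN address" above) */
    uint64_t Count;   /* +0x010: number of pages currently on this color list            */
} MMCOLOR_TABLES;

Note that the structure is 0x18 bytes, which is the entry size used in the offset calculation later in this post.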

To find out how many color table entries there are, you need to look at a few locations.  The example below is from a system with 4 nodes:

nt!MmSecondaryColorNodeShift = 0x3

nt!MmSecondaryColorMask = 7

MmSecondaryColorNodeShift is the number of bits the node number is shifted left to reach the first color belonging to that node.  MmSecondaryColorMask is the mask used to get the color within the node: it is ANDed with the page frame number, and the result is ORed with the shifted node number to produce the index into the color table.
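In C terms, the index computation described above looks roughly like the sketch below; the function and parameter names are illustrative, not the actual kernel code:

#include <stdint.h>

/* Hedged sketch: compute the index into a color table (Free or Zeroed)
 * for a given page, using the shift and mask values described above.   */
static uint64_t ColorTableIndex(uint64_t node, uint64_t pfn,
                                uint64_t nodeShift, uint64_t colorMask)
{
    /* The shifted node number selects the block of colors belonging to
     * that node, and the low bits of the page frame number select the
     * color within the node.                                            */
    return (node << nodeShift) | (pfn & colorMask);
}

The byte offset into the table is then the index multiplied by the size of one entry, which is 0x18 bytes on x64 (see the structure above).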

Can I have an example?

Ok, for example, let’s take page 86152d, which has been assigned to Node 1:

14: kd> !pfn 86152d

    PFN 0086152D at address FFFFFA801923F870

    flink 0086152C blink / share count 0086152E pteaddress FFFFF6FC0430A968

    reference count 0000 used entry count 0000 Cached color 1 Priority 0

    restore pte 00861525 containing page FFFFFFFFFFFFF Zeroed

When a page is on a color list, two of the PFN entry’s fields are reused as the list links: the restore PTE field holds the forward link (861525 here) and the containing page field holds the back link (FFFFFFFFFFFFF, i.e. -1, meaning there is no previous page on this color list).

So to get the byte offset into the color table, use ((1 << 3) | (0x7 & 0x86152D)) * 0x18 (0x18 is the size of each color table entry):

14: kd> dq fffffa80`317fffd0+(18*d) <<-- fffffa80`317fffd0 is the start of the Zeroed color table

fffffa80`31800108 00000000`0086152d fffffa80`18bd5670

fffffa80`31800118 00000000`0000445c 
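
To double-check the arithmetic: the index is (1 << 3) | (0x86152D & 7) = 0xD, the byte offset is 0xD * 0x18 = 0x138, and 0xFFFFFA80317FFFD0 + 0x138 = 0xFFFFFA8031800108, which is exactly the address dq displayed.  Its Flink is our page (86152d), the second qword is the Blink (the PFN address), and the Count at offset +0x10 shows 0x445C pages on this color list.  The same check as a quick C sketch (the table base address is simply the one taken from this debug session):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t zeroedColorTableBase = 0xFFFFFA80317FFFD0ULL;  /* from this session */
    const uint64_t entrySize = 0x18;                              /* sizeof one entry  */

    uint64_t index = (1ULL << 3) | (0x86152DULL & 7);   /* Node 1, page 86152d -> 0xD */
    uint64_t entry = zeroedColorTableBase + index * entrySize;

    printf("color table entry for page 86152d is at %llx\n",
           (unsigned long long)entry);                  /* prints fffffa8031800108 */
    return 0;
}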

Are there any debugger extensions that will help with this?

To get a display of what memory belongs to what node use !numa_hal:

14: kd> !numa_hal

HAL NUMA Summary

----------------

    Node Count : 4

    Processor Count : 16

    Node ProximityId

    ------------------

    0x00 0x00000000

    0x01 0x00000001

    0x02 0x00000002

    0x03 0x00000003

    Proc Domain APIC Id

    ---------------------------

    0x00 0x00000000 0x00000000

    0x01 0x00000000 0x00000001

    0x02 0x00000000 0x00000002

    0x03 0x00000000 0x00000003

    0x04 0x00000001 0x00000004

    0x05 0x00000001 0x00000005

    0x06 0x00000001 0x00000006

    0x07 0x00000001 0x00000007

    0x08 0x00000002 0x00000008

    0x09 0x00000002 0x00000009

    0x0A 0x00000002 0x0000000A

    0x0B 0x00000002 0x0000000B

    0x0C 0x00000003 0x0000000C

    0x0D 0x00000003 0x0000000D

    0x0E 0x00000003 0x0000000E

    0x0F 0x00000003 0x0000000F

    Domain Range

    -----------------

    0x00000000 0x0000000000000000 -> 0x0000000480000000

    0x00000001 0x0000000480000000 -> 0x0000000880000000

    0x00000002 0x0000000880000000 -> 0x0000000C80000000

    0x00000003 0x0000000C80000000 -> 0xFFFFFFFFFFFFFFFF

As you can see from the above, memory is assigned linearly by node.  What kind of problem do you think MmAllocatePagesForMdlEx would cause if the highest acceptable physical address was FFFFF000 and it ran on CPU 9?  What if a number of requests ran on all nodes except node 0?
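To make the question concrete, here is a hedged sketch of the kind of kernel-mode call being described; the size is illustrative and error handling is omitted.  The interesting parts are the HighAddress limit of FFFFF000 and the node the calling thread happens to run on:

#include <ntddk.h>

/* Hedged illustration of the scenario above: request pages that must all
 * fall at or below physical address 0xFFFFF000, regardless of which node
 * the calling thread is running on.                                       */
PMDL AllocateLowPages(SIZE_T totalBytes)
{
    PHYSICAL_ADDRESS lowAddress;
    PHYSICAL_ADDRESS highAddress;
    PHYSICAL_ADDRESS skipBytes;

    lowAddress.QuadPart  = 0;
    highAddress.QuadPart = 0xFFFFF000;   /* in the layout above, only node 0 has memory this low */
    skipBytes.QuadPart   = 0;

    /* On CPU 9 (node 2 in the !numa_hal output), the memory manager prefers
     * local pages, but node 2's memory starts at 0x880000000, so no local
     * page can ever satisfy this request.                                   */
    return MmAllocatePagesForMdlEx(lowAddress, highAddress, skipBytes,
                                   totalBytes, MmCached, 0);
}

A caller that got an MDL back would return the pages with MmFreePagesFromMdl and then free the MDL itself with ExFreePool.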

Epilog

The scenario in the question above actually happened.  The answer is that the machine would ‘pause’ for a period of time while it futilely searched node 2’s memory for pages to satisfy the request (before eventually searching the other nodes).  That is the issue we worked on that resulted in this research.  I hope this gives you a better understanding of NUMA and memory management.

Bob Golding has been with Microsoft since 1997. He is a Senior Escalation Engineer on the Global Escalation Services team where he supports Microsoft's largest customers with their most critical issues. Bob can be reached at rgolding@microsoft.com. For more information about debugging Windows, visit https://blogs.msdn.com/ntdebugging.