SQL Server 2005 has been designed to take advantage of NUMA features exposed by both hardware and the Windows operating system. There are several caveats you need to be aware of when attempting to run SQL Server on NUMA platforms.
In this post I would like to go over both Windows's and SQL Server's NUMA support, describe their possible configurations, and give some advice on troubleshooting some of the issues.
The latest SQL Server CTP release has most of the NUMA support already built in, so you can play with it and see actual NUMA behavior for yourself. If you are more interested in troubleshooting without understanding the underlying issues, you can jump to the troubleshooting section right away.
Background: I expect you to understand the classic cc-NUMA configuration, so I won't go into the details of explaining it here.
Windows NUMA Support
NUMA hardware platforms can be configured in different ways. The two major configurations are pure NUMA and interleaved NUMA. Pure NUMA appears to the operating system as a set of CPU nodes, sometimes called pods, each with local memory. Depending on the hardware manufacturer, there is a different penalty for accessing memory residing on other nodes - remote memory. In this configuration, for an application to perform well it needs to minimize the number of remote accesses.
When the system is configured to use interleaved memory, it appears to the operating system as a large SMP box where memory for cache lines is interleaved across the NUMA nodes. This configuration is suitable for applications that don't have NUMA optimizations and might behave badly in a pure NUMA configuration. For example, if you don't have the SQL Server 2000 post-SP3 QFE that contains NUMA optimizations, you shouldn't be running SQL Server on a pure NUMA configuration. In general SQL Server 2000 doesn't have many NUMA optimizations, so you might want to always configure such a system as interleaved.
When Windows starts up on pure NUMA hardware, it recognizes the system's multi-node configuration and boots accordingly. From our experience we have noticed that during startup the OS will allocate memory mostly from a single node - keep in mind that this depends on memory availability on the nodes as well as the number of applications the OS has to start. This allocation pattern during OS startup can be problematic for a NUMA-aware application, since memory is not distributed evenly across the nodes and there is no way to find out the memory distribution across the nodes. As the system continues running, the problem becomes more severe - there is less and less free memory available on the nodes. This could be due to memory-hungry applications or to the System File Cache, SFC. Depending on the system configuration, it is possible for the System File Cache to allocate a significant amount of memory. Moreover, on NUMA configurations it is quite possible for the SFC to use memory unevenly across nodes. If a NUMA-aware application only attempts to allocate local memory, it might become a victim of memory starvation on the nodes where it collided with the SFC or with other memory-hungry applications.
Windows exposes a set of APIs - such as GetNumaHighestNodeNumber, GetNumaProcessorNode, and GetNumaAvailableMemoryNode - that allow applications to take advantage of NUMA.
There are several API gotchas that you have to be aware of:
A. There is no explicit way to allocate memory from a given node - a thread allocates memory from a given memory node by changing its affinity to that node and then calling VirtualAlloc (if running with a swap file you will have to touch the virtual address before it gets bound to a physical page) or other low-level APIs that trigger physical page allocation. If the OS doesn't have memory on the given node, it will serve the memory from a different node. It won't fail the allocation.
B. GetNumaAvailableMemoryNode can return 0 even though there could still be memory that the OS could serve from that node.
C. Windows will serve memory in the following order: 1. from the node the thread is affinitized to; 2. from a node that has free memory; and 3. from any node - memory freed up from standby lists, working sets, and the System File Cache.
The consequence of A & C is that NUMA-aware applications need special handling for remote memory. A & C might cause the OS to return memory in random order, such as remote, local, remote, etc. Keep in mind that if an application decides not to cache remote memory, it might get into trouble: when memory is freed it gets put on the free list and will be given away on the next allocation request.
The consequence of B is that applications can't reliably decide whether they can or can't allocate memory from a node. Case B usually occurs after the system has been running for some time, especially when large applications are present.
SQL Server 2005 NUMA Support
SQL Server 2005 attempts to take full advantage of the NUMA architecture by leveraging SQLOS (for more info see http://blogs.msdn.com/slavao/articles/441058.aspx). During startup SQL Server will configure itself based on the underlying OS and hardware configuration. It will create an environment inside itself that mimics the actual hardware - SQL Server creates a software abstraction, which for the purpose of our discussion we will call a Node, around every NUMA node and its memory.
Every SQL Server Node has a memory manager, a set of schedulers each of which maps to an underlying CPU, an I/O port, and other components. A SQL Server Node can be thought of as a separate SQL Server instance with its own I/O port that it listens on. Clients or separate applications can be configured to connect to a specific Node only. A Node can be online - accepting new work - or offline, accepting neither new work nor connections. The state of a Node can be changed on the fly using affinity settings - see setting affinity below. A connection can be made to a given Node or set of Nodes - a network port can be associated with multiple Nodes. If no extra configuration is specified, a new connection is assigned to the next Node in a round-robin fashion. If a connection is made to a set of Nodes, it will be assigned to the next Node in a round-robin fashion across that set of Nodes. BOL has plenty of information on how to configure SQL Server's ports, nodes, and clients to connect. Once a connection is created, it is bound to a Node until disconnected. If the connection's Node is moved offline, its work will be scheduled on other Nodes. All the memory required to serve a connection's needs is allocated locally from the memory attached to its Node. Database pages are always allocated locally from the memory of the Node they are accessed on, unless they have been brought into memory by a connection assigned to another Node. Please keep in mind that in the latest CTP, SQL Server attempts not to allocate memory from a different Node. This behavior might cause premature OOM or significant query slowdown.
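As a sketch of the port-per-Node configuration mentioned above: SQL Server 2005 lets you map a TCP port to one or more NUMA nodes by appending a node mask in brackets after the port number in the TCP/IP properties in SQL Server Configuration Manager (see BOL for the exact syntax). A hypothetical setup listening on two ports, each bound to one node, might look like:

```
TCP Port: 1500[0x1],1501[0x2]
```

Here clients connecting to port 1500 would land on Node 0 only and those connecting to 1501 on Node 1 only; the port numbers and masks are purely illustrative.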
To support different NUMA features, SQL Server can be configured on both the server and the client side. On the server side DBAs can configure the amount of memory, active Nodes, the number of threads, locked pages, and the network configuration - which NIC is bound to which Node. On the client side a DBA can configure particular clients/applications to connect to specific Nodes.
Below are two settings that affect SQL Server's behavior on NUMA:
- Affinity - Using sp_configure a DBA can easily change SQL Server's affinity. The major difference between SQL Server 2000 and SQL Server 2005 is that affinity changes take effect right away - no server restart is required. As you know from SQL Server 2000, if affinity is not set, SQL Server will affinitize itself to all CPUs. In SQL Server 2005, when running on NUMA with affinity not set, SQL Server will affinitize each scheduler to the set of CPUs that its SQL Server Node maps to. For example, for NUMA node 1 (nodes are numbered starting with 0) with 4 CPUs, SQL Server's schedulers belonging to Node 1 will have an affinity of 0xF0. When a DBA does specify affinity - we call this hard affinity - every scheduler in a Node is affinitized to its own CPU. For example, in the above example the first scheduler will have affinity 0x10. When a Node's scheduler maps to an underlying CPU, we say that the scheduler is online. It is possible for a scheduler to be offline. For example, let's say that on a two-NUMA-node system a DBA sets the server affinity to 0x1F - only 5 schedulers are mapped to actual CPUs: 4 from Node 0 and 1 from Node 1. The last 3 schedulers from Node 1 are moved offline - no new work will be submitted to them. If all schedulers of a given Node are offline, then the whole Node is offline - it won't get any work assigned; moreover, no new connections will be made to that Node.
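The 0x1F scenario above can be sketched with sp_configure (assuming a hypothetical two-node box with 4 CPUs per node; "affinity mask" takes a decimal integer, so 0x1F is written as 31):

```sql
-- Enable advanced options so 'affinity mask' becomes visible.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Hard-affinitize to CPUs 0-4 (mask 0x1F = 31): all 4 schedulers on
-- Node 0 stay online plus 1 scheduler on Node 1; Node 1's remaining
-- 3 schedulers go offline.
EXEC sp_configure 'affinity mask', 31;
RECONFIGURE;  -- takes effect immediately, no restart required
```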
- Max Server Memory - This is a global setting. In the current CTP, SQL Server 2005 will continue to allocate memory from a given node as long as the node still has memory - that is, the allocation returns memory from the requested node - and the overall allocated memory doesn't exceed max server memory. SQL Server won't try to allocate/use remote memory - this issue will be addressed in the final version of SQL Server 2005.
Troubleshooting problems related to SQL Server running on NUMA
The most common problem: a query sporadically runs slowly even though its plan doesn't change, or clients observe other significant performance degradation. The major cause of this problem is memory exhaustion on the Node the query is running on. The memory exhaustion can happen for multiple reasons:
- Node 0 problem - All memory on this node is consumed during OS startup.
A possible solution for this problem is to change SQL Server's affinity to move Node 0 offline.
- Node n problem - All memory on this node is consumed by other applications or by the System File Cache (also see the misconfigured SFC problem below). You can verify whether you are hitting this problem by looking at the size of the SFC, other applications' memory consumption, and the amount of memory consumed by SQL Server on every Node from DBCC MEMORYSTATUS.
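For the per-Node check, the relevant command is:

```sql
-- Dumps SQL Server's internal memory accounting; on a NUMA box the
-- output includes a section per memory node, showing how much memory
-- SQL Server has allocated on each Node.
DBCC MEMORYSTATUS;
```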
A possible solution for this problem is to grab all the memory for SQL Server before starting other applications. You can achieve this by utilizing max server memory, dynamic scheduling, and client affinity. The idea is to start one Node at a time, connect to it, apply load to allocate memory on the Node, start the next Node, apply load, force memory allocation, and so on. At the end, set min and max server memory to be the same so that SQL Server doesn't shrink its memory usage under load. Here are the steps:
1. Configure SQL Server to run on multiple Nodes with a port per Node
2. Configure a client for each Node
3. Prepare a large query, for example a reindex, that will require at least DesirableMemory/#Nodes of memory
4. Start SQL Server
5. Move all Nodes offline except for Node 0 using sp_configure
6. Set "max server memory" to DesirableMemory/#Nodes using sp_configure
7. Connect to Node 0 (on later iterations, to the Node just brought online)
8. Run the job
9. Increase "max server memory" to "max server memory" + DesirableMemory/#Nodes using sp_configure
10. Move the next Node online using sp_configure
11. Repeat steps 7-10 for the rest of the Nodes
12. Set "max server memory" = "min server memory" = DesirableMemory
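A minimal sketch of this sequence, assuming a hypothetical two-node box with 4 CPUs per node and a desired total of 8 GB (so one node's share is 4096 MB; "max server memory" is specified in MB):

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Take Node 1 offline by affinitizing to Node 0's CPUs only
-- (CPUs 0-3, mask 0x0F = 15).
EXEC sp_configure 'affinity mask', 15;
RECONFIGURE;

-- Cap memory at one node's share.
EXEC sp_configure 'max server memory', 4096;
RECONFIGURE;

-- Now connect to Node 0's port and run the large job there
-- (e.g. the reindex prepared earlier).

-- Raise the cap by another node's share.
EXEC sp_configure 'max server memory', 8192;
RECONFIGURE;

-- Bring Node 1 online (all 8 CPUs, mask 0xFF = 255), reconnect to it,
-- and rerun the job to force local allocations on Node 1.
EXEC sp_configure 'affinity mask', 255;
RECONFIGURE;

-- Finally, pin memory so SQL Server doesn't shrink under load.
EXEC sp_configure 'min server memory', 8192;
RECONFIGURE;
```

The node count, masks, and memory values are illustrative; substitute your own DesirableMemory and CPU layout.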
- Misconfigured System File Cache problem - This problem is related to the Node n problem. If the SFC is incorrectly configured, it can consume all the memory on the machine. You can verify whether you are hitting this issue by looking at the size of the SFC through either Task Manager or perfmon. You can fix the issue by going to "My Computer" -> "Properties" -> "Advanced" -> "Performance Settings" -> "Advanced" -> "Memory usage: Programs" -> "Ok" and rebooting the box if you changed the setting. It is possible for applications, even when the system is configured for Programs, to force large consumption in the System File Cache. If that happens, you could write an application that periodically trims the size of the System File Cache. There is one available from SysInternals: http://www.sysinternals.com/Utilities/CacheSet.html (I haven't tried it lately so use it at your own risk :-))
Your comments are welcome!