Running SQL Server on Machines with More Than 8 CPUs per NUMA Node May Need Trace Flag 8048

Article
03/02/2015

Applies To: SQL 2008, 2008 R2, 2012 and 2014 releases

Note: The number of CPUs is the logical count, not sockets. If more than 8 logical CPUs are presented this post may apply.

The SQL Server developer can elect to partition memory allocations at different levels based on the what the memory is used for. The developer may choose a global, CPU, Node, or even worker partitioning scheme. Several of the allocation activities within SQL Server use the CMemPartitioned allocator. This partitions the memory by CPU or NUMA node to increase concurrency and performance.

You can picture CMemPartitioned like a standard heap (it is not a HeapCreate ) but this concept is the same. When you create a heap you can specify if you want synchronized assess, default size and other attributes. When the SQL Server developer creates a memory object they indicate that they want things like thread safe access, the partitioning scheme and other options.

The developer creates the object so when a new allocation occurs the behavior is upheld. On the left is a request from a worker against a NODE based memory object. This will use a synchronization object (usually CMEMTHREAD or SOS_SUSPEND_QUEUE type) at the NODE level to allocate memory local to the workers assigned NUMA NODE. On the right is an allocation against a CPU based memory object. This will use a synchronization object at the CPU level to allocate memory local to the workers CPU.

In most cases the CPU based design reduces synchronization collisions the most because of the way SQL OS handles logical scheduling. Preemptive and background tasks make collisions possible but CPU level reduces the frequency greatly. However, going to CPU based partitioning means more overhead to maintain individual CPU access paths and associated memory lists.

The NODE based scheme reduces the overhead to the # of nodes but can slightly increase the collision possibilities and may impact ultimate, performance results for very specific scenarios. I want to caution you the scenarios encountered by Microsoft CSS have been limited to very specific scopes and query patterns.

Newer hardware with multi-core CPUs can present more than 8 CPUs within a single NUMA node. Microsoft has observed that when you approach and exceed 8 CPUs per node the NODE based partitioning may not scale as well for specific query patterns. However, using trace flag 8048 (startup parameter only requiring restart of the SQL Server process) all NODE based partitioning is upgraded to CPU based partitioning. Remember this requires more memory overhead but can provide performance increases on these systems.

HOW DO I KNOW IF I NEED THE TRACE FLAG?

The issue is commonly identified by looking as the DMVs dm_os_wait_stats and dm_os_spinlock_stats for types (CMEMTHREAD and SOS_SUSPEND_QUEUE). Microsoft CSS usually sees the spins jump into the trillions and the waits become a hot spot.

Caution: Use trace flag 8048 as a startup parameter. It is possible to use the trace flag dynamically but limited to only memory objects that are yet to be created when the trace flag is enabled. Memory objects already built are not impacted by the trace flag.

References

https://blogs.msdn.com/b/psssql/archive/2012/12/20/how-it-works-cmemthread-and-debugging-them.aspx

Bob Dorr - Principal SQL Server Escalation Engineer

Running SQL Server on Machines with More Than 8 CPUs per NUMA Node May Need Trace Flag 8048

Additional resources