SQL Server 2008/2008 R2 on Newer Machines with More Than 8 CPUs Presented per NUMA Node May Need Trace Flag 8048

Applies To:  SQL Server 2008, 2008 R2, and Denali builds

The SQL Server developer can elect to partition memory allocations at different levels based on what the memory is used for.   The developer may choose a global, CPU, node, or even worker partitioning scheme.   Several of the allocation activities within SQL Server use the CMemPartitioned allocator, which partitions the memory by CPU or NUMA node to increase concurrency and performance.

You can picture CMemPartitioned like a standard heap (it is not created with HeapCreate, but the concept is the same).  When you create a heap you can specify whether you want synchronized access, a default size, and other attributes.   When the SQL Server developer creates a memory object, they indicate options such as thread-safe access, the partitioning scheme, and so on.

The developer creates the object so that when a new allocation occurs the chosen behavior is upheld.  A request from a worker against a NODE-based memory object uses a synchronization object (usually of CMEMTHREAD or SOS_SUSPEND_QUEUE type) at the node level to allocate memory local to the worker's assigned NUMA node.   An allocation against a CPU-based memory object uses a synchronization object at the CPU level to allocate memory local to the worker's CPU.

In most cases the CPU-based design reduces synchronization collisions the most because of the way SQL OS handles logical scheduling.  Preemptive and background tasks still make collisions possible, but CPU-level partitioning greatly reduces their frequency.  However, going to CPU-based partitioning means more overhead to maintain the individual CPU access paths and associated memory lists.

The NODE-based scheme reduces the overhead to the number of nodes but can slightly increase the collision possibilities and may impact ultimate performance results for very specific scenarios.  I want to caution you that the scenarios encountered by Microsoft CSS have been limited to very specific scopes and query patterns.



Newer hardware with multi-core CPUs can present more than 8 CPUs within a single NUMA node.  Microsoft has observed that when you approach and exceed 8 CPUs per node, the NODE-based partitioning may not scale as well for specific query patterns.   However, with trace flag 8048 (a startup parameter only, requiring a restart of the SQL Server process) all NODE-based partitioning is upgraded to CPU-based partitioning.   Remember this requires more memory overhead but can provide performance increases on these systems.
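As a sketch of how the flag is applied: add -T8048 to the service's startup parameters (for example, through SQL Server Configuration Manager) and restart SQL Server; it is not intended to be enabled with DBCC TRACEON while the server is running.  You can then confirm it took effect:

```sql
-- Add -T8048 to the SQL Server startup parameters (e.g., SQL Server
-- Configuration Manager -> your SQL Server service -> Startup Parameters),
-- then restart the SQL Server service.
--
-- After the restart, confirm the trace flag is enabled globally:
DBCC TRACESTATUS(8048, -1);
-- A row showing TraceFlag = 8048 with Global = 1 indicates it is active.
```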


The issue is commonly identified by looking at the DMVs sys.dm_os_wait_stats and sys.dm_os_spinlock_stats for the CMEMTHREAD and SOS_SUSPEND_QUEUE types.   Microsoft CSS usually sees the spins jump into the trillions and the waits become a hot spot.
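A minimal sketch of that check is below.  Note that not every type necessarily appears in both DMVs, and the raw counters accumulate since startup, so the meaningful signal is a large delta over a short sampling window rather than any absolute value:

```sql
-- Accumulated waits for the memory object synchronization types.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('CMEMTHREAD', 'SOS_SUSPEND_QUEUE');

-- Spinlock activity; a jump into the trillions of spins within a
-- 5-10 minute window is the pattern described in this post.
SELECT name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
WHERE name IN ('CMEMTHREAD', 'SOS_SUSPEND_QUEUE')
ORDER BY spins DESC;
```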

Bob Dorr - Principal SQL Server Escalation Engineer

Comments (10)

  1. Jungsun says:

    Do you mean the number of CPU sockets within a single NUMA node, not the number of cores?

    Thank you for the great information. 🙂

  2. MAbair34 says:

    I believe the article meant to reference sys.dm_os_spinlock_stats instead of dm_os_spin_stats

  3. RDORR says:

    Answering the question(s).

    I meant the number of CPUs presented for a given node.   This includes cores and HT CPUs represented in the same NUMA node.

    Yes dm_os_spinlock_stats – good catch.

  4. wayne says:

    At what rate of spins per hour or day would this be considered a bottleneck?  Wait stats can be skewed by other wait types far exceeding the ones mentioned in this article.

  5. RDORR says:

    For these issues we would see a big jump (trillions of spins) in a small window (5-10 minutes) that was really unusual for the normal running of the system.

  6. Lonny Niederstadt says:

    "Remember this requires more memory overhead"…

    What can be captured before and after implementing trace flag 8048 to indicate the overhead on a given system?  I've implemented trace flag 8048 on a number of DSS systems, and it's been beneficial on most of them.  But I haven't figured out where the overhead of per-core memory allocation vs per NUMA node memory allocation would show up.  Somewhere in memory clerk stats, or purely CPU consumption?  Cross-posting the question to Rohit Nayak's recent post on Diagnosing CMEMTHREAD Waits…


  7. Brian White says:

    " I want to caution you the scenarios encountered by Microsoft CSS have been limited to very specific scopes and query patterns."

    In my experience the pattern I've seen this with is a batch reporting model, where SAP or SSRS sends hundreds (or thousands) of queries to SQL Server in a very short period of time.

  8. Sql_handle says:

    1. In SQL Server 2008 R2 and SQL Server 2012, trace flag 8048 no longer seems to be a startup only trace flag. I'm not sure it's safe to change memory serialization strategy while a workload is in place, but I've seen it take place in a spinlock convoy situation and it had the desired performance impact.

    2. Memory serialization spinlock contention at the NUMA node or 'scheduler group' level can occur with less than eight cores per socket, and can occur even in a six vcpu vm.  The necessary ingredients are enough queries/threads competing for the spinlock resource while stealing memory against query memory grant.  Each allocation against the grant is a potential for spinlock contention.

    3. VMs consolidated on a physical server can be even more susceptible than physical servers to this spinlock, if competing for physical CPU time with other guests.  Time waiting for physical CPU can exaggerate the hold time of spinlock resources, increasing the time window for contention in each possession.

  9. Joe O'Brien says:

    We have a server with 1,500 user databases and an awful ad hoc workload. We are on 2008 and have 10 logical cores per socket/NUMA node. We see a large percentage of CMEMTHREAD waits (75%). BUT as far as spinlock stats, we are seeing SOS_TLIST as the largest spinner. This is a delta on a 10 minute sample:

    Spinlock           Collisions  Spins       SpinsPerCollision  SleepTime  Backoffs
    SOS_TLIST          4673403     2274201787  486                2877       36028
    LOCK_HASH          7391        6823378     923                0          893
    BUF_FREE_LIST      5409        2324733     429                7          154
    SOS_OBJECT_STORE   3036        66379       21                 0          3
    SOS_SUSPEND_QUEUE  1801        588115      326                1          42

    How do I determine if this trace flag is appropriate?

  10. JLowell says:

    A couple of quick questions:

    1.) Can you please be more specific on the statement "…the scenarios encountered by CSS have been limited to very specific scopes and query patterns" – if that's the case, which scopes and query patterns?

    2.) In the comments SQL_handle brings up a few very interesting points – can you please respond to them?

    3.) The article states that this TF requires more memory overhead – how much more?  I suspect that it will depend on the amount of memory on the server / NUMA node, so a percentage swag would be terrific.   For example, a server with 512GB of RAM will require less overhead than a server with 2TB of RAM – but the percentage could be important here….

    I ask these questions because "trying" this in a production workload – even where we see high spins, does involve an element of risk and many of us don't have the ability to just do it and then back it out with the limited information you've provided in this post.  

    Thanks in advance
