A performance issue involving CLR garbage collection and SQL CPU affinity setting

We have been troubleshooting a customer’s case and uncovered a GC behavior with SQL Server CPU affinity that is worth sharing here in a blog.

Problem

Customer reported that they had two instances of SQL Server 2008 running on a two-node cluster.  Let’s call them Instance1 and Instance2.  When they run both instances on the same node, kicking off a job on Instance1 will cause query timeouts on Instance2. The queries timed out were CLR queries via spatial usage.

The node had plenty of memory for both instances.  Each instance had max server memory set.  Additionally, Instance1 and Instance2 had CPU affinity set so that they don’t overlap CPU usage.

 

Analysis

The puzzle here was that Instance2 was negatively impacted by a job run by Instance1 though on the surface it shouldn’t have been.  The system had plenty of memory and CPU’s were divided by CPU affinity settings. When Instance1’s job was not running or running on a different node, Instance2 would work perfectly fine.

We captured data such as DMV data, perfmon, userdump etc via pssdiag during problem period (when both instances ran on the same node and Instance1 kicked off the offending job).

In sys.dm_exec_requests, we saw high waits on  CLR_MANUAL_EVENT and CLR_CRST for spatial queries which used CLR on Instance1. 

From perfmon and userdump, we were able to see the threads were waiting for garbage collection to finish.   When the issue occured, time spent by GC was greatly increased.

For a while, we were puzzled why Instance1 would impact Instance2 by increasing it’s GC time spent.   The system had plenty of free memory and CPU affinity was set so that two instances didn’t overlap CPU usage.

 

We did notice that a few CPUs affinitized to Instance1 were pegged to 100% for a long time.  Note that these CPUs should not have been used by Instance2 because CPU affinity was set. 

One break-through came in when I noticed  that number of GC threads seemed to be odd.  SQL CLR uses server GC.   For each CPU, there will be a CLR heap and GC thread created.   What I noticed was that, even Instance1 had 16 CPUs affinitized to it, there were 64 GC threads (customer had 64 CPU on each node).  The same was true for Instance2 ( 64 GC threads with affinity set to 30 CPUs).

Here was a little discovery:

When you set CPU affinity on SQL Server,  it doesn’t actually set process affinity.  Instead the affinity is set at thread creation time to a specific CPU/scheduler.  Doing this way  will help enable dynamic CPU affinity.

But CLR initializes heaps and  GC threads based on process affinity.  It will create number of heaps and GC threads depending on the number of CPUs set by process affinity.  If no process affinity is set, it will create number of heaps and GC threads based machine CPU.

In other words, even though Instance2 was affinitized (at SQL level) to use CPUs 16-45, some GC threads still run on CPUs 0-15 (where instance1 ran).

This is where ‘interaction’ comes in.  Basically the CLR heaps and GC threads are not isolated at all.  For example, even CPU 1 was configured for Instance1, there will be one CLR heap and GC thread for Instance2 running on that CPU as well.

Since there were quite a few of CPUs run by Instance1 were pegged, this caused GC threads from Instance2 to fight CPU quantum, leading to increased GC time.

 

Conclusion & Solution

To summarize, if you have multiple instances of SQL Server on the same machine,  garbage collection from one instance may be impacted by another instance  even you have CPU affinity set to isolate the instances.

There are a couple of things you can do:

When you examine your system health, look for individual CPU utilization.  Any individual CPU pegged to 100% for a sustained period of time, you need to work to bring CPU down.  If you have a balanced resource usage, this behavior (GC thread starvation) shouldn’t happen.  SQL CLR has been in use since 2005, this is the first time I have seen this behavior.

If you want true isolation, consider use VM (one instance per VM).

 

A couple of  additional notes:

By the way, the same customer also helped uncover a situation Bob Dorr documented it in the blog titled  “AppDomain unloading messages flooding the SQL Server error log

Note that CLR itself has a bug that can cause GC to misbehave.  This is fixed in https://support.microsoft.com/default.aspx?scid=kb;EN-US;2504603.   This customer was on that build.  If you have GC misbehavior, apply this hotfix first before doing additional troubleshooting.  That’s maybe all you need.

 

 

Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support