Be aware: To Hyper or not to Hyper
Our customers observed very interesting behavior on high-end Hyperthreading (HT) enabled hardware. They noticed that in some cases, when high load is applied, SQL Server CPU usage increases significantly but SQL Server performance degrades. Occasionally they would also see a message in the errorlog indicating that one of the threads couldn't acquire a spinlock. (A spinlock is a lightweight synchronization data structure. Threads hold a spinlock for a short period of time; if a thread can't acquire a spinlock, it spins in a tight loop waiting for the spinlock to become available.) Customers also noticed that with HT disabled, under the same circumstances, CPU usage increases only slightly and performance stays at an appropriate level. What is going on here?
After some pondering and investigation we came up with the following theory. A database server has different types of threads: worker threads, which execute client requests, and system threads, which perform system tasks. An example of a system thread is the lazywriter. The lazywriter's responsibility is to lazily scan through database buffers and push the least recently used buffers out of the cache. The lazywriter usually kicks in whenever the data page cache is full and the server is short of free data page buffers. The interesting behavior of the lazywriter is that it scans a large amount of memory. With Intel HT technology, logical processors share the L1 & L2 caches. As you would guess, the lazywriter's behavior can potentially thrash the L1 & L2 caches. If a worker thread ends up running on a logical CPU that shares a physical CPU with the lazywriter, its cache will be constantly thrashed, which means that most of the worker thread's memory accesses will be L1 & L2 misses. Moreover, whenever the worker attempts to access a hot data structure protected by a spinlock, it potentially has less chance of acquiring the spinlock, which will cause the worker to spin. This behavior translates into high CPU utilization and a significant drop in performance.
To confirm the theory I decided to create an experiment. I wrote a test program which consists of two types of threads: scan threads and lock threads. Scan threads scan through large amounts of memory. Lock threads constantly attempt to acquire a global lock. If the theory is correct, a lock thread that shares a physical CPU with a scan thread should acquire the lock much less often than the other threads in a given period of time. Below is the program's listing. (Disclosure: the program is for illustrative purposes only; it was written to illustrate the observed behavior and shouldn't be treated as a final, well-tested product :-))
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <eh.h>
#include <wtypes.h>
#include <locale.h>
#include <process.h>
extern "C" void _mm_pause(void);
#pragma intrinsic (_mm_pause)
inline void
SmtPause ()
{
#if defined (_X86_) || defined (_AMD64_)
// PAUSE instruction (REP NOP) helps SMT scheduling
//
_mm_pause ();
#endif // _X86_ || _AMD64_
}
// Statistic information for threads
//
struct Statistics
{
int locked;
int spinned;
DWORD_PTR affinity;
};
// Types of threads the test uses
//
enum ThreadTypes
{
SCAN_THREAD,
LOCK_THREAD,
LAST_THREAD
};
// Entry Point
//
typedef void( __cdecl *ThreadEntryPoint )( void * );
// Information per thread - enables us to create threads in a common way
//
struct ThreadInfo
{
ThreadEntryPoint func;
LONG_PTR affinity;
};
// Global constants
//
// Number of CPUs
//
const int MaxNumberOfCPUs = 64;
// Block of memory to scan by scan thread
//
char MemoryBlock [1024 *1024];
// Stat information: the number of threads is limited by the number of CPUs,
// which is the purpose of the test
//
volatile Statistics Stats [MaxNumberOfCPUs];
// Number of threads running
//
volatile long NumberOfThreadsRunning = 0;
// "Spinlock" that all lock threads attempt to acquire
//
volatile long Lock = 0;
//-----------------------------------------------------------------------------
// Function: scanThread
//
// Description:
// Attempt to invalidate caches
//
// Notes:
// 1. Function changes 1MB of contiguous virtual memory, which appears
// to be enough to invalidate the L1 & L2 caches
//
void __cdecl
scanThread (void* param)
{
ULONG_PTR threadAffinity = (DWORD_PTR) param;
long threadId = InterlockedIncrement (&NumberOfThreadsRunning) - 1;
SetThreadAffinityMask (GetCurrentThread (), threadAffinity);
// Initialize affinity
//
Stats [threadId].affinity = threadAffinity;
while (1)
{
{
// Memset memory: quick and dirty scan
//
memset (MemoryBlock, 0, sizeof (MemoryBlock));
Stats [threadId].spinned++;
}
return;
}
//-----------------------------------------------------------------------------
// Function: lockThread
//
// Description:
// Acquires and releases a lock in a loop
//
// Notes:
//
//
void __cdecl
lockThread (void* param)
{
ULONG_PTR threadAffinity =(DWORD_PTR) param;
long threadId = InterlockedIncrement (&NumberOfThreadsRunning)-1;
SetThreadAffinityMask (GetCurrentThread (), threadAffinity);
// Initialize affinity
//
Stats [threadId].affinity = threadAffinity;
while (1)
{
if (Lock == 0 && InterlockedExchange (&Lock, 1) == 0)
{
// Increment number of times we acquired lock
//
Stats [threadId].locked ++;
Lock = 0;
}
else
{
Stats [threadId].spinned++;
// Be HT friendly
//
SmtPause ();
continue;
}
}
return;
}
//-----------------------------------------------------------------------------
// Function:main
//
// Description:
// Creates lock and scan threads, reporting stats every 10sec
//
// Notes:
//
//
int __cdecl
main (
int argc,
char* argv [])
{
LONG_PTR threadId = 0;
ThreadInfo threadInfo [LAST_THREAD];
DWORD_PTR processAffinity;
DWORD_PTR threadAffinity;
DWORD_PTR systemAffinity;
int counter = 0;
int numberOfThreads = 0;
// Retrieve process affinity
//
if (GetProcessAffinityMask (GetCurrentProcess (),
&processAffinity,
&systemAffinity) == FALSE)
{
printf ("GetProcessAffinityMask failed - exiting\n");
exit (-1);
}
// Initialize thread information: set defaults first, then check
// whether the user passed anything in
//
// Allocate all but one CPUs for threads running locking code
//
threadInfo [LOCK_THREAD].func = lockThread;
threadInfo [LOCK_THREAD].affinity = processAffinity & (processAffinity - 1);
// Allocate one CPU for scan thread
//
threadInfo [SCAN_THREAD].func = scanThread;
threadInfo [SCAN_THREAD].affinity =threadInfo [LOCK_THREAD].affinity ^ processAffinity;
// Read command line arguments: in simple way
//
for (counter = 1; counter < argc; counter++)
{
if (argv [counter][0] != '-')
{
continue;
}
switch (argv [counter][1])
{
case 's':
case 'S':
{
// Retrieve affinity mask for scan threads
//
threadInfo [SCAN_THREAD].affinity =
(DWORD_PTR) _atoi64 ((argv [counter] + 2));
break;
}
case 'l':
case 'L':
{
// Retrieve affinity mask for lock threads
//
threadInfo [LOCK_THREAD].affinity =
(DWORD_PTR) _atoi64 ((argv [counter] + 2));
break;
}
case '?':
{
printf ("Usage: test -sn -lm \n"
"Example: test.exe -s8 -l7 (runs 1 scanning thread and 3 locking threads)\n"
"-sn is affinity of threads performing memory scan \n"
"-lm is affinity of threads acquiring lock\n");
exit (0);
}
default:
{
break;
}
}
}
// Start threads of all types based on supplied affinities
//
for (counter = 0; counter < LAST_THREAD; counter++)
{
while (threadInfo [counter].affinity != 0)
{
// Separate the affinity bit for the thread we are trying to
// create from the full affinity mask
//
// Remember full mask
//
threadAffinity = threadInfo [counter].affinity;
// Turn off least significant bit
//
threadInfo [counter].affinity &= threadInfo [counter].affinity -1;
// Calculate the affinity bit for the thread:
// Leave one bit on - affinity bit
//
threadAffinity ^= threadInfo [counter].affinity;
// Use beginthread for simplicity
//
if (_beginthread (threadInfo [counter].func,
0,
(void*) threadAffinity) == -1)
{
printf ("Failed to create thread - exiting\n");
exit (-1);
}
numberOfThreads++;
}
}
// Loop and report stats periodically
//
while (1)
{
// Report every 10 sec
//
Sleep (10000);
for (threadId=0; threadId < numberOfThreads; threadId++)
{
printf ("Thread Id = %d\t, Thread Affinity = %p\t, Locked = %d\t, Spin = %d\n",
(int) threadId,
(void*) Stats [threadId].affinity,
Stats [threadId].locked,
Stats [threadId].spinned);
// Reset run time data
//
Stats [threadId].locked = 0;
Stats [threadId].spinned =0;
}
}
return 0;
}
As you can see, one can specify affinities for the scan and lock threads when starting the program.
I ran several experiments on my dev box, which has 2 physical and 4 logical CPUs. CPUs 1 (affinity 0x1) & 3 (affinity 0x4) and CPUs 2 (affinity 0x2) & 4 (affinity 0x8) share physical CPUs respectively. The box runs Windows Server 2003 SP1. Below are the results:
test.exe -s8 -l7
Thread Id = 0 , Thread Affinity = 00000008 , Locked = 0 , Spin = 9035
Thread Id = 1 , Thread Affinity = 00000001 , Locked = 9290331 , Spin = 10373717
Thread Id = 2 , Thread Affinity = 00000002 , Locked = 676397 , Spin = 40879794
Thread Id = 3 , Thread Affinity = 00000004 , Locked = 10274030 , Spin = 42433521
test.exe -s4 -l11
Thread Id = 0 , Thread Affinity = 00000004 , Locked = 0 , Spin = 8943
Thread Id = 1 , Thread Affinity = 00000001 , Locked = 741166 , Spin = 11421139
Thread Id = 2 , Thread Affinity = 00000002 , Locked = 10747824 , Spin = 10024252
Thread Id = 3 , Thread Affinity = 00000008 , Locked = 10210221 , Spin = 13988134
test.exe -s1 -l14
Thread Id = 0 , Thread Affinity = 00000001 , Locked = 0 , Spin = 8472
Thread Id = 1 , Thread Affinity = 00000002 , Locked = 10550011 , Spin = 20156206
Thread Id = 2 , Thread Affinity = 00000004 , Locked = 722554 , Spin = 11211074
Thread Id = 3 , Thread Affinity = 00000008 , Locked = 11182506 , Spin = 25376166
test.exe -s2 -l13
Thread Id = 0 , Thread Affinity = 00000002 , Locked = 0 , Spin = 8585
Thread Id = 1 , Thread Affinity = 00000001 , Locked = 9900899 , Spin = 12267885
Thread Id = 2 , Thread Affinity = 00000004 , Locked = 8984095 , Spin = 11909080
Thread Id = 3 , Thread Affinity = 00000008 , Locked = 1297138 , Spin = 28199769
Great! The experiment confirms the theory. So does it mean you have to disable HT when using SQL Server? The answer is that it really depends on the load and the hardware you are using.
You have to test your application with HT on and off under heavy load to understand HT's implications.
Keep in mind that it is not only the lazywriter thread that can cause a slowdown: any thread that performs a large memory scan, for example a worker thread scanning a large amount of data, might be a culprit as well.
For some customer applications we saw a 10% increase in performance when disabling HT. So make sure you do your homework before you decide to hyper or not to hyper :-)
Hope this information was useful. Let me know if you have any questions!