Be aware: To Hyper or not to Hyper


Our customers have observed very interesting behavior on high-end, Hyperthreading (HT) enabled hardware. They noticed that in some cases, when high load is applied, SQL Server CPU usage increases significantly but SQL Server performance degrades. Occasionally they would also see a message in the errorlog indicating that one of the threads couldn’t acquire a spinlock. (A spinlock is a lightweight synchronization data structure. Threads hold a spinlock only for a short period of time; if a thread can’t acquire a spinlock, it spins in a tight loop waiting for the spinlock to become available.) Customers also noticed that with HT disabled, under the same circumstances, CPU usage increases only slightly and performance stays at an appropriate level. What is going on here?
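(For readers who have not run into spinlocks before, here is a minimal sketch of the idea. It is purely illustrative and is not SQL Server’s actual spinlock code; it uses the same _mm_pause intrinsic that the test program later in this post declares.)

// Illustration only: a bare-bones test-and-set spinlock
//
volatile long SpinLock = 0;

void AcquireSpinLock ()
{
    // Spin in a tight loop until we flip the lock from 0 to 1
    //
    while (SpinLock != 0 || InterlockedExchange (&SpinLock, 1) != 0)
    {
        _mm_pause ();   // be friendly to the sibling logical CPU while spinning
    }
}

void ReleaseSpinLock ()
{
    SpinLock = 0;
}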

 


After some pondering and investigation we came up with the following theory. A database server has different types of threads: worker threads, which execute client requests, and system threads, which perform system tasks. An example of a system thread is the lazywriter. The responsibility of the lazywriter is to lazily scan through database buffers and push the least recently used buffers out of the cache. The lazywriter usually kicks in whenever the data page cache is full and the server is short of free data page buffers. The interesting behavior of the lazywriter is that it scans a large amount of memory. With Intel HT technology, logical processors share the L1 & L2 caches. As you would guess, the lazywriter’s behavior can potentially thrash the L1 & L2 caches. If a worker thread ends up running on a logical CPU that shares a physical CPU with the lazywriter, its cache will be constantly thrashed. That means most of the memory accesses for the worker thread will be L1 & L2 misses. Moreover, whenever the worker attempts to access a hot data structure protected by a spinlock, it might have less chance of acquiring the spinlock, which will cause the worker to spin. This behavior translates into high CPU utilization and a significant drop in performance.


 


To confirm the theory I decided to create an experiment. I wrote a test program which consists of two types of threads: scan threads and lock threads. Scan threads scan through large amounts of memory. Lock threads constantly attempt to acquire a global lock. If the theory is correct, a lock thread that shares a physical CPU with a scan thread should acquire the lock much less often than the other threads in a given period of time. Below is the program’s listing. (Disclosure: the program is for illustrative purposes only; it was written to illustrate the observed behavior and shouldn’t be treated as a final, well-tested product :-))


 


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <windows.h>
#include <eh.h>
#include <wtypes.h>
#include <locale.h>
#include <process.h>

extern "C" void _mm_pause(void);
#pragma intrinsic (_mm_pause)

inline void
SmtPause ()
{
#if defined (_X86_) || defined (_AMD64_)
    // PAUSE instruction (REP NOP) helps SMT scheduling
    //
    _mm_pause ();
#endif // _X86_ || _AMD64_
}

// Statistic information for threads
//
struct Statistics
{
    int          locked;
    int          spinned;
    DWORD_PTR    affinity;
};

// Types of threads the test uses
//
enum ThreadTypes
{
    SCAN_THREAD,
    LOCK_THREAD,
    LAST_THREAD
};

// Entry point
//
typedef void( __cdecl *ThreadEntryPoint )( void * );

// Information per thread - enables us to create threads in a common way
//
struct ThreadInfo
{
    ThreadEntryPoint    func;
    LONG_PTR            affinity;
};

// Global constants
//
// Number of CPUs
//
const int    MaxNumberOfCPUs = 64;

// Block of memory to scan by scan thread
//
char    MemoryBlock [1024 * 1024];

// Stat information: number of threads is limited by number of CPUs,
// that is the purpose of the test
//
volatile Statistics    Stats [MaxNumberOfCPUs];

// Number of threads running
//
volatile long    NumberOfThreadsRunning = 0;

// "Spinlock" that all lock threads attempt to acquire
//
volatile long    Lock = 0;

//-----------------------------------------------------------------------------
// Function: scanThread
//
// Description:
//    Attempts to invalidate caches
//
// Notes:
//    1. Function changes 1MB of contiguous VM which appears
//       to be enough to invalidate the L1 & L2 caches
//
void __cdecl
scanThread (void* param)
{
    ULONG_PTR    threadAffinity = (DWORD_PTR) param;
    long         threadId = InterlockedIncrement (&NumberOfThreadsRunning) - 1;

    SetThreadAffinityMask (GetCurrentThread (), threadAffinity);

    // Initialize affinity
    //
    Stats [threadId].affinity = threadAffinity;

    for (;;)
    {
        // Memset memory: quick and dirty scan
        //
        memset (MemoryBlock, 0, sizeof (MemoryBlock));
        Stats [threadId].spinned++;
    }

    return;
}

//-----------------------------------------------------------------------------
// Function: lockThread
//
// Description:
//    Acquires and releases a lock in a loop
//
// Notes:
//
void __cdecl
lockThread (void* param)
{
    ULONG_PTR    threadAffinity = (DWORD_PTR) param;
    long         threadId = InterlockedIncrement (&NumberOfThreadsRunning) - 1;

    SetThreadAffinityMask (GetCurrentThread (), threadAffinity);

    // Initialize affinity
    //
    Stats [threadId].affinity = threadAffinity;

    while (1)
    {
        if (Lock == 0 && InterlockedExchange (&Lock, 1) == 0)
        {
            // Increment number of times we acquired lock
            //
            Stats [threadId].locked++;
            Lock = 0;
        }
        else
        {
            Stats [threadId].spinned++;

            // Be HT friendly
            //
            SmtPause ();
            continue;
        }
    }

    return;
}

//-----------------------------------------------------------------------------
// Function: main
//
// Description:
//    Creates lock and scan threads, reporting stats every 10 sec
//
// Notes:
//
int __cdecl
main (
    int argc,
    char* argv [])
{
    LONG_PTR      threadId = 0;
    ThreadInfo    threadInfo [LAST_THREAD];
    DWORD_PTR     processAffinity;
    DWORD_PTR     threadAffinity;
    DWORD_PTR     systemAffinity;
    int           counter = 0;
    int           numberOfThreads = 0;

    // Retrieve process affinity
    //
    if (GetProcessAffinityMask (GetCurrentProcess (),
                                &processAffinity,
                                &systemAffinity) == FALSE)
    {
        printf ("GetProcessAffinityMask failed - exiting\n");
        exit (-1);
    }

    // Initialize thread information: first defaults and then check
    // if the client passed anything in
    //
    // Allocate all but one CPU for threads running locking code
    //
    threadInfo [LOCK_THREAD].func = lockThread;
    threadInfo [LOCK_THREAD].affinity = processAffinity & (processAffinity - 1);

    // Allocate one CPU for the scan thread
    //
    threadInfo [SCAN_THREAD].func = scanThread;
    threadInfo [SCAN_THREAD].affinity = threadInfo [LOCK_THREAD].affinity ^ processAffinity;

    // Read command line arguments: in a simple way
    //
    for (counter = 1; counter < argc; counter++)
    {
        switch (*(argv [counter] + 1))
        {
            case 's':
            case 'S':
            {
                // Retrieve affinity mask for scan threads
                //
                threadInfo [SCAN_THREAD].affinity =
                    (DWORD_PTR) _atoi64 ((argv [counter] + 2));
                break;
            }
            case 'l':
            case 'L':
            {
                // Retrieve affinity mask for lock threads
                //
                threadInfo [LOCK_THREAD].affinity =
                    (DWORD_PTR) _atoi64 ((argv [counter] + 2));
                break;
            }
            case '?':
            {
                printf ("Usage: test -sn -lm \n"
                        "Example: test.exe -s8 -l7 (runs 1 scanning thread and 3 locking threads)\n"
                        "-sn is affinity of threads performing memory scan \n"
                        "-lm is affinity of threads acquiring lock");
                exit (0);
            }
            default:
            {
                break;
            }
        }
    }

    // Start threads of all types based on supplied affinities
    //
    for (counter = 0; counter < LAST_THREAD; counter++)
    {
        while (threadInfo [counter].affinity != 0)
        {
            // Separate the affinity for the thread we are trying to
            // create from the full affinity mask
            //

            // Remember full mask
            //
            threadAffinity = threadInfo [counter].affinity;

            // Turn off least significant bit
            //
            threadInfo [counter].affinity &= threadInfo [counter].affinity - 1;

            // Calculate the affinity bit for the thread:
            // leave one bit on - the affinity bit
            //
            threadAffinity ^= threadInfo [counter].affinity;

            // Use _beginthread for simplicity
            //
            if (_beginthread (threadInfo [counter].func,
                              0,
                              (void*) threadAffinity) == -1)
            {
                printf ("Failed to create thread - exiting\n");
                exit (-1);
            }

            numberOfThreads++;
        }
    }

    // Loop and report stats periodically
    //
    while (1)
    {
        // Report every 10 sec
        //
        Sleep (10000);

        for (threadId = 0; threadId < numberOfThreads; threadId++)
        {
            printf ("Thread Id = %d\t, Thread Affinity = %p\t, Locked = %d\t, Spin = %d\t\n ",
                    threadId,
                    Stats [threadId].affinity,
                    Stats [threadId].locked,
                    Stats [threadId].spinned);

            // Reset run time data
            //
            Stats [threadId].locked = 0;
            Stats [threadId].spinned = 0;
        }
    }

    return 0;
}
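(If you want to try the listing yourself, it should compile from the Visual C++ command line with something along the lines of cl /O2 /MT test.cpp, assuming the file is saved as test.cpp; the exact switches may vary with your compiler version.)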


As you can see, one can specify affinity masks for the scan and lock threads at program startup.
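As a side note, if you are not sure which logical CPUs share a physical core on your box, you can query the topology instead of guessing. Below is a minimal sketch of my own (not part of the test program) that uses the Windows GetLogicalProcessorInformation API; I believe the API is available on Windows Server 2003 SP1 and later, and the sketch keeps buffer sizing and error handling to a bare minimum.

#include <stdio.h>
#include <windows.h>

// Print which logical processors share a physical core, so you know
// which affinity masks to pass to the test program.
//
int __cdecl main ()
{
    // Assumes 64 entries is enough; production code should size the buffer dynamically
    //
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION    info [64];
    DWORD    length = sizeof (info);

    if (GetLogicalProcessorInformation (info, &length) == FALSE)
    {
        printf ("GetLogicalProcessorInformation failed - exiting\n");
        return -1;
    }

    for (DWORD i = 0; i < length / sizeof (info [0]); i++)
    {
        if (info [i].Relationship == RelationProcessorCore)
        {
            // Each entry with RelationProcessorCore describes one physical core;
            // the mask shows which logical CPUs belong to it
            //
            printf ("Physical core: logical CPU mask = %p\n",
                    (void*) info [i].ProcessorMask);
        }
    }

    return 0;
}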


I ran several experiments on my dev box, which has 2 physical and 4 logical CPUs. CPUs 1 (affinity 0x1) & 3 (affinity 0x4), and CPUs 2 (affinity 0x2) & 4 (affinity 0x8), share physical CPUs respectively. The box runs Windows Server 2003 SP1. Below are the results:


test.exe -s8 -l7
Thread Id = 0  , Thread Affinity = 00000008    , Locked = 0    , Spin = 9035
Thread Id = 1  , Thread Affinity = 00000001    , Locked = 9290331      , Spin = 10373717
Thread Id = 2  , Thread Affinity = 00000002    , Locked = 676397       , Spin = 40879794
Thread Id = 3  , Thread Affinity = 00000004    , Locked = 10274030     , Spin = 42433521

test.exe -s4 -l11
Thread Id = 0  , Thread Affinity = 00000004    , Locked = 0    , Spin = 8943
Thread Id = 1  , Thread Affinity = 00000001    , Locked = 741166       , Spin = 11421139
Thread Id = 2  , Thread Affinity = 00000002    , Locked = 10747824     , Spin = 10024252
Thread Id = 3  , Thread Affinity = 00000008    , Locked = 10210221     , Spin = 13988134

test.exe -s1 -l14
Thread Id = 0  , Thread Affinity = 00000001    , Locked = 0    , Spin = 8472
Thread Id = 1  , Thread Affinity = 00000002    , Locked = 10550011     , Spin = 20156206
Thread Id = 2  , Thread Affinity = 00000004    , Locked = 722554       , Spin = 11211074
Thread Id = 3  , Thread Affinity = 00000008    , Locked = 11182506     , Spin = 25376166

test.exe -s2 -l13
Thread Id = 0  , Thread Affinity = 00000002    , Locked = 0    , Spin = 8585
Thread Id = 1  , Thread Affinity = 00000001    , Locked = 9900899      , Spin = 12267885
Thread Id = 2  , Thread Affinity = 00000004    , Locked = 8984095      , Spin = 11909080
Thread Id = 3  , Thread Affinity = 00000008    , Locked = 1297138      , Spin = 28199769


Great! The experiment confirms the theory. So does it mean you have to disable HT when using SQL Server? The answer: it really depends on the load and the hardware you are using.


You have to test your application with HT on and off under heavy loads to understand HT’s implications.


Keep in mind that the lazywriter is not the only thread that can cause a slowdown: any thread that performs a large memory scan, for example a worker thread that scans a large amount of data, might be a culprit as well.


For some customer applications we saw a 10% increase in performance when disabling HT. So make sure you do your homework before you decide to hyper or not to hyper 🙂


Hope this information was useful. Let me know if you have any questions!

Comments (35)

  1. We’ve noticed this effect too. Interestingly, we’ve also seen it on Terminal Server/Citrix machines. Turning off HT made an unusable system suddenly usable.

    I suspected the problem was something like this. What implications does this have for dual-proc machines? I’d guess very little as they don’t share cache and I presume there aren’t multiple system threads spawned (and I’d guess the code is a little smarter over cache coherency).

  2. Haidong Ji says:

    Great information Slava. Thanks for sharing.

    I’ve thoroughly enjoyed your blogs so far. Keep up the great work.

  3. John Stephens says:

    Can you please explain your results in more detail?

    I’m not sure what I’m supposed to be looking for.

  4. Scotty says:

    From your explanation I take it that multiple processors or multiple cores should not show the same problem because of the lack of shared L1 and L2 cache?

  5. slavao says:

    Q1: What implications does this have for dual-proc machines?

    Q2: From your explanation I take it that multiple processors or multiple cores should not show the same problem because of the lack of shared L1 and L2 cache?

    A1 & A2: That is correct. Since dual procs don’t share caches we shouldn’t see this behavior. The same holds for cores that share neither L1 nor L2.

    Q3: I’m not sure what I’m supposed to be looking for

    A3: From the post: …If the theory is correct a lock thread that shares a physical CPU with a scan thread should acquire the lock much less often than other threads in a given period of time… From the output above you can see that whenever a lock thread shares a physical CPU with the scan thread, it acquires the lock much less often compared to the other threads.

  6. audiofreak says:

    Shouldn’t you include Sleep(0) after SetThreadAffinityMask() before doing any work if you want to be sure that the thread runs on the core you requested?

    Also, shouldn’t pause instruction be inserted immediately BEFORE memory access instruction used to check the lock for it to have any effect?

  7. slavao says:

    Q1:Shouldn’t you include Sleep(0) after SetThreadAffinityMask() before doing any work if you want to be sure that the thread runs on the core you requested?

    A1:Nope, you don’t have to. If a call to the API succeeds the thread will be bound to the CPUs described by the mask. For more info see MSDN article http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/setthreadaffinitymask.asp.

    Q2:Also, shouldn’t pause instruction be inserted immediately BEFORE memory access instruction used to check the lock for it to have any effect?

    A2:No, you actually don’t have to. http://www.devx.com/assets/intel/9315.pdf has a good explanation for the pause

  8. guzmanda says:

    Thanks a lot for sharing, Slava.

    I ran a controlled application test on a 4-way (8 logical) server with HT disabled/enabled at the BIOS level and observed a 15-20% improvement with HT enabled. Perhaps my observation was contrary to your customer’s experience because our app had an OLTP profile, with a ratio of about 90% reads. It may be that the lazywriter didn’t have much work to do and there were few scans in my case.

    I believe this underscores your statement about the importance of application testing under load in order to understand HT implications.

  9. Jon A [UK] says:

    Just ran some testing with 6GB DB that is updated every minute 24/7 and has frequent reporting. One of the more CPU intensive reports runs:

    with HT enabled: 13 seconds

    with HT disabled (thru BIOS): 7 seconds

    Was testing with:

    Dell PowerEdge 2850

    W2003 Server Std

    SQL2005 Std Edition

    2 x XEON CPUs

    4GB RAM

    I’m going to do more experimenting before deciding on disabling on our production server.


  11. Jerry Foster says:

    When we switched to 2000 sp4, our system became almost unusable at times. We noticed a drastic increase in CPU usage but performance just tanked, especially under load. I wish Slava could run his test after upgrading to sp4 to see if it gets worse.

    We’ve considered disabling HT, but are afraid to risk making our current problems even worse.

    We do have before/after test environments, but unfortunately we simply cannot recreate the necessary load level from our production box to accurately test.

  12. Rosa says:

    Does setting the "max degree of parallelism" in SQL Server to the number of physical processors have the same effect as disabling HT? Or is it better to disable HT?

  13. jrobertobr says:

    I think reducing max degree of parallelism could cause a query to run only on the logical processors, because all processors are still available to SQL.

    I guess disabling the use of logical processors by SQL Server (Processor Control under SQL Server Properties), rather than disabling HT in the BIOS, should give the same performance improvements for SQL Server while leaving the OS with all the processors.

    Again, it’s all a case-by-case situation that must be tested before any action is taken.

  14. Rusty says:

    A well presented article. As a matter of precaution, on my SQL servers I disabled HT by default.

  15. We recently upgraded to SQL Server 2005 (from SQL Server 2000) and also simultaneously the hardware…

  16. Lara's Blog says:

    For those who are challenged with deciphering how to configure the max degree of parallelism on a server…


  18. Glen Sidelnikov says:

    Would this explain an actual performance degradation we are currently seeing on our production box? We are using an x64 8-CPU XEON box with 16 GB of RAM running Windows 2003 R2 Enterprise and SQL 2005 64-bit Enterprise. The "production" box is still in testing mode. There is no heavy load. A sproc batch that runs 15 seconds on the "test" box (two Xeon x86 CPUs with 4 GB of RAM running Windows 2003 32-bit + SQL 2005 Standard) runs 24 seconds on production.

    Thank you.

  19. slavao says:

    Most likely not. In order to find the real culprit you will need to do perf analysis:

    1. Do a hardware sanity check: disks, CPU cache sizes, etc. For example, it is better for 64-bit systems to have larger CPU caches.

    2. Check the query plan.

    3. Check perfmon counters and find possible discrepancies that will explain the degradation.

  20. MrViklund answered: re:After Intel launched the new dual core processors, is Hyperthreading still relevant? Is it still being used in new Intel processors?

  21. Kevin Kline says:

    Just a quick note to send a big Thank You to Christoph Stotz of Frankfurt, Germany for his hospitality

  22. The question arose in the middle of installing an application that made use of Analysis Services

  23. Welcome to the Dynamics Ax Performance Team’s blog. We’re putting together a team introduction and hope

  24. Assumptions : Dedicated SQL Server 2005 Server (does not run any other major applications besides SQL