SQL Server and the Bad CPU

From time to time I encounter an issue where the physical CPU is the source of the problem.   I hope you never encounter such and issue but I wanted to provide an example to help those that will.

Customer started getting frequent dumps from the SQL Server process.   Looking at the exception.log in the SQL Server LOG directory the following was repeated 100s of times.

Exception 0xc0000005 EXCEPTION_ACCESS_VIOLATION writing address 010BC7B5 at 0x010BC7B4
Exception 0xc0000005 EXCEPTION_ACCESS_VIOLATION writing address 010BC7B5 at 0x010BC7B4
....

What changed on the system, etc...   Nothing was obvious that would have lead to such behavior.

The problem was always the fstp instruction attempting to push st0 onto the stack as local variable storage.  This was happening for many queries, as this code is accessed in 197 different locations during the compile phase.  Even select * from sys.databases was encountering the problem from time to time. 

eax=65f3f290 ebx=47ce2fa0 ecx=65f3f290 edx=0000027f esi=47ce2bd0 edi=65f3f30c

eip=010bc7b4 esp=65f3f26c ebp=65f3f294 iopl=0 nv up ei pl nz na po nc

fpcw=027F: rn 53 puozdi fpsw=7A20: top=7 cc=1010 --p----- fptw=3FFF

fopcode=00FC fpip=001b:010bc7a3 fpdp=0023:b980389c

st0= 3.595386269724631416290e+0308 st1= 5.525770000000000436560e+0004

st2= 1.500800000000004047250e+0003 st3= 1.000000000000000000000e+0000

st4= 0.000000000000000000000e+0000 st5= 8.000000000039999559220e-0001

st6= 8.000000000039999559220e-0001 st7= 1.000000000000000000000e+0000

010bc7b4 dd5dd8 fstp qword ptr [ebp-28h] ss:0023:65f3f26c=47ce2c00010e4dad

Looking closer I noticed that st0 was a strange floating point value but SQL Sever compilation expects to handle floating point arithmetic errors and this was not a floating point exception but an access violation.

The error indicated it was a write issue and the write location should be EBP-28h (local stack).  The EBP and ESP were on a valid operating system page so we are not bumping into a guard page (like chkstk can do, triggering the access violation AV).

The strange part of the scenario was the writing address.   It was always 1 byte past the current EIP instead of EBP-28h.  It was also always the same address, even across SQL Server restarts.   This typically rules out any memory scribbler issues because the problem is not moving around.

I was debugging mini-dumps from the customer.  The mini-dump does not contain a copy of the image so I am looking at my image with the customers dump.   My first thought was that the SQLServr.exe image was damaged and the fstp instruction was really  fstp qword ptr[EIP] instead of fstp qword ptr[EBP-28h].   This would match the pattern that every time we attempted to execute the instruction we used the instruction pointer + 1 and we can't write to a code segment, and the issue would be repeatable.  (fstp is documented as storing the ST0 register at the specified location)

This theory did not hold up because SQL Dumper automatically checks the image at dump time and had not reported any image corruption in the sqldumper.log.  Also, 100s of queries were still working, it would just happen from time to time.  If the image was truly damaged it should happen on any basic compile operation.  I also collected a dump, including the image and it was not damaged.

I then started looking at the SQLDump####.txt files generated by the SQL Server during the handling of the exception.   I noticed that the same scheduler id continued to be in the mix during a specific process instance.  This points more to a hardware issue now because it seems like a single CPU might be at play.

To help prove my theory I setting the affinity mask (sp_configure) for every CPU.  For example if you have a 4 proc system you set the mask to 1111 (binary) decimal 15.  This locks (affinitizes) each logical, SQL Server, scheduler onto a specific CPU.   Without the affinity mask the threads are allowed to move between the CPUs and it can disrupt the pattern we are trying to obtain.    After locking each SQL Server scheduler all I had to do was keep checking the SQLDump###.txt files to see what scheduler was reported.    Then I can map the scheduler to the affinity mask and I know which CPU is not playing well.

Customer replaced the CPU and the problem went away.

Bob Dorr - Principal SQL Server Escalation Engineer