What Should Never Happen... Did

Hi, this is Bob Golding; I wanted to write a blog about an interesting hardware issue I ran into. Hardware problems can be tricky to isolate. I recently came across one that I thought was interesting and gave an example of how to trace code execution.  The machine executed the filler “int 3” instructions generated by the compiler.  Execution should never reach these filler instructions, so we needed to determine how the instruction pointer got there.

What was the issue?

The issue was a bug check 8E (unhandled exception).  The exception was a debug exception (80000003), because a filler INT 3 instruction was executed.

 

1: kd> .bugcheck

Bugcheck code 0000008E

Arguments 80000003 8082e3e0 f78aec38 00000000

 

Below is the trap frame (the trap frame is the third argument in the bugcheck code). Note that the actual trapping instruction is 8082e3e0, the instruction pointer is incremented before the INT 3 generates a trap. The correct EIP is reported in the bug check values.

 

1: kd> .trap f78aec38

ErrCode = 00000000

eax=0e364a01 ebx=f78aed50 ecx=f78aecb8 edx=000000e1 esi=f7727ec0 edi=f75d9ca2

eip=8082e3e1 esp=f78aecac ebp=f78aecc0 iopl=0         nv up ei pl nz na po nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00000202

nt!KiCheckForKernelApcDelivery+0x37:

8082e3e1 cc              int     3

 

The machine executed code at the end of a function that should never be executed.  This should never happen.

 

1: kd> ub 8082e3e1

nt!KiCheckForKernelApcDelivery+0x1c:

8082e3c6 32c9            xor     cl,cl

8082e3c8 ff2508118080    jmp     dword ptr [nt!_imp_KfLowerIrql (80801108)]

8082e3ce 64a124010000    mov     eax,dword ptr fs:[00000124h]

8082e3d4 c6403d01        mov     byte ptr [eax+3Dh],1

8082e3d8 ff2598108080    jmp     dword ptr [nt!_imp_HalRequestSoftwareInterrupt (80801098)]

8082e3de cc              int     3 <<< This is after the function.

8082e3df cc              int     3

8082e3e0 cc              int     3

 

OK, So Now What?

Now, we need to find the execution path that caused the machine to execute this INT 3. There are places to look to find clues that will tell us. The first place to start looking is the stack. If a “call” instruction was made, the return will be pushed on the stack. This way we can try to determine if we arrived at this bad instruction pointer from a call or a ret instruction.

 

Using the value from esp in the above trap frame, let’s dump the stack.

 

1: kd> dps f78aecac

f78aecac  80a5d8fc hal!HalpIpiHandler+0xcc <<< Interesting?

f78aecb0  f78aecc0

f78aecb4  00000000

f78aecb8  00000002

f78aecbc  000000e1

f78aecc0  f78aed50

f78aecc4  f75d9ca2

f78aecc8  badb0d00

f78aeccc  000086a8

 

In looking at the stack dump, we see that there may have been a call from HalpIpiHandler.  Let’s dump the code leading up to hal!HalpIpiHandler+0xcc to see what it did.

 

1: kd> ub 80a5d8fc

hal!HalpIpiHandler+0xb1:

80a5d8e1 e8d82b0000      call    hal!HalBeginSystemInterrupt (80a604be)

80a5d8e6 0ac0            or      al,al

80a5d8e8 7509            jne     hal!HalpIpiHandler+0xc3 (80a5d8f3)

80a5d8ea 83c408          add     esp,8

80a5d8ed ff2510b0a580    jmp     dword ptr [hal!_imp_Kei386EoiHelper (80a5b010)]

80a5d8f3 6a00            push    0

80a5d8f5 55              push    ebp

80a5d8f6 ff1520b0a580    call    dword ptr [hal!_imp__KiIpiServiceRoutine (80a5b020)]

 

In the above assembly, we can see that there is a call made using a pointer in the import table.  Now, let’s have a look at that pointer.

 

1: kd> dd 80a5b020 l 1

80a5b020 8082e3e4

 

The pointer is very close to the instruction we trapped on.  Is this a coincidence?

 

It looks like due to an effective address calculation failure, the machine starting executing at 8082e3e0 instead of 8082e3e4.  Somewhere in the data path the processor executing this instruction stream dropped bit three, turning a 4 into a 0.

 

1: kd> ?0y0100

Evaluate expression: 4 = 00000000`00000004

1: kd> ?0y0000

Evaluate expression: 0 = 00000000`00000000

 

What does all of this mean?

There is some circumstantial evidence here that the machine was in the IPI handler.  The IPI Handler is used in multiprocessor systems so that one processor may interrupt another.  So, how can we further prove this is where we were?  Let’s try to match the trap frame registers to the assembly from HalpIpiHandler before it calls KiIpiServiceRoutine.

 

1: kd> .trap f78aec38

ErrCode = 00000000

eax=0e364a01 ebx=f78aed50 ecx=f78aecb8 edx=000000e1 esi=f7727ec0 edi=f75d9ca2

eip=8082e3e1 esp=f78aecac ebp=f78aecc0 iopl=0         nv up ei pl nz na po nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00000202

nt!KiCheckForKernelApcDelivery+0x37:

8082e3e1 cc              int     3

1: kd> ub 80a5d8fc l10

hal!HalpIpiHandler+0x94:

80a5d8c4 c74508000ddbba  mov     dword ptr [ebp+8],0BADB0D00h <<< EBP+ 8 does equal badb0d00

80a5d8cb 895d00          mov     dword ptr [ebp],ebx <<< ebp does equal ebx

 

80a5d8ce 897d04          mov     dword ptr [ebp+4],edi <<< ebp+4 does equal edi

80a5d8d1 68e1000000      push    0E1h

80a5d8d6 83ec04          sub     esp,4

80a5d8d9 54              push    esp

80a5d8da 68e1000000      push    0E1h

80a5d8df 6a1d            push    1Dh

80a5d8e1 e8d82b0000      call    hal!HalBeginSystemInterrupt (80a604be)

80a5d8e6 0ac0            or      al,al <<< al is equal to 1

80a5d8e8 7509            jne     hal!HalpIpiHandler+0xc3 (80a5d8f3)

80a5d8ea 83c408          add     esp,8

80a5d8ed ff2510b0a580    jmp     dword ptr [hal!_imp_Kei386EoiHelper (80a5b010)]

80a5d8f3 6a00            push    0

80a5d8f5 55              push    ebp

80a5d8f6 ff1520b0a580    call    dword ptr [hal!_imp__KiIpiServiceRoutine (80a5b020)]

 

Below is the stack data that the instructions above produced (borrowed from the dps output above). It matches, so the machine did execute the above instructions.

 

f78aecac  80a5d8fc  hal!HalpIpiHandler+0xcc <<< Return Pushed by call instruction

f78aecb0  f78aecc0 <<< Pushed EBP

f78aecb4  00000000 <<< Pushed a 0

 

Finally, the last bit of evidence:

Let's view the IPI state using the debugger command !ipi.  From this output we can see that processor 1 is the receiver of a cross interrupt from Processor 0.  This is consistent with the data we found on the stack.

 

1: kd> !ipi

IPI State for Processor 0

 

    As a sender, awaiting packet completion from processor 1.

 

    TargetSet             2  PacketBarrier         e  IpiFrozen      2 [Frozen]

    IpiFrame       8089a570  SignalDone     00000000  RequestSummary 0

    Packet State     Active                           WorkerRoutine  nt!KiFlushTargetMultipleTb

    Parameter[0]   00000000  Parameter[1]   80899f10  Parameter[2]   80899f04

 

IPI State for Processor 1

 

    As a receiver, the following unhandled requests are pending: [Packet] [DPC]

 

    TargetSet             0  PacketBarrier         0  IpiFrozen      0 [Running]

    IpiFrame       b5ad7be8  SignalDone     ffdff120  RequestSummary 2 [DPC]

    Packet State      Stale                           WorkerRoutine  nt!KiFlushTargetMultipleTb

    Parameter[0]   00000000  Parameter[1]   b5ad7950  Parameter[2]   b5ad7948

 

What went wrong?

Based on the evidence in this dump it appears that call instruction transferred execution to the wrong address.  The machine ended up executing at address 8082e3e0 instead of 8082e3e4, a single bit difference.  This same bit was flipped in several crashes from this machine, so all the evidence pointed to a faulty processor.  After replacing the processor we were running on when we bugchecked, the issue did not occur again.

Hardware can sometimes cause pretty specific failures, such as the flipped bit we see here.  To determine that this failure was a hardware issue, we had to reconstruct the execution path and trace how we ended up at the failing instruction.  We were able to match the register contents to what they would have been before the call to KiIpiServiceRoutine.  This demonstrated that the call should have been made to KiIpiServiceRoutine, but it unexpectedly went to the wrong address.