Hardware bitflipping

Hello all, my name is Scott Olson and I work as an Escalation Engineer on the Microsoft Global Escalation Services team in Platforms support. I wanted to share an interesting problem that came up recently. A co-worker was running Windows Vista Ultimate x64 on their home machine and started getting random bugchecks after upgrading the RAM from 2GB to 4GB. Any combination of the RAM totaling 2GB was fine; however, with 4GB of RAM installed the system would bugcheck within 10 minutes of booting. Once I heard about this I wanted to look at the memory dump in the kernel debugger.

Here is what I found:

The system got the following bugcheck:

0: kd> .bugcheck
Bugcheck code 000000D1
Arguments fffff800`03a192d0 00000000`00000002 00000000`00000000 fffff980`064aa8b6

Tip: The help file included with the Debugging Tools For Windows contains a Bug Check Code Reference that includes details on how to parse the Bug Check code and its arguments. See: Help > Debugging Techniques > Bug Checks (Blue Screens) > Bug Check Code Reference

!analyze -v provides the following information for this bugcheck:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses. If kernel debugger is available get stack backtrace.
Arguments:
Arg1: fffff80003a192d0, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff980064aa8b6, address which referenced memory

Debugging Details:
------------------

READ_ADDRESS: fffff80003a192d0

CURRENT_IRQL: 2
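
To make the argument layout concrete, here is a minimal Python sketch that turns the four 0xD1 arguments listed above into a one-line summary. The helper name describe_d1 and the small lookup tables are my own illustration and are not part of the debugger; the tables only cover the values relevant to this dump.

```python
# Hypothetical helper (not a debugger API): summarize bugcheck 0xD1 arguments.
# Only the low IRQL names and the two operation codes shown in the
# !analyze output above are covered.
IRQL_NAMES = {0: "PASSIVE_LEVEL", 1: "APC_LEVEL", 2: "DISPATCH_LEVEL"}
OPERATIONS = {0: "read", 1: "write"}

def describe_d1(arg1, arg2, arg3, arg4):
    """Return a readable summary of a DRIVER_IRQL_NOT_LESS_OR_EQUAL bugcheck."""
    operation = OPERATIONS.get(arg3, "unknown operation")
    irql = IRQL_NAMES.get(arg2, "IRQL %d" % arg2)
    return "%s of %016x at %s by the instruction at %016x" % (
        operation, arg1, irql, arg4)

# Values copied from the .bugcheck output in this dump.
summary = describe_d1(0xfffff80003a192d0, 0x2, 0x0, 0xfffff980064aa8b6)
print(summary)
```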

So with this data I can say that the system took a page fault on a read operation while trying to reference the memory at fffff80003a192d0 at DISPATCH_LEVEL. OK, so let's get the trap frame so we can get into the context of the system when the crash happened:

0: kd> kv 3
Child-SP RetAddr : Args to Child : Call Site
fffff800`03218f28 fffff800`0204da33 : 00000000`0000000a fffff800`03a192d0 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
fffff800`03218f30 fffff800`0204c90b : 00000000`00000000 fffffa80`0a3c6cf0 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x73
fffff800`03219070 fffff980`064aa8b6 : 00000000`00000002 00000000`00000000 00000000`000005e0 fffff800`03219220 : nt!KiPageFault+0x20b (TrapFrame @ fffff800`03219070)

Here is the trap frame. It looks like the system crashed while trying to reference memory at an offset from the stack pointer, rsp+0xD0:

0: kd> .trap fffff800`03219070
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000010 rcx=0000000000000011
rdx=0000000000000002 rsi=0000000000000000 rdi=0000000000000001
rip=fffff980064aa8b6 rsp=fffff80003219200 rbp=00000000000071d6
r8=fffff80003219280 r9=00000000000071d6 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
tcpip!InetInspectReceiveDatagram+0xf6:
fffff980`064aa8b6 440fb78c24d0000000 movzx r9d,word ptr [rsp+0D0h] ss:0018:fffff800`032192d0=8c13

As you can see above, fffff800`032192d0 looks like valid memory and shouldn't normally cause a page fault on a read operation. At this point, I want to make sure the system did what it was told, so I want to know what actually happened when the system trapped. To verify the faulting address I dumped the CR2 register to see what address was referenced when the page fault happened; this is also the first parameter of the bugcheck for a stop 0xd1.

0: kd> r cr2
cr2=fffff80003a192d0
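
The same check can be done with a few lines of arithmetic. The following is a minimal Python sketch, using the register values from the trap frame above, that computes the address the movzx instruction should have read (rsp plus the 0xD0 displacement) and compares it with CR2. The variable names are mine.

```python
# Values copied from the .trap and "r cr2" output above.
rsp = 0xfffff80003219200           # stack pointer in the trap frame
displacement = 0xD0                # from: movzx r9d, word ptr [rsp+0D0h]
cr2 = 0xfffff80003a192d0           # faulting address recorded by the CPU

# Effective address the instruction asked for, wrapped to 64 bits.
expected = (rsp + displacement) & 0xFFFFFFFFFFFFFFFF

print("expected: %016x" % expected)
print("cr2:      %016x" % cr2)
print("match:    %s" % (expected == cr2))
```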

Looking at this address, it is clear that it does not exactly match the one in the trap frame, so let's look at how the two addresses differ. Here are the address from the trap frame (rsp+0xD0) and the page fault address from CR2, converted into various formats (focus on the binary):

0: kd> .formats fffff800`032192d0
Evaluate expression:
Hex: fffff800`032192d0
Decimal: -8796040490288
Octal: 1777777600000310311320
Binary: 11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000
Chars: .....!..
Time: ***** Invalid FILETIME
Float: low 4.74822e-037 high -1.#QNAN
Double: -1.#QNAN

0: kd> .formats fffff800`03a192d0
Evaluate expression:
Hex: fffff800`03a192d0
Decimal: -8796032101680
Octal: 1777777600000350311320
Binary: 11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000
Chars: ........
Time: ***** Invalid FILETIME
Float: low 9.49644e-037 high -1.#QNAN
Double: -1.#QNAN

Notice that there is a one-bit difference between these two addresses (bit 23, counting from zero at the least significant bit):

11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000

11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000
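
Rather than comparing the binary strings by eye, you can XOR the two values and list the bits that are set in the result. This is a minimal Python sketch of that idea; the function names flipped_bits and grouped_binary are hypothetical helpers of my own, not debugger commands.

```python
def flipped_bits(a, b):
    """Return the positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(64) if (diff >> i) & 1]

def grouped_binary(value):
    """Format a 64-bit value as eight space-separated bytes, like .formats."""
    bits = format(value & 0xFFFFFFFFFFFFFFFF, "064b")
    return " ".join(bits[i:i + 8] for i in range(0, 64, 8))

expected = 0xfffff800032192d0   # rsp+0xD0 from the trap frame
faulted = 0xfffff80003a192d0    # address reported in CR2

print(grouped_binary(expected))
print(grouped_binary(faulted))
print(grouped_binary(expected ^ faulted))   # a single 1 marks the flipped bit
print(flipped_bits(expected, faulted))
```

A result with exactly one entry means the two addresses differ by a single bit, which is the pattern to look for when you suspect a hardware bit flip rather than a software bug.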

Since the software asked the system to do one thing and it did something different, this is clearly some type of hardware problem (most likely with the processor). I reported this back to the co-worker and they contacted their hardware vendor. This must have been a common problem with this vendor, because I found out later that they replied within 10 minutes of being contacted with a recommendation to change the memory voltage in the BIOS. The memory voltage was set to Auto, which is the default. They recommended it be changed from 1.85 volts to 2.1 volts. After making the change the system was stable with 4GB of RAM.