Hardware bitflipping

Hello all; my name is Scott Olson and I work as an Escalation Engineer on the Microsoft Global Escalation Services team in Platforms support. I wanted to share an interesting problem that came up recently. A co-worker was running Windows Vista Ultimate x64 on their home machine and started getting random bugchecks after upgrading the RAM from 2GB to 4GB. Any combination of the modules totaling 2GB was fine; with 4GB installed, however, the system would bugcheck within 10 minutes of booting. Once I heard about this I wanted to look at the memory dump in the kernel debugger.

Here is what I found:

The system got the following bugcheck:

0: kd> .bugcheck
Bugcheck code 000000D1
Arguments fffff800`03a192d0 00000000`00000002 00000000`00000000 fffff980`064aa8b6

Tip: The help file included with the Debugging Tools For Windows contains a Bug Check Code Reference that includes details on how to parse the Bug Check code and its arguments. See: Help > Debugging Techniques > Bug Checks (Blue Screens) > Bug Check Code Reference

!analyze -v provides the following information for this bugcheck:

An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses. If kernel debugger is available get stack backtrace.
Arg1: fffff80003a192d0, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff980064aa8b6, address which referenced memory

Debugging Details:

READ_ADDRESS: fffff80003a192d0
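
The four arguments can also be decoded mechanically. What follows is only a minimal Python sketch of the labels that !analyze -v prints for this bugcheck; the function name and the dictionaries are illustrative assumptions, not part of any debugger API, and only the low IRQL values are listed.

```python
# Illustrative names only; this is not a debugger API.
# Low IRQL values on Windows; higher levels are simply shown as numbers.
IRQL_NAMES = {0: "PASSIVE_LEVEL", 1: "APC_LEVEL", 2: "DISPATCH_LEVEL"}

# Arg3 meanings as listed in the !analyze -v output above.
OPERATIONS = {0: "read", 1: "write"}


def describe_d1(args):
    """Label the four arguments of a stop 0xD1 the way !analyze -v does."""
    referenced, irql, operation, instruction = args
    return {
        "memory_referenced": f"{referenced:016x}",
        "irql": IRQL_NAMES.get(irql, str(irql)),
        "operation": OPERATIONS.get(operation, f"unknown ({operation})"),
        "referencing_address": f"{instruction:016x}",
    }


# Arguments copied from the .bugcheck output above.
args = (0xFFFFF80003A192D0, 0x2, 0x0, 0xFFFFF980064AA8B6)
for key, value in describe_d1(args).items():
    print(f"{key}: {value}")
```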


So with this data I can say that the system took a page fault on a read operation while trying to reference memory at fffff80003a192d0 at DISPATCH_LEVEL. OK, so let’s get the trap frame so we can get into the context of the system when the crash happened:

0: kd> kv 3
Child-SP RetAddr : Args to Child : Call Site
fffff800`03218f28 fffff800`0204da33 : 00000000`0000000a fffff800`03a192d0 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
fffff800`03218f30 fffff800`0204c90b : 00000000`00000000 fffffa80`0a3c6cf0 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x73
fffff800`03219070 fffff980`064aa8b6 : 00000000`00000002 00000000`00000000 00000000`000005e0 fffff800`03219220 : nt!KiPageFault+0x20b (TrapFrame @ fffff800`03219070)

Here is the trap frame. It looks like the system crashed while trying to reference memory at an offset from the stack pointer, rsp+0xD0 (the operand of the movzx instruction on the last line of the output):

0: kd> .trap fffff800`03219070
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000010 rcx=0000000000000011
rdx=0000000000000002 rsi=0000000000000000 rdi=0000000000000001
rip=fffff980064aa8b6 rsp=fffff80003219200 rbp=00000000000071d6
r8=fffff80003219280 r9=00000000000071d6 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
fffff980`064aa8b6 440fb78c24d0000000 movzx r9d,word ptr [rsp+0D0h] ss:0018:fffff800`032192d0=8c13

As you can see above, fffff800`032192d0 looks like valid memory (the debugger was able to read the value 8c13 from it) and shouldn’t normally cause a page fault on a read operation. At this point, I want to make sure the system did what it was told. I want to know what happened when the system trapped. To verify the faulting address I dumped the CR2 register to see what address was referenced when the page fault happened; this is also the first parameter in the bugcheck code for a stop 0xD1.
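
The address arithmetic is easy to check by hand. Here is a minimal Python sketch that uses the rsp value from the trap frame and Arg1 from the bugcheck; the function and variable names are illustrative only.

```python
MASK64 = (1 << 64) - 1


def effective_address(base, displacement):
    """Compute base + displacement with 64-bit wraparound."""
    return (base + displacement) & MASK64


rsp = 0xFFFFF80003219200                 # rsp from the trap frame
expected = effective_address(rsp, 0xD0)  # operand of movzx r9d,word ptr [rsp+0D0h]
faulting = 0xFFFFF80003A192D0            # Arg1 of the bugcheck (memory referenced)

print(f"expected {expected:016x}")
print(f"faulting {faulting:016x}")
print("match" if expected == faulting else "mismatch")
```

The computed operand address agrees with what the debugger printed next to the instruction, but not with the address the bugcheck reports.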

0: kd> r cr2
cr2=fffff80003a192d0

Comparing the faulting address (fffff800`03a192d0, Arg1 of the bugcheck) with the address the instruction in the trap frame should have read (rsp+0xD0 = fffff800`032192d0), it is clear that the two do not match exactly, so let’s look at how they differ. Here are both addresses converted into various formats (focus on the binary):

0: kd> .formats fffff800`032192d0
Evaluate expression:
Hex: fffff800`032192d0
Decimal: -8796040490288
Octal: 1777777600000310311320
Binary: 11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000
Chars: …..!..
Time: ***** Invalid FILETIME
Float: low 4.74822e-037 high -1.#QNAN
Double: -1.#QNAN

0: kd> .formats fffff800`03a192d0
Evaluate expression:
Hex: fffff800`03a192d0
Decimal: -8796032101680
Octal: 1777777600000350311320
Binary: 11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000
Chars: ……..
Time: ***** Invalid FILETIME
Float: low 9.49644e-037 high -1.#QNAN
Double: -1.#QNAN

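Notice that there is a one-bit difference between these two addresses: bit 23 (counting from 0 at the least significant bit) is 0 in the address the instruction computed and 1 in the address that faulted.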

11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000

11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000
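
Rather than comparing the binary strings by eye, the two addresses can be XORed together; any bit set in the result is a bit where they differ. A minimal Python sketch (the helper name is illustrative):

```python
def differing_bits(a, b):
    """Return the positions (0 = least significant) of bits where a and b differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


expected = 0xFFFFF800032192D0  # rsp+0xD0 from the trap frame
faulting = 0xFFFFF80003A192D0  # address that actually faulted

print(hex(expected ^ faulting))            # 0x800000
print(differing_bits(expected, faulting))  # [23]
```

A single set bit in the XOR result is the signature of a bit flip; a software bug that computed the wrong address would rarely produce a difference this clean.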

Since the software asked the system to do one thing and it did something different, this is clearly some type of hardware problem (most likely with the processor). I reported this back to the co-worker, who contacted their hardware vendor. This must have been a common problem with this vendor, because I found out later that they replied within 10 minutes with a recommendation to change the memory voltage in the BIOS. The memory voltage was set to Auto, which is the default. They recommended changing it from 1.85 volts to 2.1 volts. After making the change the system was stable with 4GB of RAM.

Comments (8)

  1. Ralph Gifford says:

    Very interesting.  I have an issue with my wife's PC and would like to try this fix on it.

  2. Bender says:

    Very nice article. Good thing that hardware issues can also be diagnosed with a memory test like the Vista integrated or memtest86+. Good news for the non-kernel debugger enlightened users like me 🙂

  3. Hello NTDebuggers, we have been very impressed with the responses we’ve gotten to our previous puzzlers

  4. From elsewhere in the collective.

  5. Igor Levicki says:

    Strange way of approaching the troubleshooting.

    I would have booted into DOS and used goldmem or memtest86+ to check for memory errors; that would catch the bit without any of the above brainstorming over a crash dump being required.

    Furthermore, you usually do not add RAM sticks with different timings, bank sizes, or voltage requirements to the system — if you can’t find the same RAM you pull the old one out.

    2.1V is probably not necessary unless it is a higher clocked / high performance DDR2 memory. Better check the specification: RAM can overheat and the BIOS can enable thermal throttling, so the system might run slower with more voltage than necessary.

    Finally, if they installed 4 RAM sticks instead of 2 then the "solution" of raising voltage might not have to do with RAM but with a crappy mainboard and northbridge which aren’t able to supply enough "juice" to drive all 4 memory modules.

    In any case, I would try running goldmem and memtest86+ at 1.9v, 1.95v, etc until I find the minimum voltage at which the system is stable.

  6. wolf550e says:

    memtest86 should be able to diagnose such a problem, without digging into windows.

  7. This is exactly the kind of walk-through that teaches tips and techniques everyone computer literate and in charge of designing, building, and supporting systems should add to their tool belt. I want to say that this hasn’t helped me resolve a problem right now, but adding skills is always a good thing. But there isn’t a ‘not yet’ button.

    When I press the shiny green button for "Did this blog post help you resolve a problem?", I get an error:

    500 – Internal server error.

    There is a problem with the resource you are looking for, and it cannot be displayed.

    Keep these articles coming!

    And yes, memory testing programs can and do find problems. But I’ve seen strange errors that testing programs don’t find but operating systems, applications, and games do find. Ask me about the 6502 add bug sometime… (showing my gray hair)

  8. ........... says:

    The first thing you should assume about a blue screen is that it is caused by software (drivers), so opening the crash dump with a debugger is the correct first step. It's most likely going to be a waste of time if you use Memtest86 as the first step. If you see an obvious exception/instruction mismatch or a bit flip in the debugger, or if the bugcheck error code and/or the stack trace is different in each dump (especially if you get a random stack trace even with Verifier enabled), then use a hardware stress test program (like Memtest86).

    [It's good to see our older articles are still generating interest.]