Debugging walkthrough: Access violation on nonsense instruction


A colleague of mine asked for help puzzling out a mysterious crash dump which arrived via Windows Error Reporting.

rax=00007fff219c5000 rbx=00000000023c8380 rcx=00000000023c8380
rdx=0000000000000000 rsi=00000000043f0148 rdi=0000000000000000
rip=00007fff21af2d22 rsp=000000000392e518 rbp=000000000392e580
 r8=00000000276e4639  r9=00000000043b2360 r10=00000000ffffffff
r11=0000000000000000 r12=0000000000000001 r13=0000000000000000
r14=000000000237cfc0 r15=00000000023d3ea0
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
nosebleed!CNosebleed::OnFrimble+0x1f891a:
00007fff`21af2d22 30488b          xor     byte ptr [rax-75h],cl ds:00007fff`219c4f8b=41

Well that's a pretty strange instruction. Especially since it doesn't match up with the source code at all.

void CNosebleed::OnFrimble(...)
{
    ...
    if (CanFrumble(...))
    {
        ...
    }
    else
    {
        hr = pCereal->AddMilk(pCarton);
        if (SUCCEEDED(hr))
        {
            pCereal->Snap();
            pCereal->Crackle(false);
            if (SUCCEEDED(pCereal->Pop(uId))) // ← crash here
            {
                ....
            }
        }
    }
    ....
}

There is no bit-toggling in the actual code. The method calls to Snap, Crackle, and Pop are all interface calls and therefore should compile to vtable calls. We are clearly looking at a bogus return address, possibly a stack smash (and therefore cause for concern from a security standpoint).
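For readers who don't stare at vtable call sites all day, here is roughly what an interface call such as pCereal->Crackle(false) turns into on x64. The interface shape below is invented for illustration; only the dispatch pattern (load the vtable pointer, call through a slot) matters, and it is the pattern you will see in the listings that follow.

// Illustrative sketch only: the interface layout is made up, but the
// generated-code pattern is the standard x64 vtable dispatch.
struct ICereal
{
    virtual long AddMilk(void* pCarton) = 0;   // long standing in for HRESULT
    virtual void Snap() = 0;
    virtual void Crackle(bool fCrackle) = 0;
    virtual long Pop(unsigned int uId) = 0;
};

void Example(ICereal* pCereal)
{
    // mov rax, qword ptr [rbx]   ; load the vtable pointer from the object
    // xor edx, edx               ; parameter: false
    // mov rcx, rbx               ; parameter: "this"
    // call qword ptr [rax+NNNh]  ; indirect call through the Crackle slot
    pCereal->Crackle(false);
}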

My approach was to try to figure out what was happening just before the crash. And that meant figuring out how we ended up in the middle of an instruction.

Here is the code surrounding the crash point.

00007fff`21af2d17 ff90d0020000    call    qword ptr [rax+2D0h]
00007fff`21af2d1d 488b03          mov     rax,qword ptr [rbx]
00007fff`21af2d20 8b5530          mov     edx,dword ptr [rbp+30h]
00007fff`21af2d23 488bcb          mov     rcx,rbx

Notice that the instruction that crashed is built from the last byte of the mov edx, dword ptr [rbp+30h] (the 30) combined with the first two bytes of the mov rcx, rbx (the 48 8b).
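To make the overlap concrete, here are those same bytes laid out; this is just a restatement of the dump above, with the alternate decoding noted in comments.

// The raw bytes around the crash point, as captured in the dump.
const unsigned char bytes[] = {
    0x8B, 0x55, 0x30,   // ...2d20: mov edx, dword ptr [rbp+30h]
    0x48, 0x8B, 0xCB,   // ...2d23: mov rcx, rbx
};
// Decoding from the faulting rip (...2d22, i.e. bytes[2]) instead yields
//   0x30, 0x48, 0x8B  ->  xor byte ptr [rax-75h], cl
// which is the nonsense instruction reported in the crash.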

Disassembling backward is a tricky business on a processor with variable-length instructions, so to get my bearings, I looked for the call to CanFrumble:

0:011> #CanFrumble nosebleed!CNosebleed::OnFrimble
nosebleed!CNosebleed::OnFrimble+0x1f883b
00007fff`21af2c43 e8e0e40f00 call nosebleed!CNosebleed::CanFrumble

The # command means "Start disassembling at the specified location and stop when you see the string I passed." This is an automated way of just hitting u until you get to the thing you are looking for.

Now that I am at some known good code, I can disassemble forward:

00007fff`21af2c48 488bcb          mov     rcx,rbx
00007fff`21af2c4b 84c0            test    al,al
00007fff`21af2c4d 0f849a000000    je      nosebleed!CNosebleed::OnFrimble+0x1f88e5 (00007fff`21af2ced)

The above instructions check whether CanFrumble returned true, and if not, jump to 00007fff`21af2ced. Since we know that we are in the false path (the crash point lies past the jump target), we follow the jump.

// Make a vtable call into pCereal->AddMilk()
00007fff`21af2ced 488b03          mov     rax,qword ptr [rbx] ; vtable
00007fff`21af2cf0 498bd7          mov     rdx,r15 ; pCarton
00007fff`21af2cf3 ff9068010000    call    qword ptr [rax+168h] ; AddMilk
00007fff`21af2cf9 8bf8            mov     edi,eax ; save to hr
00007fff`21af2cfb 85c0            test    eax,eax ; succeeded?
00007fff`21af2cfd 0f880dffffff    js      nosebleed!CNosebleed::OnFrimble+0x1f8808 (00007fff`21af2c10)

// Now call Snap()
00007fff`21af2d03 488b03          mov     rax,qword ptr [rbx] ; vtable
00007fff`21af2d06 488bcb          mov     rcx,rbx ; "this"
00007fff`21af2d09 ff9070020000    call    qword ptr [rax+270h] ; Snap

// Now call Crackle
00007fff`21af2d0f 488b03          mov     rax,qword ptr [rbx] ; vtable
00007fff`21af2d12 33d2            xor     edx,edx ; parameter: false
00007fff`21af2d14 488bcb          mov     rcx,rbx ; "this"
00007fff`21af2d17 ff90d0020000    call    qword ptr [rax+2D0h] ; Crackle

// Get ready to Pop
00007fff`21af2d1d 488b03          mov     rax,qword ptr [rbx] ; vtable
00007fff`21af2d20 8b5530          mov     edx,dword ptr [rbp+30h] ; uId
00007fff`21af2d23 488bcb          mov     rcx,rbx ; "this"

But we never got to execute the Pop because our return address from Crackle got messed up.

Let's follow the call into Crackle.
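In C++ terms, the two dps commands below just chase the same pointers by hand: read the vtable pointer out of the object, then read the function pointer out of slot 2D0h. A rough sketch of that chain (offsets taken from the call site above):

// What the dps commands below do, expressed as code. Illustration only.
void* FollowCrackleSlot(void* pCereal)
{
    // mov rax, qword ptr [rbx]   : read the vtable pointer from the object
    void** vtable = *static_cast<void***>(pCereal);

    // call qword ptr [rax+2D0h]  : the target is the pointer stored in that slot
    return vtable[0x2D0 / sizeof(void*)];
}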

0:011> dps @rbx l1
00000000`02b4b790  00007fff`219c50a0 nosebleed!CCereal::`vftable'
0:011> dps 00007fff`219c50a0+2d0 l1
00007fff`219c5370  00007fff`21aa5c28 nosebleed!CCereal::Crackle
0:011> u 00007fff`21aa5c28
nosebleed!CCereal::Crackle:
00007fff`21aa5c28 889163010000    mov     byte ptr [rcx+163h],dl
00007fff`21aa5c2e c3              ret

So at least the pCereal pointer seems to be okay. It has a vtable and the slot in the vtable points to the function we expect. The Crackle method merely stashes the bool parameter into a member variable. No stack corruption here because rbx is nowhere near rsp.
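For what it's worth, that two-instruction body is consistent with a method along these lines. The member name and layout below are guesses; the only thing taken from the dump is "store a bool at offset 0x163 and return."

// Hypothetical reconstruction of CCereal::Crackle from its disassembly
// (mov byte ptr [rcx+163h], dl / ret). Member names and padding are invented.
class CCereal
{
public:
    virtual void Crackle(bool fCrackle) { m_fCrackle = fCrackle; }

private:
    char m_other[0x15B] = {};  // stand-in for whatever occupies offsets 8..0x162
    bool m_fCrackle = false;   // lands at offset 0x163, right after the vptr and m_other
};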

0:012> db @rbx+163 l1
00000000`02b4b8f3  ??                                               ?

Sadly, the byte in question was not captured in the dump, so we cannot verify whether the call actually was made. Similarly, the members of CCereal manipulated by the Snap method were also not captured in the dump, so we can't verify that either. (The only member of CCereal captured in the dump is the vtable itself.)

So we can't find any evidence one way or the other as to whether any of the calls leading up to Pop actually occurred. Maybe we can figure out how many misaligned instructions we managed to execute before we crashed, and see if that reveals anything. To do this, I'm going to disassemble at varying incorrect offsets and see which ones lead to the instruction that crashed.

0:011> u .-1 l2
nosebleed!CNosebleed::OnFrimble+0x1f8919:
00007fff`21af2d21 55              push    rbp
00007fff`21af2d22 30488b          xor     byte ptr [rax-75h],cl
// ^^ this looks interesting; we'll come back to it

0:011> u .-3 l2
nosebleed!CNosebleed::OnFrimble+0x1f8917:
00007fff`21af2d1f 038b5530488b    add     ecx,dword ptr [rbx-74B7CFABh]
00007fff`21af2d25 cb              retf
// ^^ this doesn't lead to the crashed instruction

0:011> u .-4 l2
nosebleed!CNosebleed::OnFrimble+0x1f8916:
00007fff`21af2d1e 8b03            mov     eax,dword ptr [rbx]
00007fff`21af2d20 8b5530          mov     edx,dword ptr [rbp+30h]
// ^^ this doesn't lead to the crashed instruction

0:012> u .-5 l3
nosebleed!CNosebleed::OnFrimble+0x1f8914:
00007fff`21af2d1c 00488b          add     byte ptr [rax-75h],cl
00007fff`21af2d1f 038b5530488b    add     ecx,dword ptr [rbx-74B7CFABh]
00007fff`21af2d25 cb              retf
// ^^ this doesn't lead to the crashed instruction

0:012> u .-6 l3
nosebleed!CNosebleed::OnFrimble+0x1f8913:
00007fff`21af2d1b 0000            add     byte ptr [rax],al
00007fff`21af2d1d 488b03          mov     rax,qword ptr [rbx]
00007fff`21af2d20 8b5530          mov     edx,dword ptr [rbp+30h]
// ^^ this doesn't lead to the crashed instruction

Exercise: Why didn't I bother checking .-2?

You only need to test as far back as the maximum instruction length, and in practice you can give up much sooner, because reaching the maximum instruction length requires a lot of prefixes, which are unlikely to occur in real code.
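If you had to do this sort of probing often, you could automate it. Here is a rough sketch; insnLength is a hypothetical callback standing in for whatever disassembler you have handy (it is not a real debugger API):

#include <cstddef>
#include <cstdio>
#include <functional>

// Walk forward from each candidate start offset and report whether decoding
// resynchronizes exactly on the faulting offset -- the same thing the manual
// ".-N" probes above are doing.
void ProbeMisalignedStarts(
    std::size_t crashOffset, std::size_t maxRewind,
    const std::function<std::size_t(std::size_t)>& insnLength)
{
    for (std::size_t rewind = 1; rewind <= maxRewind; ++rewind) {
        std::size_t pos = crashOffset - rewind;
        while (pos < crashOffset) pos += insnLength(pos);
        std::printf(".-%zu %s the crashed instruction\n", rewind,
                    pos == crashOffset ? "reaches" : "does not reach");
    }
}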

The only single-instruction rewind that makes sense is the push rbp. Let's see if it matches.

0:011> ?? @rbp
unsigned int64 0x453e700
0:011> dps @rsp l1
00000000`0453e698  00000000`0453e700

Yup, it lines up. This wayward push is also consistent with the stack frame layout for the function.

nosebleed!CNosebleed::OnFrimble:
00007fff`218fa408 48895c2410      mov     qword ptr [rsp+10h],rbx
00007fff`218fa40d 4889742418      mov     qword ptr [rsp+18h],rsi
00007fff`218fa412 55              push    rbp
00007fff`218fa413 57              push    rdi
00007fff`218fa414 4154            push    r12
00007fff`218fa416 4156            push    r14
00007fff`218fa418 4157            push    r15
00007fff`218fa41a 488bec          mov     rbp,rsp
00007fff`218fa41d 4883ec60        sub     rsp,60h

The values of rbp and rsp should differ by 0x60.

0:012> ?? @rbp-@rsp
unsigned int64 0x68

The difference is in error by 8 bytes, exactly the size of the rbp register that was pushed.

It therefore seems highly likely that the push rbp was executed.

Repeating the exercise to find the instruction before the push rbp shows that no instruction fell through to the push rbp. Therefore, execution jumped to 00007fff`21af2d21 somehow.

Another piece of data is that rax matches the value we expect it to have, sort of. Here are some selected lines from earlier in the debug session:

// What we expected to have executed
00007fff`21af2d1d 488b03          mov     rax,qword ptr [rbx]

// The value we expected to have fetched
0:011> dps @rbx l1
00000000`02b4b790  00007fff`219c50a0 nosebleed!CCereal::`vftable'

// The value in the rax register
rax=00007fff219c5000 ...

The value we expect is 00007fff`219c50a0, but the value in the register has the bottom eight bits cleared.
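A quick way to convince yourself of that observation; the constants below are copied straight from the dump above.

// The expected vtable pointer with its low byte zeroed matches the value
// actually found in rax at the time of the crash.
static_assert((0x00007fff219c50a0ULL & ~0xFFULL) == 0x00007fff219c5000ULL,
              "rax == expected vtable pointer with the bottom eight bits cleared");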

Putting this all together, my theory is that the CPU executed the instruction at 00007fff`21af2d1d, and then due to some sort of hardware failure, instead of incrementing the rip register by three, it (1) incremented it by four, and then (2) as part of its confusion, zeroed out the bottom byte of rax. The erroneous rip led to the rogue push rbp and the crash on the nonsensical xor.

It's not a great theory, but it's all I got.

As to what sort of hardware failure could have occurred: This particular failure was reported twice, so a cosmic ray is less likely to be the culprit (because you have to get lightning to strike twice) than overheating or overclocking.

Comments (20)
  1. sense says:

    A nonsensical bug needs a nonsensical conclusion: It just makes sense!

    As a colleague of mine once said, when a program works illogically, don't try to fix it by using logic. Illogical problems need illogical solutions.

  2. Gabe says:

    If this sort of error happened more than once, could that point to a processor erratum?

  3. Cesar says:

    Does the dump capture the code, or just the data? If it doesn't capture the code, could it be that the code on disk on the victim's computer was corrupted, or that the code in the victim's memory (the disk cache, to be more precise) was corrupted?

    Another possible culprit: interrupts. An interrupt can happen anywhere, and a broken interrupt handler could either fail at restoring the state (unlikely since it's heavily exercised common code), or more likely corrupt the state save area (I don't know how it's on Windows, but on Linux the state save area is part of the kernel stack).

    Yet another possible culprit: SMI interrupts. They can happen anywhere (even when normal interrupts are disabled), what they do can depend on the motherboard and BIOS revision, they do mysterious things in response to unknown inputs (for instance, a temperature threshold being crossed could result in a fan speed adjustment, both the input and output being via magic undocumented registers), and they come from the BIOS (which does not have a reputation for code quality).

    And how about this one? The code might have been running under virtualization (does the dump have DMI strings which could reveal the presence of a VM?), and the virtualization software's JIT could have a bug which mistranslates a particular sequence of instructions.

    And there's the always-popular "blame a virus" option: some sort of badware hooking the software, and failing to correctly emulate the code it replaced.

  4. EduardoS says:

    The counting is incorrect:

    00007fff`21af2d22 =  0; Ok

    00007fff`21af2d21 = -1; Ok

    00007fff`21af2d1f = -3; Ok; Did you skip -2 because it was already disassembled in the correct path?

    00007fff`21af2d1e = -4; Ok;

    00007fff`21af2d1e = -5; Ops… Shouldn't it be -6? You skipped the correct return address

    00007fff`21af2d1e = -6; Ops… Shouldn't it be -7?

    Also, I am not sure you can blame overclocking for it, but I am sure you don't have enough information to pinpoint the exact reason.

  5. Azarien says:

    @Cesar: I'd say that VM bug still counts as (emulated) hardware error ;-)

  6. Killer{R} says:

    Did you have a full (user) memory dump rather than only a minidump, so you can be sure that the instructions you see really were in the user's memory and not loaded by your debugger from your local symbols/images store? I've seen 'magic' crashes due to a few corrupted bytes in an executable's disk image on a flash stick (I think the same can happen with an HDD/SSD). This would explain the double lightning strike: just virtually replace 2 bytes @21af2d20 with xor al, al (30 c0) and you get the same crash, if I didn't miss something important.

  7. Killer{R} says:

    However, from 'the byte in question was not captured in the dump,' I can assume that the dump was not full, so I'm 90% sure that either the image was corrupted on disk or FS cache memory corruption occurred, so a couple of executions crashed in the same place. However, one of the simple but useful tests after inserting new RAM into your computer is to perform a cached copy of several tens of GB of files and compare the results with fc /b. Sometimes the system works rather stably, but a couple of faulty bytes of RAM poison your data.. drop by drop..

    [The minidump captures the bytes near the crashing EIP, so if the code were corrupted in memory/on disk, the corrupted values would have been captured. -Raymond]
  8. rgorton says:

    The odds are high that pCereal->Crackle(false) or part of pCereal->Pop(uId) is running the destructor for pCereal, and the random values in memory are mostly correct.  You might try explicitly setting the pCereal pointer to 0xdeadbeefdeadbeef or some such in the destructor.

    The second most likely case is that there is a multi-threaded race condition.

  9. Killer{R} says:

    [The minidump captures the bytes near the crashing EIP, so if the code were corrupted in memory/on disk, the corrupted values would have been captured. -Raymond]

    If only the corrupted cache page was not discarded after the crash and reloaded from the image before the dump. But the chance of that happening twice reduces the odds of the whole disaster even further (if the code in both dumps was checked).

    BTW it would also be interesting to see Pop()'s code, and the code that checks the result of Pop(). Another possible cause is an erroneous decrement of some on-stack value inside Pop(), so that the return address was decremented by 10 and, after the return from Pop(), rip ended up pointing 10 bytes earlier than it should have.

    The corrupted rax could be explained by the optimizer treating the return value as fitting in a byte instead of a full-sized HRESULT. Showing the code that calls Pop and checks its result would give a hint whether that can be true, or whether eax can't have such a value, in which case the whole guess is incorrect.

  10. stephen says:

    Try !chkimg or capture 16 bytes before and 16 bytes after the EIP, perhaps it's inline hooked :-)

  11. Mark says:

    Killer{R}: highly unlikely the corrupted page would be freed and then reloaded from disk while the process is crashing.

  12. Killer{R} says:

    IMHO it's lim(0/0). The upper 0 is the probability of two different hardware malfunctions appearing as a crash on the same instruction; the lower 0 is the probability of one malfunction times the probability of that cache-page-eviction scenario happening. However, there is nothing strange about it happening: 'process crashing' means the crashed process's thread stopped execution (went to wait on something), then another process gets launched (debugger, werfault, etc.), and then that other process reads the memory of the crashed one. Under low-memory conditions the cache can easily be shrunk in between to make room for the debugger's stuff. Another example: it's a usual situation to get a full (kernel) memory dump in which the PEB/TEB/stack of some (usually, thanks to Murphy's law, the most important) process is paged out.

  13. cheong00 says:

    I read somewhere something similar to the following: the crash dump is generated by immediately paging out the current memory at the time of the crash to a separate file. (My memory of this is blurry, so I can't find a direct reference.)

    So I think that at the time a process crashes, the dump file is very, very unlikely to reflect pages that were freed and reloaded.

  14. EduardoS says:

    Killer{R}, if a given processor model is overclocked, it is more likely to fail in some particular cases (let's call it the "critical path"), and it is perfectly plausible for two users to overclock the same processor model.

    The only thing I find strange about this theory is that the least significant byte, or the second short instruction in the decoding window, would be the "critical path"; still, it's a plausible theory.

  15. mh says:

    Overclocking?  BUT IT WORKS WITH EVERYTHING ELSE!!!!

  16. > 00007fff`21af2c43 e8e0e40f00 call nosebleed!CNosebleed::OnFrimble

    Should be …::CanFrumble, surely

  17. C V says:

    Gotta wonder what the WER diagnostic checklist looks like.

    1) Check for HRESULT

    2) Check for exceptions in minidump

    100) Ask Raymond

  18. Danny says:

    Quote: "As to what sort of hardware failure could have occurred: This particular failure was reported twice, so a cosmic ray is less likely to be the culprit (because you have to get lightning to strike twice) than overheating or overclocking." Or this => http://www.zdnet.com/…/flipping-dram-bits-maliciously

  19. cheong00 says:

    @Danny: Unless you're running kernel code: since all programs are now loaded at random addresses, there are gaps in between, and that problem will most likely only affect your own program.

    If you're writing kernel code, you probably wouldn't want to read repetitively from the same RAM address anyway (unless you're doing it on purpose and have taken special measures, the location should already be in the CPU cache, so repeated reads from the RAM on the motherboard are not needed).

    So you can safely say that for most people this is a non-issue.

  20. Killer{R} says:

    There's no certainty about which physical addresses, and which 'gaps' between them, are used for a particular process's working set. Furthermore, they are continuously shuffled due to paging. So it's a real security risk; at the very least it gives a probability of crashing the system (DoS) from code running under an unprivileged account. More interesting is whether it's possible to perform some 'predictable' corruption, say changing some specific '0' to '1' in memory, that will give you root access or something like that :)

