Sample debugging session without symbols

I was asked to debug some code where we roughly knew what the source was doing but didn’t have access to the symbols. This gave me a good chance to dust off some old ASM knowledge and practice the art of debugging without symbols. It's not an ideal situation, but sometimes it is all you have to work with.

In order to protect some internal stuff, the details here have been altered to represent the core problem without disclosing sensitive material.

The following line of code ended up producing an Access Violation exception. A null pointer was being dereferenced:

    hr = foo->bar(value1, static_cast<unsigned long>(value2));

When I first saw the code, I assumed the foo variable was nullptr. However, it was a valid pointer. Where was the nullptr dereference being made?

The faulting call stack terminated with the following frames:

0072dda0 100260f5 00b8737c 5e9365d8 00000091 0x0
0072dddc 10027f57 51be7d60 5197d7b8 5197d788 foo!baz+0x30e5
0072ddfc 100267de 00000000 00000000 00000000 foo!baz+0x4f47
0072de2c 10025c2e 0072de7c 0072de58 1d6507fc foo!baz+0x37ce
0072de98 10025951 00b8737c 0072dec0 0072e7e8 foo!baz+0x2c1e
0072e6c4 555fb5fb 008189f8 00b8737c 0072e914 foo!baz+0x2941

Using the WinDbg ‘ub’ command, I dumped out the x86 ASM that executed leading up to the point of the exception:

mov    eax,dword ptr [ebp-1Ch]
push   dword ptr [ebp-0Ch]
push   esi
mov    eax,dword ptr [eax+4]
push   eax
mov    ecx,dword ptr [eax]
mov    esi,dword ptr [ecx+30h]
mov    ecx,esi
call   esi // This is calling into the ‘bar’ method

Plugging in the ChildEBP of the calling frame (0072dddc), I annotated the instruction flow as follows:

// ebp = 0072dddc (from faulting call stack)
mov     eax,dword ptr [ebp-1Ch] // eax = *(0072dddc - 0x1c) = *(0072ddc0) = 0072de7c = this
push    dword ptr [ebp-0Ch]     // push *(0072dddc - 0xc) = *(0072ddd0) = 00000091 = value2
push    esi                     // push esi = 0x5e9365d8 = value1
mov     eax,dword ptr [eax+4]   // eax = *(eax+4) = *(0072de7c+4) = *(0072de80) = 00b8737c = this->foo
                                // where: *this->foo = blah_555e0000!xyz+0x135fe
push    eax                     // push eax = 00b8737c
mov     ecx,dword ptr [eax]     // ecx = *(eax) = *(00b8737c) = 55604874 = blah_555e0000!xyz+0x135fe
mov     esi,dword ptr [ecx+30h] // esi = *(55604874+0x30) = *(556048a4) = 00000000
mov     ecx,esi                 // ecx = esi = 00000000
call    esi                     // call 00000000 = the 0x0 frame at the top of the faulting call stack
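
To double-check the address arithmetic in the walk-through above, the same instruction sequence can be replayed over a tiny model of memory. This is only a sketch: `Memory`, `CallSetup`, `replay_call_setup` and `memory_from_walkthrough` are names I made up for illustration, and the memory map holds nothing but the five dwords quoted in the annotations.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Sparse model of the 32-bit address space: address -> dword stored there.
using Memory = std::map<std::uint32_t, std::uint32_t>;

// What the instruction sequence leaves behind just before `call esi`.
struct CallSetup {
    std::vector<std::uint32_t> pushed;  // values in push order
    std::uint32_t target;               // value left in esi, i.e. the call target
};

// Hypothetical helper: replays the instructions listed by `ub`, one per line.
// Memory::at throws if the walk touches an address that is not in the model.
inline CallSetup replay_call_setup(const Memory& mem, std::uint32_t ebp, std::uint32_t esi) {
    CallSetup out{};
    std::uint32_t eax = mem.at(ebp - 0x1C);    // mov  eax,dword ptr [ebp-1Ch]
    out.pushed.push_back(mem.at(ebp - 0x0C));  // push dword ptr [ebp-0Ch]
    out.pushed.push_back(esi);                 // push esi
    eax = mem.at(eax + 4);                     // mov  eax,dword ptr [eax+4]
    out.pushed.push_back(eax);                 // push eax
    std::uint32_t ecx = mem.at(eax);           // mov  ecx,dword ptr [eax]
    out.target = mem.at(ecx + 0x30);           // mov  esi,dword ptr [ecx+30h]
    return out;
}

// The dwords quoted in the annotated instruction flow, nothing more.
inline Memory memory_from_walkthrough() {
    return Memory{
        {0x0072DDC0u, 0x0072DE7Cu},  // [ebp-1Ch] = this
        {0x0072DDD0u, 0x00000091u},  // [ebp-0Ch] = value2
        {0x0072DE80u, 0x00B8737Cu},  // this->foo
        {0x00B8737Cu, 0x55604874u},  // vtbl pointer stored in *foo
        {0x556048A4u, 0x00000000u},  // vtbl slot at +30h
    };
}
```

Replaying with ebp = 0072dddc and esi = 5e9365d8 pushes 00000091, 5e9365d8 and 00b8737c in that order, and leaves 00000000 as the call target.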

The stack looked like it was set up correctly:

0:000> dd 0072dda0
0072dda0 00000091 100260f5 00b8737c 5e9365d8     // return address=100260f5, this->foo=00b8737c, value1=5e9365d8
0072ddb0 00000091 5197d7b8 5197d788 00000000     // value2=0x91
0072ddc0 0072de7c 5e9365d8 51be7df8 00000000
0072ddd0 00000091 00000000 1d650460 0072ddfc
0072dde0 10027f57 51be7d60 5197d7b8 5197d788
0072ddf0 057684b8 00000000 00b873b0 0072de2c
0072de00 100267de 00000000 00000000 00000000
0072de10 00b83538 5197d788 0072de60 05750048

The code reads a function pointer from the vtbl at offset 0x30, and that slot held null, so ‘call esi’ jumped to address 0. The foo pointer itself was fine; the vtbl layout was not what the caller expected. Thus, I suspected the two sides of the software did not agree on the contract between them. When I ran my local version of this code, where I know the two sides agree on the contract, the same instruction that precedes the crash instead set esi to the entry point of blah!bar (where the class implementing the interface comes from the ‘blah’ module).
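
To make the failure mode concrete, here is a minimal sketch of how a layout disagreement can leave a null in the slot the caller reads. Everything in it is hypothetical: I don't know how the real contract differed, so the example simply assumes the provider was built from a revision with one method fewer ahead of bar. The vtbl is modeled as an explicit array of function pointers so the behavior stays well defined; a genuine C++ vtable mismatch is undefined behavior.

```cpp
#include <array>
#include <cstddef>

// Hypothetical signature standing in for every method on the interface.
using Method = long (*)(void* self, unsigned long a, unsigned long b);

constexpr std::size_t kPointerSize = 4;                      // 32-bit process, as in the dump
constexpr std::size_t kBarOffset = 0x30;                     // offset the caller was compiled against
constexpr std::size_t kBarSlot = kBarOffset / kPointerSize;  // slot 12
constexpr std::size_t kTableSlots = 16;                      // arbitrary size for the model

using VTable = std::array<Method, kTableSlots>;

inline long some_method(void*, unsigned long, unsigned long) { return 0; }
inline long bar_impl(void*, unsigned long a, unsigned long b) { return static_cast<long>(a + b); }

// Provider that agrees with the caller: bar lives in slot 12.
inline VTable make_matching_table() {
    VTable t{};  // value-initialized, so every slot starts out null
    for (std::size_t i = 0; i < kBarSlot; ++i) t[i] = some_method;
    t[kBarSlot] = bar_impl;
    return t;
}

// Provider from a different revision (assumption): bar lives in slot 11,
// and slot 12 is left null.
inline VTable make_mismatched_table() {
    VTable t{};
    for (std::size_t i = 0; i + 1 < kBarSlot; ++i) t[i] = some_method;
    t[kBarSlot - 1] = bar_impl;
    return t;
}

// What the caller does: fetch the pointer at its compiled-in slot and call it.
// Unlike the real code, this checks for null instead of jumping to address 0.
inline bool call_bar(const VTable& t, unsigned long a, unsigned long b, long& result) {
    Method m = t[kBarSlot];
    if (m == nullptr) return false;
    result = m(nullptr, a, b);
    return true;
}
```

With the matching table, call_bar succeeds; with the mismatched one it reports failure, which is the point where the real process took the access violation.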

Using the WinDbg ‘dds’ command on my local build, I was able to dump out the vtbl for the class with symbols resolved. This confirmed that the method in question was indeed at offset 0x30 from the start of the vtbl:

0:000> dds 00224e28
// … extraneous data omitted
00224e58 00240310 blah!bar
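
The offset can be read straight off the two addresses in that dump. As a quick sanity check, assuming 4-byte pointers (this is a 32-bit process), the arithmetic looks like this; the helper names are mine:

```cpp
#include <cstdint>

constexpr std::uint32_t kDwordSize = 4;  // pointer size in a 32-bit process

// Byte offset of a slot from the start of the vtbl.
constexpr std::uint32_t slot_offset(std::uint32_t vtbl_start, std::uint32_t slot_address) {
    return slot_address - vtbl_start;
}

// Zero-based slot index for a given byte offset.
constexpr std::uint32_t slot_index(std::uint32_t byte_offset) {
    return byte_offset / kDwordSize;
}

// Addresses taken from the dds output: vtbl at 00224e28, blah!bar at 00224e58.
static_assert(slot_offset(0x00224E28u, 0x00224E58u) == 0x30u, "bar sits at +30h");
static_assert(slot_index(0x30u) == 12u, "+30h is the 13th slot (index 12)");
```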

My suspicion turned out to be correct. The team that had the symbols was able to look up the vtbl, and confirmed that our software contracts were not in agreement.

Besides debugging skills, what can we learn from this about software principles and design? When your software interacts with external code, make sure you either have a contract that doesn't change or have a solid protocol for negotiating the differences.
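
As one possible shape for such a protocol, here is a minimal sketch in which the provider hands over a function table that starts with a version and a size, and the caller checks both before touching any entry. The struct, its fields and the version number are all hypothetical; this is not how the code in this story works, just an illustration of the idea.

```cpp
#include <cstddef>
#include <cstdint>

using BarFn = long (*)(unsigned long, unsigned long);

// Hypothetical function table exported by the provider. The two header
// fields come first and must never move between revisions.
struct ProviderTable {
    std::uint32_t version;  // contract revision the provider implements
    std::uint32_t size;     // size in bytes of the table the provider filled in
    BarFn bar;              // assumed to have been added in revision 2
};

constexpr std::uint32_t kVersionWithBar = 2;

// Calls bar only if the provider's table is new enough and large enough to
// contain it. Returns false (leaving result untouched) otherwise.
inline bool negotiate_and_call_bar(const ProviderTable* table,
                                   unsigned long a, unsigned long b, long& result) {
    if (table == nullptr) return false;
    if (table->version < kVersionWithBar) return false;
    if (table->size < offsetof(ProviderTable, bar) + sizeof(BarFn)) return false;
    if (table->bar == nullptr) return false;
    result = table->bar(a, b);
    return true;
}

// Stand-in implementation used to exercise the check.
inline long sample_bar(unsigned long a, unsigned long b) {
    return static_cast<long>(a * b);
}
```

A caller that gets false back can fall back to an older code path or fail with a clear error, instead of jumping through a slot that the other side never filled in.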