I was asked to debug some code where we roughly knew what was going on in the source code, but we didn’t have access to the symbols. This gave me a good chance to dust off some old ASM knowledge, and work on the art of debugging without symbols. It's not an ideal situation, but sometimes it is all you have to work with.
In order to protect some internal stuff, the details here have been altered to represent the core problem without disclosing sensitive material.
This code ended up producing an Access Violation exception. A null pointer was being dereferenced:
hr = foo->bar(value1, static_cast<unsigned long>(value2));
When I first saw the code, I assumed the foo variable was nullptr. However, it was a valid pointer. Where was the nullptr dereference being made?
The faulting call stack terminated with the following frames:
0072dda0 100260f5 00b8737c 5e9365d8 00000091 0x0
0072dddc 10027f57 51be7d60 5197d7b8 5197d788 foo!baz+0x30e5
0072ddfc 100267de 00000000 00000000 00000000 foo!baz+0x4f47
0072de2c 10025c2e 0072de7c 0072de58 1d6507fc foo!baz+0x37ce
0072de98 10025951 00b8737c 0072dec0 0072e7e8 foo!baz+0x2c1e
0072e6c4 555fb5fb 008189f8 00b8737c 0072e914 foo!baz+0x2941
Using the WinDbg ‘ub’ command, I dumped out the x86 ASM that executed up to the point of the exception:
mov eax,dword ptr [ebp-1Ch]
push dword ptr [ebp-0Ch]
push esi
mov eax,dword ptr [eax+4]
push eax
mov ecx,dword ptr [eax]
mov esi,dword ptr [ecx+30h]
mov ecx,esi
call esi // This is calling into the ‘bar’ method
Using the ChildEBP for the frame, the annotated instruction flow looks like:
// ebp = 0072dddc (from faulting call stack)
mov eax,dword ptr [ebp-1Ch] // eax = *(0072dddc - 0x1c) = *(0072ddc0) = 0072de7c = this
push dword ptr [ebp-0Ch] // push *(0072dddc - 0xc) = *(0072ddd0) = 00000091 = value2
push esi // push esi = 0x5e9365d8 = value1
mov eax,dword ptr [eax+4] // eax = *(eax+4) = *(0072de7c+4) = *(0072de80) = 00b8737c = this->foo
// where: *this->foo = blah_555e0000!xyz+0x135fe
push eax // push eax = 00b8737c
mov ecx,dword ptr [eax] // ecx = *(eax) = *(00b8737c) = 55604874 = blah_555e0000!xyz+0x135fe
mov esi,dword ptr [ecx+30h] // esi = *(55604874+0x30) = *(556048a4) = 00000000
mov ecx,esi
call esi
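To double-check the address arithmetic in the annotations, the four loads can be replayed over a tiny model of memory. This is only a sketch: the `Memory` alias and the `memory_from_dump` and `replay_call_target` helpers are hypothetical names, and the map holds just the dwords quoted in the listing above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Sparse model of the 32-bit address space: only the dwords that the
// instruction sequence reads, with values copied from the annotated listing.
using Memory = std::map<std::uint32_t, std::uint32_t>;

inline Memory memory_from_dump() {
    return {
        {0x0072ddc0u, 0x0072de7cu},  // [ebp-1Ch]  : this
        {0x0072de80u, 0x00b8737cu},  // [this+4]   : this->foo
        {0x00b8737cu, 0x55604874u},  // [foo]      : vtbl pointer
        {0x556048a4u, 0x00000000u},  // [vtbl+30h] : slot read for 'bar'
    };
}

// Replays the loads and returns the final value of esi, the call target.
// Memory::at throws std::out_of_range if an address is not in the model.
inline std::uint32_t replay_call_target(const Memory& mem, std::uint32_t ebp) {
    assert(ebp >= 0x1c);                     // guard against unsigned wrap-around
    std::uint32_t eax = mem.at(ebp - 0x1c);  // mov eax,dword ptr [ebp-1Ch]
    eax = mem.at(eax + 4);                   // mov eax,dword ptr [eax+4]
    std::uint32_t ecx = mem.at(eax);         // mov ecx,dword ptr [eax]
    std::uint32_t esi = mem.at(ecx + 0x30);  // mov esi,dword ptr [ecx+30h]
    return esi;                              // call esi jumps here
}
```

Calling `replay_call_target(memory_from_dump(), 0x0072dddc)` with the ChildEBP of the faulting frame follows the same chain as the annotations and ends at the zero stored at 556048a4.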
The stack looked like it was set up correctly:
0:000> dd 0072dda0
0072dda0 00000091 100260f5 00b8737c 5e9365d8 // blah_555e0000!xyz+0x135fe=5e9365d8
0072ddb0 00000091 5197d7b8 5197d788 00000000 // value2=0x91, value1=0x5197d7b8
0072ddc0 0072de7c 5e9365d8 51be7df8 00000000
0072ddd0 00000091 00000000 1d650460 0072ddfc
0072dde0 10027f57 51be7d60 5197d7b8 5197d788
0072ddf0 057684b8 00000000 00b873b0 0072de2c
0072de00 100267de 00000000 00000000 00000000
0072de10 00b83538 5197d788 0072de60 05750048
This code dereferences an entry in the vtbl, so I suspected that the two sides of the software did not agree on the interface contract. When I ran my local build, where I know both sides agree on the contract, the same instruction instead set esi to the entry point of blah!bar (the class implementing the interface comes from the ‘blah’ module).
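The exact nature of the disagreement is not something I can show here, but one hypothetical way it can produce a null call target is sketched below with a hand-rolled function-pointer table standing in for the real vtbl. Everything in it is an assumption made for illustration: the `Method` signature, the 16-slot `Vtbl`, and the idea that the provider placed ‘bar’ one slot earlier than the caller expects.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical signature; the real 'bar' signature is not reproduced here.
using Method = long (*)(void* self, unsigned long value1, unsigned long value2);

// Stand-in for a vtbl: 16 slots, all nullptr until a provider fills them.
struct Vtbl {
    std::array<Method, 16> slots{};
};

// On 32-bit x86 each slot is 4 bytes, so byte offset 0x30 is slot index 12.
constexpr std::size_t kPointerSize32 = 4;
constexpr std::size_t kBarSlot = 0x30 / kPointerSize32;
static_assert(kBarSlot == 12, "offset 0x30 is the 13th entry");

inline long bar_impl(void*, unsigned long value1, unsigned long value2) {
    return static_cast<long>(value1 + value2);
}

// Provider built against the same contract as the caller: 'bar' is in slot 12.
inline Vtbl make_matching_vtbl() {
    Vtbl v;
    v.slots[kBarSlot] = &bar_impl;
    return v;
}

// Provider built against a different (hypothetical) contract: 'bar' sits one
// slot earlier, and slot 12 is never filled.
inline Vtbl make_mismatched_vtbl() {
    Vtbl v;
    v.slots[kBarSlot - 1] = &bar_impl;
    return v;
}

// What the caller's compiled code effectively does: load slot 12.
inline Method load_bar(const Vtbl& v) {
    return v.slots[kBarSlot];
}

// Calls 'bar' through slot 12; the assert marks where the real code crashed.
inline long call_bar(const Vtbl& v, unsigned long value1, unsigned long value2) {
    Method m = load_bar(v);
    assert(m != nullptr && "slot 12 is empty: caller and provider disagree");
    return m(nullptr, value1, value2);
}
```

With the matching table, `load_bar` returns the address of `bar_impl`; with the mismatched one it returns `nullptr`, which is the same shape of failure as the `call esi` with esi set to zero.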
Using the WinDbg ‘dds’ command, I was able to dump out the symbolic vtbl for the class. This confirmed that the method in question was indeed at offset 0x30 from the base address:
0:000> dds 00224e28
// … extraneous data omitted
00224e58 00240310 blah!bar
My suspicion turned out to be correct. The team that had the symbols was able to look up the vtbl, and confirmed that our software contracts were not in agreement.
Besides debugging skills, what can we learn from this about software principles and design? When your software interacts with external code, make sure you either have a contract that doesn't change or have a solid protocol for negotiating the differences.
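As a rough illustration of what "negotiating the differences" could look like, here is a minimal sketch in which the provider's function table carries the contract version it was built against, and the caller checks it before calling through. The `ApiTable` layout, `kMinSupportedVersion`, and the function names are all hypothetical; a real interface might use a different mechanism entirely.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical contract: every table starts with the version it was built for.
struct ApiTable {
    std::uint32_t version;                                   // provider's contract version
    long (*bar)(unsigned long value1, unsigned long value2); // nullptr if unsupported
};

// Assumed for illustration: the oldest contract version this caller understands.
constexpr std::uint32_t kMinSupportedVersion = 2;

inline long bar_v2(unsigned long value1, unsigned long value2) {
    return static_cast<long>(value1 + value2);
}

// Calls 'bar' only when both sides agree on the contract. Returns false,
// leaving 'result' untouched, instead of jumping through a null pointer.
inline bool try_call_bar(const ApiTable& api, unsigned long value1,
                         unsigned long value2, long& result) {
    if (api.version < kMinSupportedVersion || api.bar == nullptr) {
        return false;
    }
    result = api.bar(value1, value2);
    return true;
}

// Usage example: a current provider succeeds, an outdated one is rejected.
inline void negotiation_demo() {
    long result = 0;

    ApiTable current{2, &bar_v2};
    assert(try_call_bar(current, 1, 2, result));
    assert(result == 3);

    ApiTable outdated{1, nullptr};
    assert(!try_call_bar(outdated, 1, 2, result));
}
```

The point of the check is that a disagreement shows up as an ordinary failure the caller can report, rather than as an Access Violation that has to be untangled from a symbol-less dump.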