Toolbar Compatibility Debugging Walkthrough

In the past I’ve found debugging walkthroughs useful for picking up new techniques. In that spirit, here’s a quick rundown of a bug I was investigating today that may have some useful tidbits.

This was a crash in IE that involved a toolbar that I didn’t have the source code for. The issue was that if you clicked in the toolbar’s edit box and later closed the browser, IE would crash.

It was crashing while trying to call Release() on a pointer to the toolbar, and initially it looked like a reference counting issue, either in IE or in the toolbar code itself. This type of bug can be tricky to track down even in your own code, so given that this one straddled legacy IE code and external toolbar code, I closed my door and prepared for the worst. :-)

I started out by turning on full pageheap using gflags.exe, which ships with Debugging Tools for Windows, and repro’d the bug. This was to ensure that the crash wasn't a side-effect of heap corruption and that I was debugging the right thing.

Next I wanted a breakpoint on the toolbar’s Release(). Since I didn't have the source, I had to track it down manually:

0:005> kP 1
ChildEBP RetAddr
01eaeab0 0074f7fb xxxxx!xxxx::_xxxxxxx(
                        struct IUnknown * ptb = 0x020f3940)
0:005> dds 0x020f3940
020f3940 10031b44 toolbar!DllMain+0x27d24
020f3944 10031b2c toolbar!DllMain+0x27d0c
020f3948 10031b18 toolbar!DllMain+0x27cf8
020f394c 10031af8 toolbar!DllMain+0x27cd8
020f3950 10031ad8 toolbar!DllMain+0x27cb8
020f3954 10031f50 toolbar!DllMain+0x28130
020f3958 00000003
  [...]

0:005> dds 10031b44
10031b44 1000cc90 toolbar!DllMain+0x2e70
10031b48 1000cdd0 toolbar!DllMain+0x2fb0
10031b4c 1000cdf0 toolbar!DllMain+0x2fd0
10031b50 1000ce20 toolbar!DllMain+0x3000
  [...]

I could have also unassembled the code and traced the logic, but I've found that it's often faster to just use "dds" to dump interesting-looking addresses. "dds" is especially useful for dumping the stack when symbols are incomplete (or the stack is corrupt) and for tracking down objects in optimized builds, where the debugger gets confused. (When you have symbols and dump an address, it will be immediately obvious from the vtable whether you're looking at the right object.)
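Roughly speaking, dds prints each pointer-sized value and, when that value points into a loaded module, labels it with the nearest symbol at or below it plus an offset. Without a PDB the debugger presumably has only a few names for the toolbar, and DllMain (at 10009e20, which is 10031b44 minus 0x27d24) appears to be the highest-addressed one, so nearly everything in the DLL shows up as DllMain+something. The C++ sketch below models that lookup. The Module type and Annotate() are hypothetical, the module end address used in the example is made up, and the debugger's real logic is certainly more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical model of a loaded module as the debugger might see it without
// a PDB: an address range plus a few named addresses (e.g. exports).
struct Module {
    std::uint32_t base;                            // load address
    std::uint32_t end;                             // one past the last byte
    std::map<std::uint32_t, std::string> symbols;  // address -> name
};

// Annotate a DWORD roughly the way dds does: if the value points into the
// module, name it relative to the nearest symbol at or below it.
std::string Annotate(const Module& m, std::uint32_t value) {
    assert(m.base < m.end);
    if (value < m.base || value >= m.end) return "";  // not inside this module
    auto it = m.symbols.upper_bound(value);           // first symbol above value
    if (it == m.symbols.begin()) return "";           // no symbol at or below
    --it;                                             // nearest symbol at or below
    std::uint32_t offset = value - it->first;
    if (offset == 0) return it->second;
    char buf[16];
    std::snprintf(buf, sizeof buf, "+0x%x", static_cast<unsigned>(offset));
    return it->second + buf;
}
```

Fed just the two toolbar symbols visible in this session (DllCanUnloadNow at 10008fd0 and DllMain at 10009e20), the sketch reproduces annotations such as toolbar!DllMain+0x27d24 for 10031b44 and toolbar!DllCanUnloadNow+0xd for 10008fdd. The names are misleading, but the offsets are still perfectly good for setting breakpoints.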

The IUnknown interface has three methods: QueryInterface(), AddRef(), and Release(), in that order, and every COM interface's vtable begins with them. Given the vtable dump, I assumed the third entry, toolbar!DllMain+0x2fd0, was Release(), and confirmed it by unassembling the function. It looked right, so I put a breakpoint on its ret instruction:

0:005> u toolbar!DllMain+0x2fd0
  [...]
1000ce15 8b06 mov eax,[esi]
1000ce17 5e pop esi
1000ce18 c20400 ret 0x4
1000ce1b cc int 3
0:005> bp 1000ce18

Then I re-ran the repro. For brevity I’ve left out many of the calls and removed redundant output. ‘eax’ holds the return value of Release(), so you can see the count winding down to the final Release(), at which point the object will delete itself.
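To make the vtable reasoning and the eax values concrete, here's a minimal sketch of a C-style COM object (the binary layout COM defines, written out explicitly). DemoObject, CreateDemoObject(), and g_liveObjects are hypothetical, QueryInterface is stubbed out, and the Windows types are replaced with portable stand-ins. The object starts with a pointer to its vtable, and IUnknown's methods fill the first three slots in order, so on 32-bit x86 Release() sits 8 bytes in: that's 10031b4c in the dump, which held 1000cdf0 (toolbar!DllMain+0x2fd0). Release() returns the new count, which is what eax holds at the ret, and the final Release() frees the object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-ins for the Windows typedefs (assumption: no <windows.h>).
using HRESULT = std::int32_t;
using ULONG = std::uint32_t;
constexpr HRESULT E_NOINTERFACE = static_cast<HRESULT>(0x80004002u);

struct IUnknown;

// IUnknown's methods always occupy the first three vtable slots, in this order.
struct IUnknownVtbl {
    HRESULT (*QueryInterface)(IUnknown* self, const void* iid, void** ppv);  // slot 0
    ULONG (*AddRef)(IUnknown* self);                                         // slot 1
    ULONG (*Release)(IUnknown* self);                                        // slot 2
};

// A COM object begins with a pointer to its vtable; that is the first DWORD
// dds showed at the object's address.
struct IUnknown {
    const IUnknownVtbl* lpVtbl;
};

// Hypothetical object, for illustration only.
struct DemoObject {
    IUnknown iface;  // must be the first member
    ULONG refs;
};

int g_liveObjects = 0;  // lets the example observe the final delete

HRESULT DemoQueryInterface(IUnknown*, const void*, void** ppv) {
    *ppv = nullptr;  // QueryInterface is left out of this sketch
    return E_NOINTERFACE;
}

ULONG DemoAddRef(IUnknown* self) {
    return ++reinterpret_cast<DemoObject*>(self)->refs;
}

// Returns the new count (the value seen in eax at the ret); the final Release
// frees the object. Real code would use InterlockedDecrement.
ULONG DemoRelease(IUnknown* self) {
    DemoObject* obj = reinterpret_cast<DemoObject*>(self);
    ULONG remaining = --obj->refs;
    if (remaining == 0) {
        delete obj;  // don't touch obj after this point
        --g_liveObjects;
    }
    return remaining;
}

const IUnknownVtbl kDemoVtbl = {DemoQueryInterface, DemoAddRef, DemoRelease};

IUnknown* CreateDemoObject() {
    ++g_liveObjects;
    return &(new DemoObject{{&kDemoVtbl}, 1})->iface;  // caller owns one ref
}
```

The documentation says Release()'s return value is intended only for test purposes, so real code shouldn't rely on it, but it's exactly what makes the countdown easy to follow in the debugger: eax goes 4, 3, 2, 1.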

0:005> g
Breakpoint 1 hit
eax=00000004 ebx=020f39ec ecx=020f395c edx=00000850 esi=020f3d84 edi=00000000
eip=1000ce18 esp=01eaf4cc ebp=00000000 iopl=0 nv up ei pl nz na pe cy
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000203
toolbar!DllMain+0x2ff8:
1000ce18 c20400 ret 0x4
0:005> g
Breakpoint 1 hit
eax=00000003 ebx=014ab010 ecx=020f395c edx=10031b18 esi=0224000a edi=00000008
0:005> g
Breakpoint 1 hit
eax=00000002 ebx=014ab010 ecx=020f395c edx=10031b44 esi=0224000a edi=00000008
0:005> g
wn IEFRAME CDocObjectView::DestroyViewWindow(): Destroying Host Window
Breakpoint 1 hit
eax=00000001 ebx=00000000 ecx=020f395c edx=00803e30 esi=020f3d04 edi=020f9328
0:005> g
Unable to remove breakpoint 1 at 1000ce18, Win32 error 487
    "Attempt to access invalid address."
The breakpoint was set with BP. If you want breakpoints
to track module load/unload state you must use BU.
(564.db0): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
Unable to remove breakpoint 1 at 1000ce18, Win32 error 487
    "Attempt to access invalid address."
The breakpoint was set with BP. If you want breakpoints
to track module load/unload state you must use BU.

Ah ha! This wasn’t what I was looking for, but you can see that before we do the final release (or crash), the debugger complains that a breakpoint is set in a module that has been unloaded. The crash happens shortly after this and is simply caused by trying to call into the module after it’s been unloaded.

So why was it unloaded? Let’s have the debugger break when the module unloads, re-run the repro, and find out:

0:005> sxe ud:toolbar
0:005> g
[...]
Breakpoint 1 hit
eax=00000001 ebx=00000000 ecx=020f395c edx=00803e30 esi=020f3d04 edi=020f9328
eip=1000ce18 esp=01eafa98 ebp=00000000 iopl=0 nv up ei pl nz na pe cy
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000203
toolbar!DllMain+0x2ff8:
1000ce18 c20400 ret 0x4

0:005> k
*** ERROR: Symbol file could not be found. Defaulted to export symbols for C:\WINDOWS\system32\kernel32.dll -
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
01eafcf4 7c80aa7f ntdll!KiFastSystemCallRet
*** ERROR: Symbol file could not be found. Defaulted to export symbols for C:\WINDOWS\system32\ole32.dll -
01eafd08 77513442 kernel32!FreeLibrary+0x19
01eafd14 77513456 ole32!CoFreeUnusedLibraries+0xa9
01eafeb8 77513578 ole32!CoFreeUnusedLibraries+0xbd
01eafec8 775133a2 ole32!CoFreeUnusedLibrariesEx+0x2e
01eafeec 007ab40f ole32!CoFreeUnusedLibraries+0x9
01eaffb4 7c80b50b xxxxx!xxx::_xxxxxxxx+0x3af

It’s being unloaded because IE’s code calls CoFreeUnusedLibrariesEx() when the window is closed. This is code I'm not super-familiar with, but I presume we’re doing it to trigger the unloading of DLLs for BHOs, toolbars, ActiveX controls, and so on, to free up memory. However, we still hold properly reference-counted pointers to the toolbar, so it shouldn’t be unloading yet.

According to MSDN, CoFreeUnusedLibrariesEx() calls the DLL's DllCanUnloadNow(), which is supposed to return S_FALSE if the DLL is not yet ready to be unloaded (and S_OK if it is). Let’s set a breakpoint, step through the function, and see what it returns in this scenario:

0:005> bp toolbar!DllCanUnloadNow
0:005> g
  [...]
Breakpoint 0 hit
eax=00000000 ebx=00000001 ecx=77606074 edx=00000000 esi=014a6dd0 edi=77606068
eip=10008fd0 esp=01eafd20 ebp=01eafd30 iopl=0 nv up ei pl zr na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
toolbar!DllCanUnloadNow:
10008fd0 8b0d3c1c0410 mov ecx,[toolbar!DllMain+0x37e1c (10041c3c)] ds:0023:10041c3c=00000000
0:005> p
[...]
0:005> p
eax=00000000 ebx=00000001 ecx=00000000 edx=00000000 esi=014a6dd0 edi=77606068
eip=10008fdd esp=01eafd20 ebp=01eafd30 iopl=0 nv up ei pl zr na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
toolbar!DllCanUnloadNow+0xd:
10008fdd c3 ret

As you can see from ‘eax’, the function returns 0, which is S_OK: the toolbar is telling COM it's safe to unload even though IE still holds a reference to one of its objects. I believe this is the cause of the bug.
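For comparison, here's a minimal sketch of what DllCanUnloadNow() is expected to do, assuming the common pattern of a module-wide count that every live object (and every outstanding IClassFactory::LockServer(TRUE) lock) holds. g_moduleRefs and ToolbarObject are hypothetical names, the Windows types are portable stand-ins, and the real toolbar's implementation is unknown.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Portable stand-ins for the Windows definitions (assumption: no <windows.h>).
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;     // "safe to unload"
constexpr HRESULT S_FALSE = 1;  // "keep me loaded"

// Hypothetical module-wide count: one per live object handed out by the DLL,
// plus one per outstanding IClassFactory::LockServer(TRUE).
std::atomic<long> g_moduleRefs{0};

// The real export is declared STDAPI DllCanUnloadNow(void); this sketch drops
// the calling convention so it compiles anywhere.
HRESULT DllCanUnloadNow() {
    return g_moduleRefs.load() == 0 ? S_OK : S_FALSE;
}

// Hypothetical toolbar object: it pins the module for its whole lifetime, so
// DllCanUnloadNow() cannot return S_OK while a reference is outstanding.
class ToolbarObject {
public:
    ToolbarObject() { ++g_moduleRefs; }
    ~ToolbarObject() {
        long before = g_moduleRefs.fetch_sub(1);
        assert(before > 0 && "module count underflow");
        (void)before;  // avoid an unused-variable warning when NDEBUG is set
    }
    ToolbarObject(const ToolbarObject&) = delete;  // copies would double-count
    ToolbarObject& operator=(const ToolbarObject&) = delete;
};
```

If the toolbar followed this pattern, the reference IE still holds would keep the count nonzero, and the DllCanUnloadNow() call traced above would have returned S_FALSE.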

After reviewing recent changes I had made in IE7, I found that one of them caused us to hold onto the toolbar object longer than we used to. In previous versions of IE we happened to always do the final Release() before calling CoFreeUnusedLibrariesEx(), which masked the bug in the toolbar. The fix in this case, for better or worse, was to update our code so that it releases the toolbar earlier, as it used to.
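To make the masking effect concrete, here's a toy model. Every name in it is hypothetical, FreeUnusedLibraries() is a much-simplified stand-in for CoFreeUnusedLibrariesEx(), and the toolbar's DllCanUnloadNow() is modeled as ignoring live objects entirely, which is one plausible explanation for the S_OK we saw. Whether the host crashes then depends only on whether its final Release() happens before or after the unload.

```cpp
#include <cassert>
#include <cstdint>

using HRESULT = std::int32_t;  // portable stand-ins (assumption)
constexpr HRESULT S_OK = 0;
constexpr HRESULT S_FALSE = 1;

// Toy model of the toolbar DLL; every name here is hypothetical.
struct FakeDll {
    bool loaded = true;
    long liveObjects = 0;
    bool countsObjects = true;  // false models the toolbar's behavior

    HRESULT CanUnloadNow() const {
        if (!countsObjects) return S_OK;           // ignores live objects
        return liveObjects == 0 ? S_OK : S_FALSE;  // what COM expects
    }
};

// Much-simplified stand-in for CoFreeUnusedLibrariesEx: unload the DLL
// immediately if it says it is safe to do so.
void FreeUnusedLibraries(FakeDll& dll) {
    if (dll.loaded && dll.CanUnloadNow() == S_OK) dll.loaded = false;
}

// The host holds one toolbar reference at shutdown. Returns true if the final
// Release() ran while the DLL was still loaded, false if it would have called
// into unloaded code (the crash).
bool Shutdown(FakeDll& dll, bool releaseBeforeFree) {
    assert(dll.loaded);
    dll.liveObjects = 1;
    auto finalRelease = [&dll]() {
        if (!dll.loaded) return false;  // calling into an unloaded module
        --dll.liveObjects;
        return true;
    };
    if (releaseBeforeFree) {  // the ordering older IE versions happened to use
        bool ok = finalRelease();
        FreeUnusedLibraries(dll);
        return ok;
    }
    FreeUnusedLibraries(dll);  // the ordering after the IE7 change
    return finalRelease();
}
```

Under the old ordering the misbehaving DLL unloads harmlessly after its last object is gone; under the new ordering the same DLL is unloaded out from under a live reference. A DLL that counts its objects survives either order.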

Thoughts? Are these types of walkthroughs interesting or useful? If so I’ll do more of them.