Hello my name is
One of the many pleasures I get from doing my job is the ability to diagnose a problem and discover its root cause. I love digging into things as deep as I can until the asm is staring me in the face exposing those nasty little bugs. Sometimes going that deep isn’t necessary, sometimes you have to get creative about how to diagnose an issue. This is especially the case when you’re dealing with complex multi-threading bugs.
One particular bug I looked into turned into a three day investigation. The net effect of the bug looked like this:
1. Open devenv
2. Load a solution that contains two C++ projects
3. Go to the Options dialog and set the “maximum number of parallel project builds” to 2 (I think this is the default if you’ve installed VS on a dual-processor machine)
4. Build the solution
Result is an Access Violation at VCProjectEngine.dll!CBldFileRegEntry::ReleaseFRHRef()
That’s right, the IDE just CRASHES! The worst possible behavior a customer could encounter (aside from data loss). And even better, this crash is not consistently reproducible; it might happen in 1 out of 100 times or not at all.
So the typical process is to get exact repro steps, a full callstack when the AV hits and to attach a dump to the bug. After some quick searching in our bug database I found a bug with a similar callstack logged by someone outside our team about one month prior. I was a bit concerned because this bug was resolved as No Repro and included full dump info!
Since I was encountering this problem on my multi-proc machine I decided to take a deeper look. First of all the callstack up to the AV looked a bit weird. Notice the function at the top of the stack.
Doesn’t resolve to symbol name and sure enough a check in the memory window shows this address is pointing into outer space. Here is the relevant source code at this point:
// Prevent access to registry by other threads.
if (m_nRefCount <= 0)
if (m_nRefCount == 0)
The AV occurred when the SafeDelete function was called. I hit this bug once and left the debug session open on my machine for the whole three days. Given the working set tax of debugging this was slowing down my machine for my normal day-to-day duties, but I had to leave it just in-case I couldn’t repro it again.
Drilling into this mysterious AV at a callsite, I thought it was bad code gen, here is the relevant asm for the ReleaseFRHRef function:
if (m_nRefCount <= 0)
5B00B14A mov eax,dword ptr [edi+18h]
5B00B14D test eax,eax
5B00B14F jbe CBldFileRegEntry::ReleaseFRHRef+32h (5B00B15Eh)
5B00B151 dec eax
5B00B152 mov dword ptr [edi+18h],eax
if (m_nRefCount == 0)
5B00B155 jne CBldFileRegEntry::ReleaseFRHRef+32h (5B00B15Eh)
5B00B157 mov eax,dword ptr [edi]
5B00B159 mov ecx,edi
5B00B15B call dword ptr [eax+8]
The last three instructions is the compiler generated code for a call through a vtable. Sure enough after debugging the code SafeDelete is the first virtual function call that we are making since we started executing code in this class. EDI is the this pointer, dereferencing that gives you the entry into the vtable. An offset of 8 on the vtable should give you back the address of the SafeDelete virtual function (since it’s the third vfunction). But as we can see indirecting that address gives the access violation. So someone trashed the contents of our object including the vtable.
Now the question is who trashed our object? This was beginning to smell like a ref counting problem. I started to look at how the CBldFileRegEntry class was being used by looking at its implementation and where it is referenced. This class was retro-fitted to allow for simultaneous access through different threads. This was done through a shared reference or handle where uses of this handle were protected by Critical Sections sprinkled throughout the code.
So all of this cool functionality but with random crashes? I had to take a deeper look at what was going on. The ref counting theory was beginning to make sense. If one thread had decremented the reference count too early another thread might think there are no references left to the shared object and delete it. While another thread is still using the object – and the next access to the object’s contents results in an AV right in the middle of the code. The fact that we’re seeing an AV is a good thing – Visual C++ 2005 will dirty the memory if you delete an object (filling it with sequences of 0xfeeefeee) imagine if that wasn’t the case – you wouldn’t see any crash just random incorrect behavior.
There wasn’t much you could do with the information in the bug – if it was not easily reproducible you could not rerun the scenario and walk through the code. And with the dump that was attached you could only see the *end* result – the thread that did the damage is long gone.
So I took a different approach and extracted the relevant pieces of the code into my own simulation. This was easy since the class was pretty self-contained, any references or dependencies to things that weren’t important I removed. Then I created a housing for the code – lifted straight off this MSDN article. It was a simple app that created threads through the CreateThread() Win32 API. I set the number of threads to 30. And in my ThreadProc I added the code that accesses the shared resource the same way as the CBldFileRegEntry code. Sure enough I hit the AV like once every other run. So now I had a very consistent repro!
With this little setup I was able to start closing in on the bug by using a binary search technique. I put a big Critical Section block around the first half of the code and left the rest as is. I ran the scenario and it didn’t repro (since that whole piece of code was blocking all threads!). Since it didn’t hit I tried the second half and it started to repro. Now that I had the right half I continued the process until I was down to a function level. Here was the suspect function:
BldFileRegHandle CBldFileRegFile::GetFileHandle(LPCOLESTR szFileName, BOOL bVerifyCase)