Fox TV here in the US has a show called “House”. Valorie and I started watching it sometime towards the end of the 2nd season; the 3rd season started last week. House stars Hugh Laurie as a genius, drug-addicted, lame doctor who, with his brilliant associates, finds the root cause of impossibly complicated diseases.
Each episode starts with someone arriving at the hospital with some mysterious ailment, and House and his impossibly pretty team go to work trying to diagnose the person’s problem. They almost always succeed and the patient goes home cured (with several notable exceptions).
Last week, I realized that aspects of my life are very similar to House’s (without the drug addiction, the handicap, and the crazy-good-looking sidekicks part; sorry folks, but nobody on the audio team quite matches House’s team, especially me :)). I’m also not the boss of the team, just a peon. One of the hallmarks of the show is that they perform a “differential diagnosis”, a diagnosis based on the symptoms of the disease. Their original diagnosis is almost always wrong, but they eventually find the root cause.
But there’s so much of my life that works like a House episode. Take last week.
One of the people on my team was looking at the Vista RC1 OCA information and noticed that we had a single crash bucket that had a significant number of hits in one of our components.
I took a look at the crash dump and immediately diagnosed a concurrency issue. I worked up a fix based on the call stack of the crash (by default OCA crash dumps contain the call stacks for the threads in the process and the registers and not too much more), and I was done. Nothing out of the ordinary.
I built the fix, verified it on my machine and started the checkin process (there are a number of steps that have to be taken for any checkin, including code reviews, test signoff, etc).
Unfortunately, I had this nagging feeling about my fix: the call stack didn’t have quite enough information to completely diagnose the problem. My fix would explain the crash, but if the problem was the one I thought it was, I would have expected side effects. Things didn’t quite add up (the doctor’s original diagnosis was wrong; the patient should have had other symptoms).
So I went and I asked the internal OCA web site to collect more information from our customers – I wanted a more detailed version of the crash dump that contained the contents of the heap (the doctors asked for more tests to be performed).
It didn’t take long (a day or so) for a couple of new occurrences of the crash to be reported with the heap dumps. With the new info, I was quite surprised by what I saw (the new tests that the doctor ordered showed some data that both confirmed and disputed the diagnosis). The crash was occurring in code that looked like the following:
for (i = 0 ; i < class->cElements ; i += 1)
    x = class->_ValueArray[i];
The crash was occurring when accessing _ValueArray. The code was:
mov ecx, [esi]+24
mov eax, [ecx]
The crash was occurring at the mov eax instruction; ecx was 0. When I got the heap dumps, I saw that class->cElements was 8, and _ValueArray pointed to valid memory! I looked at the code: the _ValueArray value was located 24 bytes from the start of the class, so the problem wasn’t some weird compiler issue. There was no question that the value was 0 at the time of the crash, but apparently the memory pointed to by ESI wasn’t 0 (the test results were inconclusive: they didn’t rule out the original diagnosis, but they didn’t confirm it).
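To make the offset check concrete, here is a minimal C++ sketch of reading a field the same way the first instruction does: take a base address and load the 32-bit value 24 bytes past it. The struct name ElementTable, its header placeholder, the field name valueArray, and the helper load_dword are all invented for illustration. Only the 24-byte offset comes from the description above; the real class layout is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 32-bit layout. Only the offset of 24 is taken from the post;
// the members before it are placeholders for whatever the real class holds.
struct ElementTable {
    std::uint32_t header[5];   // bytes 0..19: stand-in for unknown members
    std::uint32_t cElements;   // bytes 20..23: element count
    std::uint32_t valueArray;  // bytes 24..27: a 32-bit pointer kept as an integer
};

// The compiler agrees that the field sits 24 bytes from the start.
static_assert(offsetof(ElementTable, valueArray) == 24,
              "valueArray is expected at offset 24 in this sketch");

// Emulates a load such as "mov ecx, [esi]+24": read 4 bytes at base + offset.
std::uint32_t load_dword(const void* base, std::size_t offset) {
    assert(base != nullptr);
    std::uint32_t value = 0;
    std::memcpy(&value, static_cast<const unsigned char*>(base) + offset,
                sizeof value);
    return value;
}
```

If the offset in the instruction matches the offset the compiler reports for the field, the load itself is reading the right member, which is the sense in which a compiler issue can be ruled out.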
So I went back for more information. One of the OCA options is to ask the customer to fill out a survey, which can be used to help diagnose the problem. I set up the crash bucket to ask the customers for a survey (the doctors went back and took a new version of the patient history).
Unfortunately, even with all this data, we still didn’t have confirmation that my original diagnosis was accurate (there was no additional information in the patient history). Bummer.
Fortunately, late on Thursday afternoon, I got an email from a tester in another part of the Windows organization. She had gotten this crash while running one particular series of tests and was wondering if anyone on our team wanted to look at it (the patient’s mother-in-law remembered something that was important).
It turns out that she had hit exactly the same bug that the customers had, and she had a live debugger attached to the machine, which meant that I could diagnose the problem directly. And on her machine, I saw the side effects I had expected to see in the crash dumps (the doctors eventually performed exploratory surgery and identified exactly the problem that was occurring, and saved the day).
I then talked to the guys who are responsible for the OCA reports. It turns out that the reason I didn’t see the expected side effects in the crash dumps was the other services that live in the same process as our service. Because of those other services, the process of generating OCA crash dumps doesn’t preserve the entire state of the process at the time of the crash: some threads continue to run after the crash occurred. So while the information for the current thread is completely accurate as of the time of the crash, other information in the process may not reflect the state of the process at crash time (the patient had another symptom that masked the expected side effects, complicating what would normally be a simple diagnosis).
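The skew between the crashing thread's state and the rest of the dump can be sketched in a few lines of C++. This is a single-threaded simulation, not the actual bug or the actual OCA mechanism: the "other thread" is played by an ordinary function call so the ordering is deterministic, and the names Elements, CrashDump, other_thread_continues, and simulate_late_dump are all hypothetical. It assumes, purely for illustration, that the shared pointer was momentarily null when the crashing thread read it.

```cpp
#include <cassert>

// Hypothetical shared object, loosely modeled on the loop shown earlier.
struct Elements {
    int  cElements;
    int* valueArray;
};

// What ends up in the simulated dump.
struct CrashDump {
    const int* faultingThreadValue;  // value the crashing thread had loaded
    const int* heapValue;            // value copied from the heap at dump time
};

static int g_storage[8] = {};

// Stand-in for a thread that keeps running after the crash and
// finishes publishing the array (an assumption for this sketch).
void other_thread_continues(Elements& shared) {
    shared.valueArray = g_storage;
}

CrashDump simulate_late_dump() {
    Elements shared{8, nullptr};  // assumed: pointer is momentarily null

    // 1. The crashing thread loads the pointer. Its registers are frozen
    //    with its context, so this value is accurate as of the crash.
    const int* loaded = shared.valueArray;

    // 2. Before the heap is captured, another thread makes progress.
    other_thread_continues(shared);
    assert(shared.valueArray == g_storage);

    // 3. The dump copies the heap as it looks now, not as it looked in step 1.
    return CrashDump{loaded, shared.valueArray};
}
```

In this sketch the dump reports a null value for the crashing thread alongside a valid pointer in the heap, which is the same kind of apparent contradiction as the one in the heap dumps described earlier.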
Yeah, I know the House analogy is a bit tortured, but it was all I could think of while I was looking at the problem – “Darn it, my diagnosis is good, I know I found a problem, but I can’t tell if it’s the root cause or not”.