Minor heart attacks with x64 and Multithreading

There are very few things that I am afraid of. It’s not that I am brave or heroic – I am just lazy. Being afraid takes energy and can potentially require me to take action (even if the action is running away). I am just too damn lazy for that.

The things that do scare me, however, are:

  1. Bears – Just search for the word in the page. You will see what I mean.
  2. Multithreading problems under stress.
  3. MFC

I don’t think I need to explain No. 1 and No. 3 – they are quite obvious. However, when code that I wrote or maintain starts hitting bugs in multithreading areas, I get a little scared. Okay, not a little. It scares the crap out of me. It’s worse on an x64 machine, where debugging is trickier than on a plain x86 one. That’s why, when I write such code, I spend 90% of the time designing it and badgering people to review it and find faults.

Yesterday I had a minor heart attack when some of my code seemed to be deadlocking. One of the sure signs of a deadlock is timing out on a lock that’s never supposed to time out. In most places in our code, we put a 30-minute timeout on our locks (we use the Monitor class for these), so that at the very minimum we know when we need to get rid of part of the system: the generated exception causes us to throw away some state while trying to keep the rest of the system healthy.
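For readers who have not seen the pattern, here is a minimal sketch of a lock with a timeout that raises instead of waiting forever. It is not the Excel Server code, which uses the .NET Monitor class; this is a Python analogue, and the names `TimedLock`, `LockTimeoutError` and `demo` are made up for illustration. The sketch also records how long the wait actually lasted, on the assumption that this number is useful to have in the logs.

```python
import threading
import time

LOCK_TIMEOUT_SECONDS = 30 * 60  # the 30-minute figure mentioned in the post


class LockTimeoutError(Exception):
    """Hypothetical exception: the lock was not acquired in time."""


class TimedLock:
    """Hypothetical wrapper: acquire with a timeout, raise on failure."""

    def __init__(self, name, timeout=LOCK_TIMEOUT_SECONDS):
        self.name = name
        self.timeout = timeout
        self._lock = threading.Lock()

    def __enter__(self):
        started = time.monotonic()
        if not self._lock.acquire(timeout=self.timeout):
            waited = time.monotonic() - started
            # Report both the real wait and the configured timeout, so a
            # mismatch between the two is visible to whoever reads the log.
            raise LockTimeoutError(
                f"lock {self.name!r} not acquired after {waited:.3f}s "
                f"(configured timeout {self.timeout}s)"
            )
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False  # never swallow exceptions from the guarded block


def demo():
    """Hold the lock on one thread and time out on another."""
    lock = TimedLock("shared-state", timeout=0.05)  # short, for the demo only
    holding = threading.Event()
    done = threading.Event()

    def hog():
        with lock:
            holding.set()
            done.wait()  # keep the lock until the main thread is finished

    worker = threading.Thread(target=hog)
    worker.start()
    holding.wait()
    try:
        with lock:
            return "acquired"
    except LockTimeoutError:
        # This is the point where the affected state would be discarded.
        return "timed out"
    finally:
        done.set()
        worker.join()
```

Calling `demo()` forces the timeout path by holding the lock on a second thread. In real use the `except` branch is where the suspect state would be dropped so that the rest of the system keeps running.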

Now, since this problem was discovered under stress, there’s very little one can do to figure out what’s going on, save adding debugging code and trying to get another repro (and debugging code is notorious for making deadlocks go away, or for changing the playing field enough that one cannot really know whether the problem one is seeing is the original problem).

I always knew just how important logs are, but only with Excel Server did I start relying on them for day-to-day debugging chores. As I was looking through the logs for the TimeoutException I seemed to be getting, I found it and my heart sank – it was indeed happening and, worse, it was originating in our code. There’s nothing worse than not having a scapegoat.

That’s where I suffered a minor heart attack. After getting feeling back in my left arm, I took a deeper look at the logs and noticed that the timing seemed to be off. As I mentioned, our timeout for locks is 30 minutes or so, yet the lock seemed to throw a TimeoutException after about 100ms. The pressure in my chest eased a little, but not by much.

On a hunch, I sent a question about this to one of our internal distribution lists. Ten minutes after I sent the email, a kind soul saved me from my internal turmoil – it was a known problem on 64-bit machines with the 64-bit CLR, and there was a QFE to fix it. I was saved. Hallelujah!