Minor heart attacks with x64 and Multithreading


There are very few things that I am afraid of. It’s not that I am brave or heroic – I am just lazy. Being afraid takes energy and can potentially require me to take action (even if the action is running away). I am just too damn lazy for that.


The things that do scare me, however, are:



  1. Bears – Just search for the word in the page. You will see what I mean.

  2. Multithreading problems under stress.

  3. MFC

 


I don’t think I need to explain No. 1 and No. 3 – they are quite obvious. However, when code that I wrote or maintain starts hitting bugs in multithreading areas, I get a little scared. Okay, not a little. It scares the crap out of me. It is worse on an x64 machine, where debugging is trickier than on a plain x86 one. That’s why, when I write such code, I spend 90% of the time designing it and badgering people to review it and find faults.


 


Yesterday I had a minor heart attack when some of my code seemed to be deadlocking. One of the sure signs of a deadlock is timing out on a lock that is never supposed to time out. In most places in our code that take locks (we use the Monitor class), we have a 30-minute timeout so that, at the very minimum, we will know we need to get rid of part of the system: the generated exception causes us to throw away some state while trying to keep the rest of the system healthy.
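The lock-with-timeout pattern described above can be sketched roughly as follows. This is an illustration in Python, not the actual code, which uses the .NET Monitor class. The names TimedLock and LockTimeoutError are hypothetical, and the timeout is shortened from 30 minutes to a fraction of a second so the example finishes quickly. The exception records how long the caller really waited, which makes a bogus early timeout easy to spot in a log.

```python
import threading
import time


class LockTimeoutError(Exception):
    """Hypothetical exception: the lock was not acquired within the timeout."""

    def __init__(self, waited, allowed):
        super().__init__(f"waited {waited:.3f}s, allowed {allowed:.3f}s")
        self.waited = waited    # seconds actually spent waiting
        self.allowed = allowed  # configured timeout in seconds


class TimedLock:
    """Hypothetical wrapper: acquire a lock or raise after a fixed timeout."""

    def __init__(self, timeout_seconds):
        self._lock = threading.Lock()
        self._timeout = timeout_seconds

    def __enter__(self):
        started = time.monotonic()
        # acquire() returns False if the timeout elapses first.
        if not self._lock.acquire(timeout=self._timeout):
            raise LockTimeoutError(time.monotonic() - started, self._timeout)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False  # never swallow exceptions from the guarded block


if __name__ == "__main__":
    lock = TimedLock(timeout_seconds=0.05)
    with lock:
        try:
            # The underlying lock is not reentrant, so a second acquire
            # while it is held stands in for a deadlocked waiter.
            with lock:
                pass
        except LockTimeoutError as err:
            print("timed out:", err)
```

A caller that catches the exception can discard whatever state the lock was guarding and carry on, and a logged wait time far below the configured timeout is a hint that the timeout itself misbehaved rather than that the code deadlocked.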


Now, since this problem was discovered under stress, there is very little one can do to figure out what is going on, save for adding debugging code and trying to get another repro. Debugging code is notorious for making deadlocks go away, or for changing the playing field enough that one cannot really know whether the problem one is seeing is the same problem.


I always knew just how important logs are, but only with Excel Server did I start relying on them for day-to-day debugging chores. As I was looking through the logs for the TimeoutException I seemed to be getting, I found it and my heart sank – it was indeed happening and, worse, it was originating in our code. There is nothing worse than not having a scapegoat.


That’s where I suffered a minor heart attack. After getting feeling back in my left arm, I took a deeper look at the logs and noticed that the timing seemed to be off. As I mentioned, our timeout for locks is 30 minutes or so. Yet the lock seemed to throw a TimeoutException after about 100ms. The pressure in my chest got a little better, but not by much.


On a hunch, I sent one of our internal distribution lists a question about this. Ten minutes after sending the email, a kind soul saved me from my internal turmoil – it was a known problem on 64-bit machines with the 64-bit CLR, and there was a QFE to fix it. I was saved. Hallelujah!

Comments (2)

  1. tzagotta says:

    Why does MFC scare you? Lots of MS development tool customers use that for their critical applications all day long.

  2. Shahar says:

    Every run-in I had with MFC resulted in heartache. More than any other technology. Granted, I stopped using it about 6 years ago, and since then MFC7 and 8 came out – so I cannot comment about them.

    Just to give a couple of very small examples:

    1. We had a relatively big app written in MFC back in 1997 (fraud detection on international #7 operators). The front-end was written in MFC and would communicate with the backend via sockets. We used CSocket to facilitate the communications on the frontend side. We ran into insane problems which were incredibly hard to debug. In the end, we realized that CSocket was, in some cases, bound to the thread on which it was created, and in some cases would use the TLS to grab information it needed. It worked most of the time, because the information happened to be the same on some threads. But it was not documented anywhere that the classes were bound to threads.

    2. In 1998, I started writing an MFC ActiveX that was used in a much larger application. I also wrote numerous ATL ActiveXs and we even had a VB6 control or two. Throughout the whole time, the MFC implementation was, by far, the one that gave us the most grief and was incredibly hard to debug (the VB one was harder to debug, but it also gave us way less trouble).

    I have had many other problems with it – I was using it heavily from ~1996 to 2000 or so – enough problems to make me actually afraid of it.

    Don’t get me wrong – for very simple UI apps I would not tell people not to use it, but for anything that is even marginally complex… No thanks. "A burnt child dreads the fire."

    When using ATL, for example, I can tell you that 1/2 my serious/unsolvable problems were due to me using the thing incorrectly, against the docs, and 1/2 my problems were due to ATL’s problems. With MFC, the percentages are much more on MFC’s side.

    On the bright side, MFC was much better written than OWL. 🙂 At least it did not use overloaded += and *= for building menu entries!

    !!!!!!! DISCLAIMER !!!!!!!!

    As I said at the top, I am mainly talking about MFC4.x (I also did a bit of MFC2.0 back in... ’94?). I don’t know how good/bad MFC7/8 is.