A couple of days before Christmas, I found myself in the emergency department of my local hospital with atrial fibrillation. Think of it as feeling like you're running a marathon while you're lying still. To correct the problem, my doctor decided to knock me out and zap me to "reboot" my system. The parallel with my day job obviously struck a chord. Apparently, just after I was sedated, I informed her that she needed to do a root cause analysis. Her response was something like "I think he needs more". Let's just say that the nurses were still laughing 10 minutes later when I woke up (with a normal rhythm).
This experience showed me how important it is to get to the bottom of a problem (well, at least in my mind). Many of the systems we build are used by large numbers of people. When they fail, these users have to live with the consequences. In the Windows world, we have this nasty habit of rebooting a server to "fix" a problem. A lot of the time this works and the phones stop ringing, but the chances are that the problem will come back, because the underlying cause is still there. Some people write this off as "typical Windows". It's not! A .NET application running on Windows Server 2003 should be rock solid. If it isn't, something is definitely wrong and should be investigated.
I'm not about to start a masterclass on application diagnostics, but it's amazing what you can learn if you take the time to look. Windows itself can tell you a lot about an application through its various logs. That said, I still believe that adding some well-considered instrumentation and exception management to an application is the best way to understand its runtime behaviour. If you put this in from day one, all the better.
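To make "instrumentation and exception management" a little more concrete, here is a minimal sketch of the idea. It is written in Python rather than .NET purely for brevity, and the names `instrumented`, `process_order` and the logger name `app.instrumentation` are hypothetical, invented for illustration. It is one possible shape, not a recommended framework: wrap an operation so that every call records how long it took, and every failure is logged with its stack trace before being re-raised.

```python
import functools
import logging
import time

# Hypothetical logger name; a real application would route this to a file,
# the system event log, or whatever its operations team actually reads.
logger = logging.getLogger("app.instrumentation")


def instrumented(func):
    """Log the duration of each call, and the traceback of any failure."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the full traceback, then re-raise so the caller
            # still sees the error. The problem is logged, not hidden.
            elapsed = time.perf_counter() - start
            logger.exception("%s failed after %.3fs", func.__name__, elapsed)
            raise
        elapsed = time.perf_counter() - start
        logger.info("%s succeeded in %.3fs", func.__name__, elapsed)
        return result
    return wrapper


@instrumented
def process_order(order_id):
    """Hypothetical business operation, used only to show the decorator."""
    if order_id < 0:
        raise ValueError("order_id must be non-negative")
    return f"order {order_id} processed"
```

With something like this in place, a failing call leaves a timestamped trail to investigate, which is rather more useful than a server that has simply been rebooted.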
Unfortunately we can't just add some instrumentation to produce Andrew 2.0 and find out why I needed jump-starting. But at least I'm looking.