Resilience is NOT necessarily a good thing


I just ran into this post by Eric Brechner who is the director of Microsoft’s Engineering Excellence center.

What really caught my eye was his opening paragraph:

I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it’s better to crash and let Watson report the error than it is to catch the exception and try to correct it.

Wow.  I’m not going to mince words: What a profoundly stupid assertion to make.  Of course it’s better to crash and let the OS handle the exception than to try to continue after an exception.

 

I have a HUGE issue with the concept that an application should catch exceptions[1] and attempt to correct them.  In my experience handling exceptions and attempting to continue is a recipe for disaster.  At best, it turns an easily debuggable problem into one that takes hours of debugging to resolve.  At its worst, exception handling can either introduce security holes or render security mitigations irrelevant.

I have absolutely no problems with fail fast (which is what Eric suggests with his “Restart” option).  I think that restarting a process after the process crashes is a great idea (as long as you have a way to prevent crashes from spiraling out of control).  In Windows Vista, Microsoft built this functionality directly into the OS with the Restart Manager: if your application calls the RegisterApplicationRestart API, the OS will offer to restart your application if it crashes or becomes unresponsive.  This concept also shows up in the service restart options in the ChangeServiceConfig2 API (if a service crashes, the OS will restart it if you’ve configured the OS to restart it).
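
As a rough sketch of what opting in looks like (the command-line switch and flag choices here are purely illustrative, not something the post prescribes):

    #include <windows.h>   // RegisterApplicationRestart (winbase.h, Vista and later)

    // Minimal sketch: ask the OS to restart this process if it crashes or hangs.
    // The "/restarted" switch is a hypothetical flag the new instance would check
    // on startup to know it was relaunched by the Restart Manager.
    void OptInToAutomaticRestart()
    {
        HRESULT hr = RegisterApplicationRestart(
            L"/restarted",                          // command line for the relaunched instance
            RESTART_NO_PATCH | RESTART_NO_REBOOT);  // skip restart after patching or during reboot
        if (FAILED(hr))
        {
            // Pre-Vista Windows or bad arguments: the app simply won't be auto-restarted.
        }
    }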

I also agree with Eric’s comment that asserts that cause crashes have no business living in production code, and I have no problems with asserts logging a failure and continuing (assuming that there’s someone who is going to actually look at the log and can understand the contents of the log; otherwise the logs just consume disk space).

 

But I simply can’t wrap my head around the idea that it’s ok to catch exceptions and continue to run.  Back in the days of Windows 3.1 it might have been a good idea, but after the security fiascos of the early 2000s, any thoughts that you could continue to run after an exception has been thrown should have been removed forever.

The bottom line is that when an exception is thrown, your program is in an unknown state.  Attempting to continue in that unknown state is pointless and potentially extremely dangerous – you literally have no idea what’s going on in your program.  Your best bet is to let the OS exception handler dump core and hopefully your customers will submit those crash dumps to you so you can post-mortem debug the problem.  Any other attempt at continuing is a recipe for disaster.

 

——-

[1] To be clear: I’m not necessarily talking about C++ exceptions here, just structured exceptions.  For some C++ and C# exceptions, it’s ok to catch the exception and continue, assuming that you understand the root cause of the exception.  But if you don’t know the exact cause of the exception you should never proceed.  For instance, if your binary tree class throws a “Tree Corrupt” exception, you really shouldn’t continue to run, but if opening a file throws a “file not found” exception, it’s likely to be ok.  For structured exceptions, I know of NO circumstance under which it is appropriate to continue running.
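
To make the footnote concrete, here’s a minimal C++ sketch of the distinction (the function and exception types are illustrative): catch the one failure you fully understand and can recover from, and let everything else propagate and kill the process.

    #include <fstream>
    #include <iterator>
    #include <stdexcept>
    #include <string>

    // "File not found" is just an error delivered through the exception mechanism,
    // so handling it is fine.  An exception signaling corrupted internal state
    // (the "Tree Corrupt" case) is deliberately NOT caught here.
    std::string LoadConfigOrDefault(const char* path)
    {
        try
        {
            std::ifstream file(path);
            if (!file)
                throw std::runtime_error("config file not found");   // understood, recoverable
            return std::string(std::istreambuf_iterator<char>(file),
                               std::istreambuf_iterator<char>());
        }
        catch (const std::runtime_error&)
        {
            return "{}";   // fall back to defaults; the root cause is fully understood
        }
        // Anything else propagates out and takes the process down, which is the point.
    }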

 

Edit: Cleaned up wording in the footnote.

Comments (66)

  1. Anonymous says:

    I like the principle: "You should handle an exception only if you know what to do with it."

  2. Doug: Works for me, but only for C++ exceptions (and RPC exceptions, which are essentially the same as C++ exceptions except they’re propagated by SEH).

  3. JamesNT says:

    Larry,

    I think I may be a little unclear so I ask for your help.

    In your example of the binary search tree, if it throws a tree corrupt exception, what would be wrong with wiping out the tree, making a new tree, and starting over?

    Also, I assume that other exceptions are not what you are talking about – such as an exception thrown because the Access database you are trying to connect to doesn’t exist, in which case you tell the user to either enter a new path or give them the option to close the program gracefully.

    Thank you for your assistance in helping me understand.

    P.S.

    I like the idea of "handle an exception if and only if you know what to do with it."  But I would extend that to languages such as those of .Net and Java.  Then again, you may have managed environments in a whole new category.  

    JamesNT

  4. Anonymous says:

    Maybe I’m not reading you right, but are you saying that if someone powers down Google’s datacenter and my C#-implemented browser gets some sort of TimeoutException from the TCP stack, the correct thing is for my browser to crash?  

  5. JamesNT: That might be ok, IF you can guarantee that the only cause of the tree corruption failure is that the tree’s internal state is corrupt.

    But if the tree corruption error is thrown because of something else (I don’t know, maybe it was because of an error in an underlying heap manager that was rethrown as a tree corruption error), you can’t.

    And that’s exactly my point.  When you encounter an exception you don’t FULLY understand, you can make NO assumptions about the state of the process.  And the only safe action to take at that point is to die and let the OS restart you if possible.

    The "access database you are trying to connect to doesn’t exist" scenario is analogous to my "file not found" example – in that case, the exception really isn’t "exceptional", it’s just a mechanism used by the database library to communicate an error and you handle it just like you handle any other error.

  6. Anonymous says:

    Reliability is a complicated thing.  There’s a tradeoff between availability and integrity,  and that tradeoff becomes more severe as a system becomes larger and more distributed.  UNIX tends to choose availability over integrity,  and Windows does the opposite.

    You’re more likely to find some funny characters at the end of a file on a UNIX system after a crash,  and more likely to have a Windows machine give up the ghost or let a badly written application lock up your desktop for a few minutes.

    Life-critical systems can’t shut down just because something unexpected happened.  Neither can large scale web sites or e-commerce systems.  There’s a whole art of system recovery,  partitioning of corruption,  and having the system stay in a ‘sane’ state that isn’t necessarily correct.

    People have different expectations for desktop apps:  people expect to have them crash and lose their work.  That’s one of the reasons why the world is giving up on desktop apps.

  7. John: You’re confusing exceptions and errors (it’s really easy to confuse the two).

    Exceptions are supposed to be used to handle <i>exceptional</i> events (like corrupted internal state).  They’re not the same as errors (which are used to express "normal" failures).  

    The only kind of exception handling that is unilaterally bad is structured exception handling (except in VERY limited circumstances like handling RPC failures and kernel mode probes of user mode addresses).  

    See my footnote: C++ and C# and Java exceptions <i>might</i> be ok IF you can guarantee you know the reason for the failure.

    I’m not aware of any networking stacks that use SEH to represent network failures.

  8. Anonymous says:

    It’s not that hard: if it’s a known exception that can be handled, handle it.  If it’s an unknown/unexpected exception, crash and report.

  9. Anonymous says:

    From that article, I didn’t get the idea that continuing from exceptions was considered a good practice. I only got the idea that using Watson alone to handle crashes is not sufficient.

    MSN

  10. Anonymous says:

    It depends on your application domain.  In my desktop, userland world, crashing is a wonderful option.

    In my brother’s medical device world, crashing means a kid stops breathing.  The FDA kinda insists that software in such devices fails in a safe way.  Crashing isn’t a safe way if it’s providing life support to the user.  Restarting the process may or may not be depending on the situation.

    Aside from that, I agree that assertions only belong in debug builds.  But if you did have something assertion-like in a release build, then it should be treated just as critically as an exception.  If your assertion failed, then your program is in an unknown, illegal, or improper state–just as it would be if an exception is thrown.  Even if the assertion itself is the bug, others who wrote the code that follows may be counting on the assumption it represents.  Report and bail out.

  11. Tanveer Badar says:

    I agree with the statement "You should handle an exception only if you know what to do with it.".

    Somehow, it seems to me that the more you try, the harder you fall. If you intend to make a system more reliable, the few failures that do occur will be even bigger headaches.

  12. JamesNT says:

    I do believe I see where Larry is coming from now.  Exceptions for things such as missing files, incorrect database passwords, and things of that nature you can handle yourself since either you know the answer, can give the user a chance to answer what needs to be done (i.e. enter the correct password or path), or can allow the program to exit gracefully.

    But for those exceptions where you don’t have the slightest idea as to what could have happened, don’t try to continue since that is analogous to ignoring there is a problem.  Let the program die in flames, then open up a formal investigation to see what happened.

    Programs that attempt to continue after an unknown exception actually sound dubious when you think about it.

    JamesNT

  13. Anonymous says:

    I think Eric just picked a really bad quote to start off his article with.  His article doesn’t really advocate "catching the exception and try to correct it" as the quote may suggest.  The closest thing to it that was advocated was "retry"ing an operation, and the examples he described have nothing to do w/ catching an exception and correcting it.

    The overall point of the article is really to make error recovery less disruptive to the user experience, and that applications need to be written with that in mind.

    This kinda reminds me of the MobileSafari browser on the iPhone and iPod Touch.  There’s been several times where it clearly crashed, but what happens is that the OS simply closes the browser without telling the user anything.  It builds up the crash dumps silently on the device and those get sent to Apple when you sync the device thru iTunes (of course they don’t call them crash dumps, but something like "customer data to improve the software").  I honestly don’t think this business of "hiding" the fact that it crashed is really that much of an improvement, but I can see users getting fooled into thinking that things are working better than they actually do, and well, user perception is king.  (As an anecdote, this doesn’t always work anyway; once my iPod Touch actually wound up in a hard freeze that required a full power off/power on to reset the device.)

  14. Anonymous says:

    Hang on….under Windows I thought structured exceptions were the basic exception type, and that C++ exceptions were implemented as structured exceptions.

    But you’re saying that C++ exceptions are not implemented as SEs?

  15. Anonymous says:

    Exceptions should be used for their original intended purpose: exceptional circumstances. All those C++ exceptions for "errors" like file not found are just unnecessary complexity; they should be replaced by error codes.

    When an application catches an exception, it should save the work in progress as much as possible, and then exit and let error reporting take over.

  16. Karellen, on some platforms C++ exceptions are implemented with SEH.  But that’s one of those "ok" scenarios (as long as you never catch the SEH exception).

  17. Anonymous says:

    I spent a decade or so working on a large server application (one that you might remember).  It was written in C, meaning that C++ exceptions were not available, and used SEH to handle rare but survivable events: out of memory, database update errors, etc.  We very carefully filtered exceptions so that we only caught exceptions that we had raised.  System raised exceptions (e.g., bad pointer deref) we very carefully would not handle.

    By your argument above what we did was "profoundly stupid" because we dared to catch an SEH exception.  That doesn’t seem right.

    I think you’re conflating SEH exceptions with system exceptions, and C++ exceptions with app ones, but that isn’t always the case.

    I prefer the "don’t catch what you don’t understand" rule above, with the proviso that most times you don’t understand as much as you think.  If an app is adhering to the very strict rule "only catch what you raised yourself", why is it "stupid" to do so using SEH and perfectly OK using C++ exceptions?  Is it still stupid when SEH is the only exception mechanism available?

  18. Anonymous says:

    While I 100% agree with the main topic of your article, the footnote is somewhat erroneous. I assume that you are familiar with C++ exception safety guarantees theory. If C++ code provides at least the basic exception safety guarantee (and if it doesn’t you shouldn’t be using it, just like you don’t use code with known buffer overflows!) then catching any C++ exception is always safe. Whether it is wise to continue as if nothing had happened is another matter, but you will never get to the broken-invariants state that is possible with plain SEH. In C# and Java, where exceptions are used for both SEH-like and C++-like purposes, you are right: only <i>some</i> exceptions are safe.

  19. Don, did you see my footnote?  You did exactly what NTFS did (and what RPC does, and what the C++ compiler does).

    My reading of Eric’s post is that he’s saying you catch everything and attempt to muddle on; that’s what I’m saying is profoundly stupid.  If your exception filter only catches a very limited set of exceptions, you MIGHT be ok (in other words, it only catches the 2 or 3 exceptions that you know you throw).
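
    For what it’s worth, that "only catch the 2 or 3 exceptions you know you throw" discipline usually looks something like this sketch (the exception code and names here are made up for illustration):

    #include <windows.h>

    // Hypothetical application-defined exception code: setting bit 29 marks it
    // "customer defined", so it can never collide with a system code such as
    // STATUS_ACCESS_VIOLATION.
    #define MYAPP_EXCEPTION_OUT_OF_RESOURCES ((DWORD)0xE0001001)

    // Filter that handles ONLY exceptions this application raised itself;
    // everything else (AVs, stack overflows, ...) continues the search and crashes.
    static int FilterOwnExceptions(DWORD code)
    {
        return (code == MYAPP_EXCEPTION_OUT_OF_RESOURCES)
            ? EXCEPTION_EXECUTE_HANDLER
            : EXCEPTION_CONTINUE_SEARCH;
    }

    BOOL DoServerOperation()
    {
        __try
        {
            // ... work that may RaiseException(MYAPP_EXCEPTION_OUT_OF_RESOURCES, 0, 0, NULL) ...
            return TRUE;
        }
        __except (FilterOwnExceptions(GetExceptionCode()))
        {
            return FALSE;   // known, survivable condition that we raised ourselves
        }
    }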

    The experience of the Windows division in the 15 years since that large server application (on which we both worked) is that SEH as an error propagation mechanism is fraught with peril.  I’m not saying you shouldn’t do it, but you need to be REALLY careful.  Next time you’re on campus, I’ll introduce you to KK (the guy who owns the top level exception handler in Windows) and he can tell you what he thinks of code that does what your large server component did.

    I still remember an issue on that same project where a particular piece of code worked perfectly EXCEPT it reliably failed when run on PPC machines – it turns out that the problem was that the RPC exception filter was trapping unaligned access errors and turning them into RPC_E_CALL_FAILED errors.  It wasn’t until I stepped into the code that I figured it out (again taking hours to chase down what should have been a 30 second bug fix).

  20. Anonymous says:

    I definitely agree that catching and eating something like an unexpected access violation is a really bad idea, and that the best course of action is to eventually crash out. What I have to take specific exception to, is the idea that the app should just defer to the OS dump mechanism. This is useless to those of us who can’t get WinQual accounts and is fairly inefficient if you need to get information that isn’t covered by a normal minidump. It’s often very valuable to log application-level information and to present an enhanced explanation to the user, and for typical user-level applications I don’t think this unreasonably impacts security. The WER dialog is pretty much useless to the user, whereas in an app-customized report I can often give some indication as to what might have triggered the crash and how to avoid it.

    What I would love is the ability to have the OS auto-launch a second process whenever it sees a crash in my app’s process, with the app frozen so the second app can analyze it safely. Unfortunately, unless I’m mistaken, the main choices only seem to be in-process handling (SEH or WER callback), or outright termination in case of severe failures.

  21. Phaeron: I’ve forwarded your suggestion to the dev lead for the Watson team, it’s an interesting idea.

  22. Anonymous says:

    I think you’ve misread Eric’s post and set up a bit of a strawman here. His point is that instead of crashing out to the OS exception handling, you should inform the user that an error has occurred and (at the limit, his 5th R) offer to get the user a new device.

    So fair enough; he has some fairly odd ideas about how to handle errors (perhaps open a browser window at dell.com to buy a new PC?), but his point is that you should let the user know an error has occurred; log it; return the user to a known state; and continue from there, rather than just crashing with a useless error dialog that most users will have no idea how to respond to. (Well, they do; just click "Don’t Send")

    You’ve based your post on the idea that to "catch the exception and try to correct it" means to continue from an unknown state; which is not what he said.

  23. Anonymous says:

    Phaeron: Isn’t that what happens if you install WinDbg as the system post mortem debugger instead of Watson, or have I misunderstood?

  24. Phaeron: The Watson lead indicated to me that there is no cost to setting up a Winqual account, so he was confused about why a developer (or team of developers) wouldn’t be able to get one.  Is it the effort of getting a code signing cert?

    Steve: My point is that if you "let the user know an error has occurred", you’re running code while your process is in an unknown state (the code to let the user know and log the error).  Let the OS let the user know an error has occurred and restart your app.

    The Watson team has literally spent years working on figuring out how to reliably and safely dump core from a corrupted process, it’s an extraordinarily hard problem that should not be re-invented.

  25. Anonymous says:

    IIRC, one valid use of catching and continuing SEH exceptions was when using VirtualAlloc to *reserve* a huge array and then only *commit* a page or so at a time.
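
    That reserve/commit pattern looks roughly like the following sketch (sizes and names are illustrative); the filter commits the faulting page and resumes the faulting instruction, which is one of the rare legitimate uses of continuing after an access violation:

    #include <windows.h>

    static char*        g_base;                       // base of the reserved region
    static const SIZE_T kReserveSize = 1 << 28;       // 256 MB reserved up front

    // Commit the page that was touched, then retry the faulting instruction.
    static int CommitFaultingPage(EXCEPTION_POINTERS* ep)
    {
        if (ep->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION)
            return EXCEPTION_CONTINUE_SEARCH;

        char* addr = (char*)ep->ExceptionRecord->ExceptionInformation[1];
        if (addr < g_base || addr >= g_base + kReserveSize)
            return EXCEPTION_CONTINUE_SEARCH;          // not our region: let it crash

        if (!VirtualAlloc(addr, 1, MEM_COMMIT, PAGE_READWRITE))
            return EXCEPTION_CONTINUE_SEARCH;          // commit failed: genuinely fatal
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    int main()
    {
        g_base = (char*)VirtualAlloc(NULL, kReserveSize, MEM_RESERVE, PAGE_NOACCESS);
        __try
        {
            g_base[12345] = 42;                        // faults once, page gets committed, retried
        }
        __except (CommitFaultingPage(GetExceptionInformation()))
        {
            // Only reached if the commit itself failed.
        }
        return 0;
    }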

  26. Anonymous says:

    Upon re-reading Eric’s article, I think Larry has misread his intent.  Eric is talking about error handling in environments that must be resilient.  Many of his examples are of non-exceptional errors.

    And other than his first R (Retry to get past a transient condition), the rest of his advice (Restart, Reboot, Reimage, Replace) is about getting the process and system back into a valid state.

    I don’t think the points of view you two are arguing are that different.  I think he starts out provocatively, and that’s what a lot of people are reacting to.  He’s not saying you can trust the state.  He’s saying you have to keep your service up, so find a way to get back to a known good state.

  27. Anonymous says:

    "let the OS exception handler dump core"

    Or have your own exception handler dump core.

  28. MSDN Archive says:

    Larry, your linked post ("Structured Exception Handling Considered Harmful") contains a description of what you consider (and I agree) a proper use of exception handling in the NT file system: each function cleaning up properly by understanding and restoring its state.

    I fundamentally believe it’s conceivable for that proper use of exception handling to move up the stack into application layers.  Like any programming tool, SEH can be used well or abused.

    What’s fundamentally lacking is the art of "failure modeling": for each possible failure, map out what can be done and what cannot.  It is absolutely true that there are fates worse than death (where death==crash) including data loss and security vulnerabilities.  Proper failure modeling — which is also state modeling — will help you figure out which case is which.

    As for debuggability/diagnosability, Eric was pretty clear that *everything* needs to be logged.  I’d never catch an exception without logging it and providing a feedback pipe back to development.  

    "Only catch what you can understand."  Also good advice.  But we should understand more.  Again, feedback loops and failure modeling help us do that.

  29. Anonymous says:

    >I also agree with Eric’s comment that asserts that

    >cause crashes have no business living in production code,

    Uh, I don’t agree with that one.

    When I code an assertion, I am saying "I, the programmer, believe that the stated condition is absolutely always true, and if it’s not true, then the subsequent code isn’t going to work, because my design constraints have been violated".

    Which is to say, if the assertion fails, I’ve lost control over the code, I don’t know what it’s doing, and I want it to stop running now before I make things worse.

    In other words, production code is where I want my assertions in line, because otherwise I’m going to damage production data.

    Naturally (1) I need to have extensive test coverage to validate that those assertions never ever fail, (2) I need the discipline to ensure I don’t write code that gets an assertion failure over something outside my control, like receiving a malformed network message, (3) I understand the difference between ‘my code is broken’ and ‘the user did something wrong’.
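
    A minimal sketch of what such a ship-build assertion might look like (the macro name, logging, and exit code are illustrative); the point is that it stops the process immediately instead of limping on with violated design constraints:

    #include <windows.h>
    #include <stdio.h>

    // Illustrative release-build assertion: log the violated condition, then
    // terminate at once.  TerminateProcess runs no unwinding or exception
    // handling in this process, so no further damage is done to production data.
    #define SHIP_ASSERT(cond)                                               \
        do {                                                                \
            if (!(cond)) {                                                  \
                fprintf(stderr, "Assertion failed: %s (%s:%d)\n",           \
                        #cond, __FILE__, __LINE__);                         \
                TerminateProcess(GetCurrentProcess(), 0xDEAD);              \
            }                                                               \
        } while (0)

    void TransferFunds(int fromBalance, int amount)
    {
        // Design constraint: callers must already have validated the amount.
        SHIP_ASSERT(amount >= 0 && amount <= fromBalance);
        // ... proceed, relying on the invariant ...
    }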

  30. Anonymous says:

    Catch an exception, continue running, but only to save the current document – and then exit. For that to work you need, of course, to keep the document in a consistent state in memory…

  31. Tony: That’s the rub.  How do you know that the document is in a consistent state in memory?

    It’s not an easy challenge.  That’s why I’m continuing to state that crashing is the right thing to do.

  32. Anonymous says:

    Tom M:

    Not quite… the idea is that I want a customized application to act as the "debugger." Actually, what I probably would do is just have the SAME application launched a second time, and have it use the regular Win32 debug APIs to analyze the crashed instance and recover data. It’s mostly the same as an in-process exception handler, but the process separation would make it much more robust (although more difficult to write).

    Larry:

    Yes, the certificate is the blocking point. I’ve never been able to find very good information on what exactly is required for OCA access for individuals like me. The WinQual site itself offers little information, and I’ve seen conflicting information in various places. Cost is one issue, although it looks like a single $99 certificate is sufficient. The other big issue is that everything I’ve seen regarding WinQual and the Class 3 Verisign certificate required to sign up for it only refers to companies — it looks like individual developers not associated with an official business entity aren’t eligible. All of the OCA literature also refers to companies, which doesn’t encourage me to spend money in an experiment.

    The uncertainty as to whether I could participate in OCA wouldn’t bother me except that there seems to be a recent trend toward blocking non-OCA diagnosis methods, unintentional or not. What really pissed me off was when I found out that the Visual C++ library team stuck code in the 8.0 CRT that explicitly tears off any existing exception handler and calls Watson directly. I think that unless the Windows and WinQual teams ensure that small ISVs can participate and provide clear directions for doing so, it isn’t appropriate to assume that everyone can use Watson + WER + OCA.

    Don’t get me wrong, I’d love to get OCA reports for my application, even if the ones that fell through that path didn’t have all of the diagnostic information that my app’s normal exception handler dumps. I take all crash reports seriously. There’s just too much ambiguity and uncertainty involved in getting set up. I haven’t found any report from a Microsoft employee along the lines of, "yes, we’ve successfully had individuals not associated with a business sign up to WinQual for crash reports with just certificate X."

    On a side note, I just realized… regarding your comment about asserts that cause crashes in release code: doesn’t the NT kernel do exactly that?

  33. MSDN Archive says:

    >[Larry]: How do you know that the document is in a consistent state in memory?  It’s not an easy challenge.  That’s why I’m continuing to state that crashing is the right thing to do.

    I read that as "it’s hard, therefore we shouldn’t try".  But I know that’s not what you mean, because you have some great examples of exception handling done right.

    I think many of the commenters on both blogs (Eric’s and yours) are looking at the problem too coarsely; too black and white.  How do you know that the document is in a consistent state in memory?  By designing in consistency checks.  Little extra validation routines that take advantage of a little extra redundancy built into your document’s memory representation.

    This isn’t really novel work.  It’s just work that hasn’t traditionally been a priority at Microsoft outside of specialized teams.  But I believe, and I think this is Eric’s point as well, that Trustworthy Computing includes Highly Available Software and that we need to tackle the hard challenges associated with that.

  34. Tanveer Badar says:

    One question.

    When the JIT translates callvirt into x86 assembly, it turns it into

        mov eax, [ecx]
        call whatever

    The mov will raise an exception if ‘this’ is null, because the processor will attempt to read from address 0 into eax. However, this exception is caught and eventually turned into a NullReferenceException.

    But [ecx] could equally reside outside committed memory, and that would be turned into a NullReferenceException too.

  35. Anonymous says:

    Badar: I think that it operates on the assumption that in managed code all pointers to objects must be valid or null, because you have no way to generate a pointer to an object which points into some random memory.

  36. Anonymous says:

    How does Watson decide what memory to include in the dump sent to Winqual? Sometimes I needed to check a structure whose reference was passed as a parameter a little higher up the call stack. Unfortunately, most of the time that memory was not included in the dump.

  37. Alan, I’m willing to concede that for some exceptions it may be possible to catch them and terminate after saving state.  But for many of them (like STATUS_ACCESS_VIOLATION, which is likely to be the most common one) there is no safe way of running ANY additional code in the process.  We have far too many examples of security vulnerabilities caused by people trying to be resilient in the face of an access violation error for that kind of practice to be considered safe.

    Jiri, by default Watson generates a minidump, which consists of the stack and thread context for each thread in the process.  You may add additional data to the dump with WerRegisterMemoryBlock.  See here for more details: http://msdn2.microsoft.com/en-us/library/ms678713(VS.85).aspx
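
    For Jiri’s scenario, the usage is roughly this sketch (the structure and names are hypothetical): register the memory you care about ahead of time, and Watson will include it when it writes the dump.

    #include <windows.h>
    #include <werapi.h>    // WerRegisterMemoryBlock, Vista and later

    // Hypothetical per-request state we'd want to see in crash dumps.
    struct RequestState
    {
        DWORD lastRequestId;
        char  peerName[64];
    };

    RequestState g_request;

    void RegisterStateForWatson()
    {
        // Ask WER to include this block in any dump it writes for the process.
        HRESULT hr = WerRegisterMemoryBlock(&g_request, (DWORD)sizeof(g_request));
        if (FAILED(hr))
        {
            // Non-fatal: the dump just won't contain the extra data.
        }
    }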

    Phaeron: As far as I know, the NT kernel doesn’t have many asserts that are live.  Some of them (page fault at raised IRQL) are sort-of asserts, but they exist because (a) in all circumstances, a page fault at IRQL2 is a bug, and more importantly (b) there is no way to satisfy the page fault.

  38. Anonymous says:

    > You may add additional data to the dump with WerRegisterMemoryBlock.

    What I would like is equivalent of MiniDumpWithIndirectlyReferencedMemory. While that might be doable with the WerRegisterMemoryBlock and manual stack walk, I do not think that attempting to do that from the exception handler or crashing process is a good idea.

  39. Anonymous says:

    Phaeron:

    The main reason the NT Kernel bluescreens (i.e. ‘asserts’) is to protect the disk data and metadata from corruption.

    I disagree with Larry’s point that assertions should not be in released code.  I just think we don’t do it for performance reasons.  

  40. MSDN Archive says:

    >[Larry]: But for the many of them (like STATUS_ACCESS_VIOLATION, which is likely to be the most common one) there is no safe way of running ANY additional code in the process.

    I think that statement also is too absolute (too black and white).  Even Dave LeBlanc points out (in a post inked above) that it’s not about risk avoidance, it’s about risk management — shades of grey.

    But you yourself point out the perfect counterexample: catching access violations while probing parameters across an untrusted->trusted boundary.  In other words, there are patterns and practices where even catching access violations, in a controlled way, can increase both resiliency and security.

  41. Alan, here’s the problem.  Let’s say you have all sorts of internal consistency checks and you KNOW that your data structures are likely to be intact.  So you install a top level exception handler wrapped around all your code that saves your state and exits.

    How do you know that the exception handler wasn’t called because some attacker exploited a flaw in a validation check in your code (see http://www.matasano.com/log/1032/this-new-vulnerability-dowds-inhuman-flash-exploit/ for the classic example of this), enabling him to exploit an error in your exception handler?

    By their very nature, exception handlers are less tested than the rest of your code, and they’re vastly more dangerous.  As I mentioned above, the Watson team has literally spent years refining the built-in exception handler code (which does nothing but dump core) to reduce the likelihood of vulnerabilities in the code.  You’re proposing that not only should the exception handler be application specific, but also that it invoke the save file handler and potentially put up UI.  That means that you have even MORE code that is being run when the application is in an unknown state.

    I think a far better solution is to do what Office does: it installs an application restart handler and checkpoints its state periodically.  When it recovers from a crash, it looks for one of the checkpoint files and attempts to recover it.  You should also add a bunch of validation checks to the recovery process because you don’t know the full state of the file (I don’t know if Office does this).  It means that you get the resiliency you desire WITHOUT the threats associated with running code after an access violation.
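
    A rough sketch of that checkpoint-and-recover-on-restart shape (the file name, command-line flag, and validation steps are all illustrative, not how Office actually does it); note that no recovery code ever runs inside the crashed process:

    #include <windows.h>
    #include <wchar.h>

    static const wchar_t kCheckpointPath[] = L"app.checkpoint";   // illustrative

    // Called from normal, healthy code paths (e.g. on a timer) - never from a crash handler.
    void SaveCheckpoint(const void* doc, DWORD size)
    {
        HANDLE h = CreateFileW(kCheckpointPath, GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) return;
        DWORD written = 0;
        WriteFile(h, doc, size, &written, NULL);
        CloseHandle(h);
    }

    // Run at startup of the NEW instance when the restart flag is present.
    // The checkpoint may predate the crash by minutes, so treat it as untrusted
    // input and validate (magic numbers, checksums, invariants) before loading it.
    bool TryRecoverCheckpoint()
    {
        HANDLE h = CreateFileW(kCheckpointPath, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) return false;
        // ... read and validate, then rebuild the document from it ...
        CloseHandle(h);
        return true;
    }

    int wmain(int argc, wchar_t** argv)
    {
        RegisterApplicationRestart(L"/recover", 0);   // OS relaunches us with this flag after a crash
        if (argc > 1 && wcscmp(argv[1], L"/recover") == 0)
            TryRecoverCheckpoint();
        // ... normal run; call SaveCheckpoint periodically ...
        return 0;
    }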

    IE has a similar solution (and they’re improving it for IE8).  In IE8, if an IExplore instance crashes, it doesn’t affect the IE hosting application, which will restart the instance in-frame.

  42. Anonymous says:

    > How do you know that the reason that the exception handler was called was because some attacker has exploited a validation check in your code that has enabled him to exploit an error in your exception handler?

    > I think a far better solution is to do what Office does: it installs an application restart handler and checkpoints it’s state periodically. […] It means that you get the resiliancy you desire WITHOUT the threats associated with running code after an access violation.

    I don’t think this is a good justification by itself for pushing the recovery process out of process. If the vulnerability that was exploited was due to malicious data, and the exceptional path is less reliable due to being exercised less, then I’d say the vulnerability could hit the recovery code just as well as the main code. Given that save routines are often agnostic to the data being serialized, the autosave routine may just push the corrupted data to disk without itself crashing.

    Pushing the recovery handler out of process does greatly reduce the risk of recursive crashes due to general process badness, but I’d say it’s weak security-wise unless there’s something fundamentally different than the original process, such as it runs with extremely limited process privileges IE7+Vista style, or it’s written in a different language such that the original vulnerability is impossible. I’m not a fan of .NET for mainstream desktop applications, but I could see using it for the recovery app.

  43. MSDN Archive says:

    >[Larry]: …you install a top level exception handler …

    Let me make a clarification.  I think many readers are assuming (and a lot of reaction is coming from this) that Eric and I are talking about recovery from top-level (global unhandled) exception handlers.  Speaking for myself, I am not.  The only thing I’ve ever done in a global unhandled exception handler (in a managed-code hosted service) is log and die.

    Since state management is really the hard problem we’re discussing, the scope of any exception handling I’m proposing must be limited and controlled such that state is manageable.  

    (Yes, I read the Flash story.  Truly amazing.)

  44. Alan, ah, that makes a lot more sense to me – I use locally scoped exception handlers a lot (I live in an error-code based world where exceptions are evil).  As I’ve said before, in certain limited scenarios (kernel/user parameter probes, RPC error handling, etc) locally scoped handlers can be quite useful.  

    I HAD assumed (based on the recovery behaviors that Eric was proposing) that you and he were promoting the idea of global top level exception handlers.

  45. Anonymous says:

    "At best, it takes an easily debuggable problem into one that takes hours of debugging to resolve."

    If a crash happens on an end user’s PC, then it is extremely likely NOT EASILY DEBUGGABLE. Most end users are not developers. So, where is the benefit?

    Programs should try to recover as much as they safely can – but not more. And they should tell – in simple words – what went wrong.

    Did you notice that Vista does not just abort a copy operation because of a full drive, but actually allows you to free up space and retry and finish the rest of the operation? (Handy for that memory stick that’s always full.)

    "The bottom line is that when an exception is thrown, your program is in an unknown state."

    Agreed. But what about all the _known_ states that a programmer could, but did not handle?

    I can actually do without a functioning spell checker if all I want to do is print a page of a document. But I can not do with an app dying or even a BSOD, only because some data file – not necessary for the task at hand – was not found.

    If I understood Eric’s article correctly, he focused on the user experience. Not the developer experience. 🙂

  46. Anonymous says:

    @HagenP:

    Microsoft typically can debug problems that happen on end-user PCs if the markers are clear enough (i.e. the crash happens close enough to the cause of the corruption) through the Watson facility.  From the perspective of a developer trying to diagnose crashes in the field, having something crash early is good.

    And by the time the product ships, there should be no known states that the programmer coulda, wouda, shoulda covered, but didn’t.  If the feature in question is not of that quality level, then it should be pulled out or the release delayed.  At least this is the case in the commercial world.

  47. Anonymous says:

    @nksingh:

    "And by the time the product ships, there should be no known states that the programmer coulda, wouda, shoulda covered, but didn’t."

    Of course we must apply the statement to all software involved, including OS, drivers, etc.

    If all these are covered, all known states handled, then the only thing LEFT to cause a crash is faulty hardware, correct?

    "If the feature in question is not of that quality level, then it should be pulled out or the release delayed.  At least this is the case in the commercial world."

    I totally agree with you here. The key word here being "should".

    Unfortunately, modern software is so complex that with this precondition nothing could be shipped anymore.

    Including Operating Systems.

  48. Anonymous says:

    Tell me about it.  I’ve just recently started working with an internal project maintained by another group of developers.  All of their public APIs are wrapped in code like this:

    __declspec(dllexport) BOOL SomeFunction(…)
    {
        BOOL bResult = FALSE;
        __try
        {
            __try
            {
                // do stuff
            }
            __finally
            {
                // clean up any resources
            }
        }
        __except(EXCEPTION_EXECUTE_HANDLER)
        {
            // log some generic exception message
        }
        return bResult;
    }

    And they validate all of their parameters with IsValidWritePtr().  Needless to say I hate my life.

  49. Anonymous says:

    Hey Larry… long time no comment on your blog 😉

    I think there’s two different classes of application here.

    Most apps, you really do want to have them crash as soon and as hard as possible.

    Other apps, you want to recover. Eg. any kind of compiler. It’s much better to get as much data on the errors in the dataset you’re handing to it, than to have it abort on the first one.

    I’ve seen some horrible apps recently (naming no names – and not MS ones either) which choose to explicitly crash whenever they hit issues… which makes debugging the bad data one hell of a chore.

  50. Wow Simon, long time no hear :).

  51. Anonymous says:

    I think that, like many other issues in writing commercial software, this one is compounded by conflicting interests from the business side and the software side:

    Business side wants software that never crashes, ever. Creates a better customer experience.

    Software side wants software that is easier to debug, since it will be more maintainable and improve faster.

    However, speaking from a purely theoretical standpoint, is it not possible to have an exception handler that is located in read-only memory, allocates its own memory for storing variables to avoid corruption, and that examines the data included with the exception and the entire program state to determine exactly what is inconsistent about it, and possibly restore it to a consistent state? Is this something the Watson team has considered? Or would it be deemed a potential security vulnerability?

  52. Anonymous says:

    Hi programmers,

    if you’ll let a user chime in with a comment:

    Larry, one sentence caught my attention (about Office resiliency). I’ve been using and providing kinda technical support (as the most computer-literate person in a large organisation in my country) for 10 years and all versions of Office in between, and never succeeded in (auto)recovering any usable data after an application crash. Or were you talking about application resiliency? But that’s not what we users care for…

    Bye

  53. Anonymous says:

    I understand the appeal to authority that Larry is using ("trust us, we’ve been bitten so many times that we know what NOT to do") and in fact I agree with what he means to say.

    But it’s a horrible argument to make. You have to present the case for benefit to the end user, because the psychology you are fighting against is a conscious developer (or developer management) decision to ease a pain that the user faces by taking a shortcut that will be in almost all cases seen only by a developer as a bad thing.

    And Larry, you’re too smart to ignore that when that non-showstopper bug sneaks in near release, you are going to get a suboptimal solution to the problem. In addition to this, tracking down the specific exceptions that will happen requires some rigor in testing approach and collection of failure data. Only since perhaps 2002 has Microsoft even approached the point where this would make it past triage.

    I’m just saying that a flippant response to people acting in good faith (and decent knowledge of the consequence) is distasteful and reflects badly on you.  It may feel pedagogical for you to challenge the assumptions underlying these practices, but you’re not backing it up with a real strong argument addressing the root cause.

    Boiling this down to security in Windows OS code or assumptions about invariants is just bolstering the real reaction to your talk – which is "well, yeah. But that security issue in the top-level exception handler is way more important to you poor saps at Microsoft who have deliberately ignored security issues for years. I’ll make mine more robust in the 99% case and worry about global consequences once I’ve made my billions, thanks very much. That’s what you did."  Your appeal to authority falls hard on the mea culpas that have been given multiple times by your authority, even if no one doubts your current sincerity or ability. Your "OK maybe back in Windows 3.1 it was OK" comment seems to really show your own evolution and coming of age more than any useful guidepost for others (IE made horrible decisions even leading into Win2k, well after 3.1, and I know you know about them).

    But it’s still fun to see you openly sniping at Engineering Excellence publicly.

  54. Jay: I’m not on the Watson team, so I can’t explain the problems.  But the Watson folks have been trying to solve the problem (generating reliable crash dumps from corrupted processes) for a long time and it’s still not perfect.  It’s a very hard problem, and thus not one to be undertaken lightly.  

    Friday: Interesting, I’ve never had a problem with Word’s autorecovery.  Go figure.

    Triangle: You can’t make any assumptions when an exception happened.  You might be able to write out the minidump, but how would you "fix" the state of memory?

  55. Anonymous says:

    ‘You can’t make any assumptions when an exception happened’

    Are you sure of this? For example, would it be wrong to assume that code is stored in read-only memory, and that though the state of data may be inconsistent, the code will not be? Or that there must be addresses that are hard-coded into the code that reference static or thread-local data? Can I not assume that, if one knows about the internal structure of the program, it would be possible to inspect data values in it and determine whether or not they lie inside the range of valid values for that program? How can you make such a sweeping generalization when there is so much read-only state in a program?

  56. Triangle: On platforms with DEP (or NX or W^X) enabled, you might (not always, but sometimes – it depends on what was happening at the time of the crash).  The problem is that Windows runs on platforms where DEP is not supported, and DEP is optional – not every process running on the machine has NX enabled.

    I’m not an area expert, but I have talked to the area expert, and he’s pretty adamant that you can’t trust the state of the process.

  57. Anonymous says:

    Triangle: I suspect one of the problems is that it isn’t just your exception handling code that has to work.  If your handler is going to actually do anything, it has to call API functions, and (it is my understanding that) the API libraries also keep data in the process memory space.

    If the system heap is corrupt, or the handle table, or whatever, most API functions aren’t going to be safe to call, and it is likely to be difficult to work out which ones are.

    Microsoft might be able to address this by creating a separate limited API for this specific purpose, but probably the only way to get very much done safely would be to launch a new process to examine the crashed one, as Phaeron suggested.

  58. Harry, it’s my understanding that the watson guys solved it by writing an app (werfault) that generates the dump information and processes it.  Essentially the same idea but much more reliable.

  59. Anonymous says:

    Larry, I did read your footnote, and it said "For some C++ and C# exceptions, it’s ok to catch the exception and continue", but that "For structured exceptions, I know of NO circumstance under which it is appropriate to continue running."

    Given that we were writing in C, the C++ and C# exception handling mechanisms were not available.  If we wanted to use exceptions at all, SEH was our only option.  You know as well as I that "it will all be fine if you re-write your entire app in {Lisp|Prolog|C#|today’s new language}" is rarely, if ever, a feasible alternative.

    We were as careful as possible (given the bugs in the SEH mechanism at the time!) to catch only those exceptions that we had raised, and what we did still doesn’t seem "profoundly stupid".

    You could argue that C++ & C# exceptions are safer than SEH, for the standard reasons that separate address spaces are better, in that they avoid accidental collisions between different bodies of code.

    I think it’s better, though, to argue that you should only handle your own exceptions.  C++ and C# might make that automatic, but when they’re unavailable it’s still possible to use SEH properly.  That’s not what you argued, though.

    Don

    P.S.  In the 14 years I worked on that code, the Windows guys never once complained to me about exception handling, although possibly only because they were too busy complaining about how I used heaps.

  60. Don, 14 years ago, you probably made the right decision.  And if your exception filter is carefully constructed, it’s possible to do it right.  

    It’s also possible to ride a motorcycle without a helmet or to walk a tightrope from one 100 story building to another or to strap a jet engine on your back and rollerskate.

    But I wouldn’t advise any of the above unless you were REALLY careful and knew exactly what you were doing.

  61. Anonymous says:

    Larry, at the UK Vista launch last year, one of the presentations was about the Vista application recovery "feature".

    The guy demonstrating this explained that this allowed an application that has *crashed* to have a chance to be called back to save data. I must admit my jaw dropped – we were being told that it was a good idea to grovel around our data structures *after* the application has entered an unknown state – and then write things to disk.

    What are your feelings about the app recovery feature? Surely this must be just as bad as doing things after you have an unknown exception.

  62. Julian, I mentioned the Vista application recovery feature above – the restart manager restarts the application that’s crashed with a command line parameter you set.  It does nothing to the crashed process.

  63. Anonymous says:

    Larry, from MSDN, for RegisterApplicationRecoveryCallback:

    "If the application encounters an unhandled exception or becomes unresponsive, Windows Error Reporting (WER) calls the specified recovery callback."

    To me, that looks as if the *crashed* application is called.  I understand that *another* instance can be created automatically by the manager later.

    My problem (and yours, if I understand your article) is with the idea that it can ever be a good idea to try to continue when an *unknown* error has occurred.

    Restarting the app is ok, the application recovery callback is the stupid idea.

  64. Igor Levicki says:

    So if I write a kernel mode driver to access the CPU MSR registers it is OK to BSOD the OS because someone using the driver has attempted to access the MSR register which doesn’t exist on a certain CPU?

    Bear in mind that new MSRs are added as new CPUs are made, and driver cannot be responsible for validating the input. The only thing driver _can_ do is to catch the exception and return error just like it does at present:

    __try {
        __writemsr(reg, value);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        return GetExceptionCode();
    }

    I also prefer the early crash, but sometimes it is not necessary.

    As for the Restart Manager, it is a nice idea but useless in its current incarnation. Once the application crashes, the context has been lost. If there was no (auto)save and thus the work has been lost, then the user can restart the application on his own; there is no benefit in offering to restart automatically.

    What would be useful, however, is if the Restart Manager could periodically take snapshots of the registered and running application and restore the snapshot on a crash.

  65. Anonymous says:

    The dilemma:

    When an unexpected exception is caught at a high-level layer of the application, what should we do?

    Approach 1. Add an entry in the event log, send a Watson report and then crash the application because we don’t know and we can’t predict the state of the application accurately. Therefore it is unsafe, and continuing may result in "incorrect" and "unexpected results".

    Approach 2.  Add an entry in the event log, send a Watson report and continue the process. This way the application is more available and reliable, and if the user or admin sees a problem they can terminate the application and restart it.

    What’s common in approaches 1 and 2:

    1. log + send Watson  

    2. Predict the nature of the error, i.e. is it fatal or not fatal? Is it safe to continue?

    3. Use approach 1 or 2 based on some heuristic.

    Some comments for Approach 1

    Taking approach 1 "blindly" is fatal for server-based applications. Server apps need availability and reliability. Suppose there is a bug in the server code, i.e. a client can crash the server by sending a bad request which was not expected by the server.  Due to a coding bug we might get a NullRef; should we crash the process if it is caught at a higher-level exception handler layer?

    So in this context, submitting the error report and continuing the process may seem like the right decision.  Because it is a coding bug, crashing the process means a DoS attack. In addition, if you crash the process you can also put other ongoing requests in an unstable state.

    However, I can come up with a different example which may portray approach 1 as good, because with context it is easy; without context it is a guessing game.

    In reality there is no right answer. You have to compromise and predict. You have the following information about the exception:

    1. Type of Exception

    2. Where it occurred.

    Here is my heuristic list.  Your list may vary depending on the nature of the application.

    1. If there is any known fatal exception such as an OutOfMemory exception, then CRASH the server. Never mess with OOM, even though some people who work with .NET and COM a lot may have a different opinion, because COM sometimes returns OOM (E_OUTOFMEMORY) but the process is not OOM.

    2. NullReferenceException, ArgumentException, CastException, or ArrayOutOfBoundException are usually considered less fatal exceptions and MAY be due to a coding bug. We should not crash the server. Statement 2 depends on the statefulness of the application: the more state you have, the more likely it is that the exception is fatal. Corruption of in-memory state can result in unpredictable results, which is even worse than incorrect results.

    3. Give exceptions more weight based on the location (exception.TargetSite.DeclaringType) where they occurred. Keep a list of all high-profile classes (which have state and synchronization logic), and crash the process if an unexpected exception occurs there.

    I am sure someone can come up with a better heuristic list.

  66. Anonymous says:

    Monday, May 12, 2008 10:23 AM by LarryOsterman

    "Friday: Interesting, I’ve never had a problem with Word’s autorecovery.  Go figure."

    Interesting.  I guess that means it works in English language versions of Word.  I’ve never seen it work though, just like Friday.
