There are things worse than crashing


 

It’s annoying when an application crashes, but there are worse things. In my opinion, for end-user applications (as opposed to mission-critical apps):

- The best thing is for the program to just work as expected.
- A crash is better than data corruption. When a program crashes, you can at least restart. Data corruption can cause a much larger loss.
- Debuggable crashes are better than non-debuggable crashes. A crash that occurs immediately at the point of the bug is generally pretty easy to triage. (This is what Watson does.) You’ve got a callstack that catches the culprit red-handed. This is like catching a criminal right at the crime scene. In contrast, sometimes a bug corrupts state and the program doesn’t actually crash until much later. In this case, it may be very difficult to determine the original bug from the crash.
  I derive this from my belief in optimizing for simplicity. A debuggable crash is more likely to get fixed than a non-debuggable one, and thus go away. (Rick plus enough Watson bugs have influenced my thinking here.)
- A crash is better than a deadlock. When a deadlock occurs, you sometimes wonder if the UI is just temporarily hung and if it’s coming back. A crash doesn’t have the suspense. Also, crashes generally have a single callstack pointing to the immediate culprit. With deadlocks (especially ones that aren’t just lock-based), it may be harder to assign blame.

 

To summarize the above in list form, I’d say:

  1. (Best): Application works as expected.
  2. Mainline scenarios work as expected. (E.g., bugs exist, but they’re in such rare corner cases that nobody really notices or cares.)
    (big gap)
  3. Application crashes immediately at a bug.  This is generally easy to triage (and therefore hopefully fix).
  4. Application deadlocks. Usually easy to triage.
  5. Application crashes long after the relevant bug. This is usually hard to triage, and it is hard to determine what the original bug was.
  6. (Worst): Serious data-loss or corruption

 

Practical design Principles?

I think there are a few practical design principles that come from this.
1) There’s a tension between #3 (crash early) and #5 (crash late). If your program detects that some invariant is fatally broken, how hard should you try to recover? If you can reasonably recover, avoid the crash, and get back to a sane state, then great – do that. But if you really can’t recover and are just postponing the crash, then keep it simple and crash sooner rather than later.
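
To make the distinction concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `withdraw` function, the `balances` dictionary, and the `InvariantViolation` exception are not from any real application. Bad input is treated as recoverable, while a broken invariant stops the operation right at the point of detection.

```python
class InvariantViolation(Exception):
    """Internal state is broken beyond repair (hypothetical name)."""


def withdraw(balances, account, amount):
    # Recoverable: the caller passed bad input, but our own state is
    # still sane, so report the problem and carry on.
    if amount <= 0:
        raise ValueError("amount must be positive")

    balance = balances[account]

    # Not recoverable: a negative balance means some earlier bug has
    # already corrupted our state. Stop now, while the stack is still
    # close to the problem, instead of limping on and failing later
    # somewhere unrelated.
    if balance < 0:
        raise InvariantViolation(
            f"negative balance for {account!r}: {balance}"
        )

    if amount > balance:
        raise ValueError("insufficient funds")

    balances[account] = balance - amount
    return balances[account]
```

The assumption here is that callers may catch `ValueError` and show a message, but let `InvariantViolation` propagate all the way up, so that the process dies and whatever crash reporting is in place sees a stack that points at the check.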

2) If your program operates on some large data-file, ensure that the program never leaves the file in an inconsistent state, since the program may crash before it can restore the file. (Outlook 2007 is really great about this. Despite all the Outlook crashes, it has never corrupted my inbox.)
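
One common way to get this property (I have no idea whether Outlook does it this way) is to write the new contents to a temporary file and then swap it into place in one step. Below is a rough sketch in Python; the function name `save_atomically` is made up for this example. It relies on `os.replace`, which is atomic on POSIX filesystems when both paths are on the same filesystem; treat the guarantees on other platforms as something to verify for yourself.

```python
import os
import tempfile


def save_atomically(path, data):
    """Write `data` (bytes) to `path` so that a crash leaves either the
    old file or the new one on disk, not a half-written mix."""
    directory = os.path.dirname(os.path.abspath(path))

    # The temp file lives in the same directory as the target so the
    # final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk first
        os.replace(tmp_path, path)  # swap the new file in for the old
    except BaseException:
        # Something failed before the swap: remove the temp file.
        # The original file at `path` has not been touched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the program crashes anywhere before `os.replace`, the worst outcome is a stray temp file next to an intact original. For a really large file, rewriting the whole thing on every save may be too slow, and a journal or append-only log is the usual alternative; that is beyond this sketch.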


Comments (10)

  1. Mihailik says:

    There is a problem.

    Cases 3, 4 and 5 are almost the same for end-user.

    I agree, the difference might be of paramount importance for a developer. But still, the logistics usually take more time than the actual bug fix. The end-user doesn’t care how much effort was needed to fix a bug. The whole release procedure, retesting, schedule aligning and hotfix shipping take a huge amount of time compared to the actual development.

    Which means a fail-fast strategy doesn’t benefit the end-user, whereas postponing the crash obviously has value for the end-user.

    That is why I say your picture is too extreme.

    Even exception swallowing may be the right strategy if we value end-users’ satisfaction more than development simplicity.

  2. jmstall says:

    Everything’s about balance. I’m not saying remove all the catch blocks from your code!

    I think this sums up my opinion on where the balance is:

    "If you can reasonable recover and avoid the crash and get back to a sane state, then great – do that"

  3. davkean says:

    Oleg,

    That’s assuming you know what kind of state you are in. It is a recipe for disaster if you handle (and ignore) every exception without knowing what it is. For example, delaying the crash could actually mean crashing again the very next thing the user does (this is a worse user experience). Or even worse, your application (or a library it is using) could have written to a bad pointer, overwriting the section of memory that contains the document the user is currently working on. The next save of the document will cause data corruption.

    Crashing is not the worst thing in the world – if you are hooked up to Watson (or some other Error reporting service), then you will find these bugs earlier in testing, or if they do make it out to the customer then you can fix these bugs next release (or in a service pack) – making the product actually better off for the user.

  4. Oleg Mihailik says:

    I like that one about balance. Absolutely!

    You know, you are working at Microsoft, so you can probably put pressure on the other developers who wrote your dependencies.

    On the other side, there are plenty of badly written libraries which we are just doomed to use sometimes. I cannot just crash when my dependency misbehaves. I cannot just blame my dependency when I talk to my customers. I have to work around it. Sometimes that means catching unknown exceptions despite having no idea how to recover.

  5. Norman Diamond says:

    Thank you very much for your understanding of the position of #6.  I hope more of your colleagues will understand it in the future.

    I’d like to suggest a few insertions between #5 and #6:

    Mainline scenarios work as expected, but bugs cause incorrect output (without adding damage to existing documents).  Examples would be Word which displays with wrong fonts, wrong section numbers, mispositioned images, etc.

    Program does what it should but also joins a botnet and pumps out spam.

    Program says it succeeded but it didn’t really.  Worst example would be a backup program that says it made a backup but it didn’t really.

  6. Programming says:

    Here's an interesting thought question from Mike Stall: what's worse than crashing? Mike provides

  7. Norman Diamond says:

    David ‘daqq’ Gustafik reminded us of a #7, worse than #6.  He posted it in a comment on Jeff Atwood’s blog at

    http://www.codinghorror.com/blog/archives/000924.html#comments

    > 7. Application crash/error causes harm or

    > physical damage (for instance through the

    > machinery it controls, or in medical devices,

    > or for instance a power outage)

  8. Mark Lindell says:

    From Pragmatic Programmer:  "Crash Early"

    Also..

    "Test your software or your users will"

    My twist is…

    "Test your exceptions, …your users will!"

    Exception testing is a must for all unit testing and integration testing.  It almost never happens.

  9. CoqBlog says:

    In the the-post-I-can-never-find-when-I-look-for-it series, the one by Mike Stall: There are