Crash dummies: Resilience


I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it’s better to crash and let Watson report the error than it is to catch the exception and try to correct it.



Eric Aside


A lot of people have been flipping out (see comments below) about the statement that you should catch the exception. The more thoughtful readers point out security concerns with handling exceptions and the dangers of continuing an application with corrupted state. I couldn’t agree more. If the failure or exception leaves the program compromised you can’t simply continue. My point is just failing and giving up is wrong for users. One solution I talk about below is to fail, report, and reboot (restart) the application, like Office now does.


Watson is the internal name for the functionality behind the Send Error Report dialog box you see when an application running on Microsoft Windows crashes. (Always send it; we truly pay attention.)


From a technical perspective, there is some sense to the strategy of allowing the crash to complete and get reported. It’s like the logic behind asserts—the moment you realize you are in a bad state, capture that state and abort. That way, when you are debugging later you’ll be as close as possible to the cause of the problem. If you don’t abort immediately, it’s often impossible to reconstruct the state and identify what went wrong. That’s why asserts are good, right? So, crashing is sensible, right?



Eric Aside


An assert is a programming construct that checks if a relationship the programmer believes should be true is actually true. If it isn’t true, assert implementations typically abort the program when debugging, and log an error when running in production. Asserts are commonly used to check that parameters to a function are properly formed and to check that object states are consistent.


Oh please. Asserts and crashing are so 1990s. If you’re still thinking that way, you need to shut off your Walkman and join the twenty-first century, unless you write software just for yourself and your old school buddies. These days, software isn’t expected to run only until its programmer got tired. It’s expected to run and keep running. Period.


Struggle against reality


Hold on, an old school developer, I’ll call him “Axl Rose,” wants to inject “reality” into the discussion. “Look,” says Axl, “you can’t just wish bad machine states away, and you can’t fix every bug no matter how late you party.” You’re right, Axl. While we need to design, test, and code our products and services as error-free as possible, there will always be bugs. What we in the new century have realized is that for many issues it’s not the bugs that are the problem—it’s how we respond to those bugs that matters.


Axl Rose responds to bugs by capturing data about them in hopes of identifying the cause. Enlightened engineers respond to bugs by expecting them, logging them, and making their software resilient to failure. Sure, we still want to fix the bugs we log because failures are costly to performance and impact the customer experience. However, cars, TVs, and networking fail all the time. They are just designed to be resilient to those failures so that crashes are rare.


Perhaps be less assertive


“But asserts are still good, right? Everyone says so,” says Axl. No. Asserts as they are implemented today are evil. They are evil. I mean it, evil. They cause programs to be fragile instead of resilient. They perpetuate the mindset that you respond to failure by giving up instead of rolling back and starting over.


We need to change how asserts act. Instead of aborting, asserts should log problems and then trigger a recovery. I repeat—keep the asserts, but change how they act. You still want asserts to detect failures early. What’s even more important is how you respond to those failures, including the ones that slip through.



Eric Aside


Just once more for emphasis—using asserts to detect problems early is good. Using asserts to avoid having to code against failures is bad.
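
To make "keep the asserts, but change how they act" concrete, here is a minimal sketch of a check that logs and then hands control to a recovery path instead of aborting. Everything named here (`RESILIENT_ASSERT`, `resilient_check`, `failure_log`, `CheckFailed`, `safe_divide`) is hypothetical and invented for illustration; the in-memory log stands in for whatever reporting channel a real product uses.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory log; a real product would report through its own channel.
inline std::vector<std::string>& failure_log() {
    static std::vector<std::string> log;
    return log;
}

// Thrown by a failed check so the enclosing action can roll back and recover.
struct CheckFailed : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Log the failed condition with its location, then signal the caller to recover.
inline void resilient_check(bool ok, const char* expr, const char* file, int line) {
    if (ok) return;
    std::string msg = std::string(file) + ":" + std::to_string(line) +
                      ": check failed: " + expr;
    failure_log().push_back(msg);
    throw CheckFailed(msg);
}

#define RESILIENT_ASSERT(cond) resilient_check((cond), #cond, __FILE__, __LINE__)

// Example action: falls back to a safe value instead of aborting the process.
inline int safe_divide(int numerator, int denominator, int fallback) {
    try {
        RESILIENT_ASSERT(denominator != 0);
        return numerator / denominator;
    } catch (const CheckFailed&) {
        return fallback;  // recovery path; the failure has already been logged
    }
}
```

The check still fires at the earliest point of detection, so the log entry stays close to the cause. What changes is only the response: the caller decides how to recover, and that decision only makes sense when the failed check has not left the program's state compromised.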


If at first you don’t succeed


So, how do you respond appropriately to failure? Well, how do you? I mean, in real life, how do you respond to failure? Do you give up and walk away? I doubt you made it through the Microsoft interview process if that was your attitude.


When you experience failure, you start over and try again. Ideally, you take notes about what went wrong and analyze them to improve, but usually that comes later. In the moment, you simply dust yourself off and give it another go.


For Web services, the approach is called the five Rs—retry, restart, reboot, reimage, and replace. Let’s break them down:


§  Retry. First off, you try the failed action again. Often something just goofed the first time and it will work the second time.


§  Restart. If retrying doesn’t work, often restarting does. For services, this often means rolling back and restarting a transaction, or unloading a DLL, reloading it, and performing the action again, the way Internet Information Services (IIS) does.


§  Reboot. If restarting doesn’t work, do what a user would do, and reboot the machine.


§  Reimage. If rebooting doesn’t work, do what support would do, and reimage the application or entire box.


§  Replace. If reimaging doesn’t do the trick, it’s time to get a new device.
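
The five Rs read naturally as an escalation ladder: try the cheapest remedy first and move up only when it fails. A minimal sketch, assuming each rung can be expressed as a callback that reports whether the action now succeeds (`RecoveryStep` and `escalate` are hypothetical names, not part of any real framework):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One rung of the ladder: a name for the log and a step that reports success.
struct RecoveryStep {
    std::string name;
    std::function<bool()> attempt;  // returns true if the action now succeeds
};

// Try each step in order and record what was tried. Returns the name of the
// step that worked, or an empty string if every step failed.
inline std::string escalate(const std::vector<RecoveryStep>& ladder,
                            std::vector<std::string>& log) {
    for (const RecoveryStep& step : ladder) {
        log.push_back("trying " + step.name);
        if (step.attempt()) {
            log.push_back(step.name + " succeeded");
            return step.name;
        }
    }
    log.push_back("all recovery steps failed");
    return std::string();  // nothing left but to replace
}
```

A caller would build the ladder once per kind of action (retry, restart, reboot, and so on) and keep the log, so that every escalation is still reported even when the user never sees a failure.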


Welcome to the jungle


Much of our software doesn’t run as a service in a datacenter, and contrary to what Google might have you believe, customers don’t want all software to depend on a service. For client software, the five Rs might seem irrelevant to you. Ah, to be so naïve and dismissive.


The five Rs apply just as well to client and application software on a PC and a phone. The key most engineers miss is defining the action, the scope of what gets retried or restarted.


On the Web it’s easier to identify—the action is usually a transaction to a database or a GET or POST to a page. For client and application software, you need to think more about what action the user or subsystem is attempting.


Well-designed software will have custom error handling at the end of each action, just like I talked about in my column “A tragedy of error handling” (which appears in Chapter 6 of my book). Having custom error handling after actions makes applying the five Rs much simpler.


Unfortunately, lots of throwback engineers, like Axl Rose, use a Routine for Error Central Handling (RECH) instead, as I described in the same column. If your code looks like Axl’s, you’ve got some work to do to separate out the actions, but it’s worth it if a few actions harbor most crashes and you aren’t able to fix the root cause.
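
One way to picture "error handling at the end of each action" is an action boundary that works on a copy of the state and commits only on success. This is a sketch under assumptions: `Document` and `run_action` are hypothetical, and copying the whole state is only practical when the state is small or cheap to snapshot.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical application state, for illustration only.
struct Document {
    std::vector<std::string> lines;
};

// Run one user-level action against a working copy. Commit only on success;
// on failure, log the error and leave the original state untouched.
inline bool run_action(Document& doc,
                       const std::function<void(Document&)>& action,
                       std::vector<std::string>& log) {
    Document working = doc;  // isolate changes until the action completes
    try {
        action(working);
    } catch (const std::exception& e) {
        log.push_back(std::string("action failed: ") + e.what());
        return false;  // the caller may now retry or restart the whole action
    }
    doc = std::move(working);  // commit
    return true;
}
```

Because the failed action never touched the real state, retrying or restarting it starts from a known good point, which is what makes the five Rs safe to apply at this scope.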


Just like starting over


Let’s check out some examples of applying the five Rs to client and application software:


§  Retry. PCs and devices are a bit more predictable than Web services, so failed operations will likely fail again. However, retrying works for issues that fail sporadically, like network connectivity or data contention. So, when saving a file, rather than blocking for what seems like an eternity and then failing, try blocking for a short timeout and then try again—a better result for the same time or less. Doing so asynchronously unblocks the user entirely and is even better, but it might be tricky.



Eric Aside


Care should be taken when retrying an action. Some APIs and components already have retries built into them. Be sure to understand the behavior of components you use in advance or suffer from compounding repetition caused by leaky abstraction.


§  Restart. What can you restart at the client level? How about device drivers, database connections, OLE objects, DLL loads, network connections, worker threads, dialogs, services, and resource handles? Of course, blindly restarting the components you depend upon is silly. You have to consider the kind of failure, and you need to restart the full action to ensure that you don’t confuse state. Yes, it’s not trivial. What kills me is that as a sophisticated user, restarting components is exactly what I do to fix half the problems I encounter. Why can’t the code do the same? Why is the code so inept? Wait for it, the answer will come to you.


§  Reboot. If restarting components doesn’t work or isn’t possible due to a serious failure, you need to restart the client or application itself—a reboot. Most of the Office applications do this automatically now. They even recover most of their state as a bonus. There are some phone and game applications that purposefully freeze the screen and reboot the application or device in order to recover (works only for fast reboots).


§  Reimage. If rebooting the application doesn’t work, what does product support tell you to do? Reinstall the software. Yes, this is an extreme measure, but these days installs and repairs are entirely programmable for most applications, often at a component level. You’ll likely need to involve the user and might even have to check online for a fix. But if you’re expecting the user to do it, then you should do it.


§  Replace. This is where we lose. If our software fails to correct the problem, the customer has few choices left. These days, with competitors aching to steal away our business, let’s hope we’ve tried all the other options first.
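
The client-side retry described above (short timeouts, then try again) might look like the following sketch. The doubling delay between attempts is my assumption rather than something the column prescribes, and `retry_with_backoff` is a hypothetical helper. The attempt count is bounded on purpose, because the component underneath may already retry on its own.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Retry an operation that may fail sporadically. `attempt` is a hypothetical
// callback that returns true on success; it is expected to use its own short
// timeout rather than block for a long time.
inline bool retry_with_backoff(const std::function<bool()>& attempt,
                               int max_attempts,
                               std::chrono::milliseconds initial_delay) {
    std::chrono::milliseconds delay = initial_delay;
    for (int i = 1; i <= max_attempts; ++i) {
        if (attempt()) return true;
        if (i < max_attempts) {
            std::this_thread::sleep_for(delay);  // brief pause before the next try
            delay *= 2;                          // assumed policy: double the pause each time
        }
    }
    return false;  // give up; the caller should escalate, for example by restarting
}
```

When the function returns false, the sporadic-failure theory has been exhausted and the next rung (restarting the component and the full action) is the reasonable move.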


Let’s not be hasty


Mr. Rose has another question, “Wait, we can’t just unilaterally take these actions. Customers must be alerted and give permission, right?” Well Axl, that depends.


Certainly, there are cases where the customer must provide increased privileges to restart certain subsystems or repair installs. There are also cases when an action could be time consuming or have undesirable side effects. However, most actions are clear, quick, and solve the problem without requiring user intervention. Regardless, the key word here is “action.”


There’s no point in alerting the user about anything unless it’s actionable. That goes for all messages. What’s the point of telling me an operation failed if there’s no action I can take to fix it or prevent it from happening again? Why not just tell me to put an axe through the screen? If there is a constructive action I can take, why doesn’t the code just take it? And we have the audacity at times to think the customer is dumb? Unbelievable.


It’s always the same


“Fine, this is extra work though,” complains Axl, “and who says the software won’t just be retrying, restarting, rebooting, and reimaging all the time? After all, if the bug happened once, it will happen again.” Actually Axl, bugs come in two flavors—repeatable and random. Some people call these Bohrbugs and Heisenbugs, respectively.



Eric Aside


The terms Bohrbug and Heisenbug date back before the 1990s. Jim Gray talked about them in a 1985 paper, “Why Do Computers Stop and What Can Be Done About It?”


Using the five Rs will resolve random bugs, rendering them almost harmless. However, repeatable bugs will repeat, which is why logging these issues is so important. Even if the program or service doesn’t crash, we still want the failure reported so we can recognize and repair the repeatable bugs, and perhaps even pin down the random bugs. The good news is that the nastiest bugs in this model, the repeatable ones, are by far the easiest to fix.
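
Logging every recovery is what lets you tell the two flavors apart later. As a rough sketch, a ledger can count failures per site and flag the ones that keep recurring. `FailureLedger`, the site names, and the threshold of three are all assumptions made for illustration; a count is only a crude heuristic, not a diagnosis.

```cpp
#include <cassert>
#include <map>
#include <string>

// Counts how often each failure site triggered a recovery.
class FailureLedger {
public:
    // Record one failure at a named site (site names are illustrative).
    void record(const std::string& site) { ++counts_[site]; }

    // Number of failures recorded for a site; zero if it was never seen.
    int count(const std::string& site) const {
        std::map<std::string, int>::const_iterator it = counts_.find(site);
        return it == counts_.end() ? 0 : it->second;
    }

    // Heuristic only: a site that failed at least `threshold` times looks repeatable.
    bool looks_repeatable(const std::string& site, int threshold = 3) const {
        return count(site) >= threshold;
    }

private:
    std::map<std::string, int> counts_;
};
```

Sites that cross the threshold are the first candidates for a root-cause fix, while one-off entries are the ones the five Rs have most likely already absorbed.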


By putting in some extra work, we can make our software resilient to failure even if it isn’t bug-free. It will just appear to be bug-free and continue operating that way indefinitely. All it takes is a change in how we think about errors—expecting, logging, and handling them instead of catching them. Isn’t that worth the kudos (and free time) you’ll get from family and friends when our software just works? Welcome to the new world.



Eric Aside


I don’t expect this new approach to happen tomorrow. It’s a big change, particularly in the client and application areas. It used to be that only geeks had computers, so users knew how to restart and repair drivers. Now, everything just has to work with little or no user intervention. Part of the solution is higher engineering quality, but that only goes so far. There will always be failures even if the code is bug-free. Resilience to failure is the clear next step.

Comments (36)

  1. Matt Thalman says:

    The problem is that taking action to recover automatically might be hiding a bug that could have been found during development if the classic Watson crash had occurred.  So during development, it’s important to have some sort of logging when one of these five Rs occurs.  Tests should verify the existence or absence of any log entries.

  2. I just ran into this post by Eric Brechner who is the director of Microsoft’s Engineering Excellence

  3. asymtote says:

    Wow, this is a classic piece of astronaut architect thinking. If an exception occurs that the application does not fully understand how can it safely do anything other than terminate?

    When the truly unexpected happens all you know is that your internal state is inconsistent. Sticking your fingers in your ears, shouting "LALALALALA" and trying the same thing again could easily result in disaster. If I jumped in the deep end of the pool and found I couldn’t swim but I didn’t completely drown you’d presumably tell me to have another go and see what happens.

    Terminating the app and restarting is the only way to get back into a consistent state.

    I’d rather be a "throwback" engineer and ship real software than a Fanciful Phil up in the stratosphere.

  4. Experienced developers everywhere says:

    An asinine comment like this from the "director of engineering learning and development for Microsoft Corporation".  That’s really scary!

  5. Axl says:

    I couldn’t disagree more with this particular column regarding assertions.

    It is like the author hasn’t seen an assertion for years and makes statements about it based on a good motivation that Resilience is.

    If you assert that something should not be null because it is instantiated somewhere else, for instance, the 5 R’s don’t seem to apply.

    An assertion is an improved comment. You usually have it specifically not to deal with a situation you don’t expect, so you don’t raise complexity/diminish test coverage. Placing an assertion without code to explicitly deal with it is the point of assertions.

    The principle of only dealing with situations you know about seems much safer than the overall planning-for-the-unknown approach.

    Resilience and the 5 R’s themselves are not a bad idea, but applying them to assertions is not the best approach. Maybe a few well-picked exception handlers might be a better spot to discuss resilience.

  6. B says:

    There is a distinction to be made between:

    a) I got an exception from some call I made, in which case:

    a.1) It is an expected result that I can wrap and provide alternatives or palliatives for it (try again, inform the user, try something else, etc).  This is the exception (pun!).

    a.2) You have no clue or there really is nothing that can be done … crashing seems to be the proper solution.  Restarting the app and possibly recovering state is a very nice-to-have and goes a long way to manage perceptions, but ultimately, you crash the app.

    b) The input from consumer code or output from dependent code does not meet consistency requirements (In/Out validation) … you’re hosed … this is a hard-core code defect.  There usually is little point in retrying faulty routines (especially with the same state), they tend to be faulty 🙂  Ignoring validation / consistency rules is a sure recipe for disaster including (but not limited to) security threats and irrecoverable data corruption.  Crashing seems wise and declaring the sub-feature unusable a probable conclusion (until fixed).

  7. Kinshuman says:

    A very sad post. I hope people do not follow the suggestion of trying to continue when in known bad states. At least one security vulnerability has been exploited because of a component’s desire to do so.

  8. sedmison says:

    At the risk of making a generalization here, I would say that it’s really hard to make generalizations about errors.

    Sure, there are errors from which recovery is possible, or even expected.  Yes, encountering a network glitch might be intermittent, and retrying the operation that failed because of that glitch is probably a good thing for the user (much better than just bailing).

    On the other hand, there are errors that truly represent catastrophic and unrecoverable failure.  If your app just jumped off into random memory, every instruction that you execute could now be completely random, could be further corrupting memory, or worse could even have been injected by an attacker and could now be doing his bidding to install a rootkit.  By definition (or simple tautology), if your code has an error that it doesn’t know how to handle, it can’t possibly handle that error.  If your code has a bug in its error handling, then the error handling code itself cannot handle the bug in the error handling code.  Similarly, if your code itself has been corrupted in memory or on disk, your code can’t trust itself to do any kind of recovery, and the best and only thing it should try to do is EXIT, POST HASTE!  Flailing around running more code at the point of failure could very well be making things worse.

    There has to be some halting-problem-style argument to be made here, but even without it, I would hope that everyone can see the obviousness of the paradox that one thing your code can’t code against is an error in its handling of errors.  And since code lives in components (the CPU and RAM) that are subject to corruption by other code, manufacturing defects, voltage spikes, heck, even cosmic rays, even code that was written to be bug-free can find itself in an unexpected state if the code itself is somehow damaged.

    Now, I would agree that programmers in general and Win32 in particular don’t always make these kinds of distinctions, and different functions (or even the same function, by virtue of handing its callees’ return codes to its callers) can use the same error codes to represent recoverable AND non-recoverable error conditions.  So I do agree that there are plenty of places where being more robust against errors is a good thing, and programmers could benefit from looking through your five R’s.  I would also agree that there are places where programmers use assertions when they really should have had error handling and resilience instead.  Those practices should be stopped.  We should all do better, more careful thinking about what we return from our APIs, and what we should do with the return codes from the APIs that we call.

    HOWEVER, with that said, I still go back to the main point that there are errors and states that a program just is not prepared to handle, and it is better off in those situations to just concede defeat and phone home (via Watson) for help.  It’s criminally irresponsible to try to plow ahead despite evidence of catastrophic error, memory or stack corruption, etc.

  9. theelvez says:

    Another point here is that you are at the mercy of your environment. This advice will never work in kernel mode, for instance. You can’t handle unexpected exceptions in kernel mode – they always lead to a BSOD. I always advise that you crash early and fix the problem. The other thing to consider is that, although your component may keep chugging along when you swallow exceptions, you may be worsening the environment for the other components that co-exist with it (e.g., by orphaning locks, leaking memory, or corrupting data).

  10. ouch says:

    This post hurts my brain.  Please drop down to being a dev for a while and let someone with enough distinguished experience take the reins of this role.

  11. Provocazioni, gestione delle eccezioni, e usabilit

  12. Adrian says:

    I think people are reacting to the provocativeness of the opening rather than to what Eric’s actually saying.

    There are errors you understand (which may or may not be signaled as exceptions).  For example, a network glitch or a missing file.  There are others that you cannot understand (often signaled as exceptions).  For example, the process heap is corrupted.

    Retrying is a perfectly good way to handle possibly transient errors that you understand.

    Restart, Reboot, Reimage, and Replace are ways to get back to a known good state when you encounter  errors that indicate you’re in an invalid state.

    Crashing and getting a Watson report will help you find and fix bugs.  But logging and working through the five Rs will help you find and fix bugs AND keep your system up and running.

    If you’re running a mission critical service, you can’t wait for your ops people to restart a process or a server.  You have to notify them that there’s a problem and then restart automatically.

    If you’re running a life-support machine, you don’t just crash and hope the doctor or nurse notices.  You sound the alarm AND restart automatically.

  13. ericbrec says:

    I’m glad people find this latest blog so provocative. As usual, that’s the point.

    I am getting some common misunderstandings I should clarify.

    * I strongly believe in asserts. The issue is using asserts as a crutch and not actually handling errors.

    * My point overall is to recover on behalf of the user whenever possible. If the failure is severe, you’ll need to reboot (restart) the application, logging the fault in the process the way Office does. If the failure is relatively minor you can retry an action or restart a component. We don’t do that enough, and the user is left to figure it out.

  14. My boss is taking a little bit of heat for his latest blog post . The entire point of his blog is to

  15. sedmison says:

    Okay, but you claim up above that the way asserts are implemented is evil, and that’s what I was reacting to.  Overusing asserts is evil.  Using asserts when you intended error checking is evil.  For instance, this

    HRESULT hr = SomeOperation();

    assert( SUCCEEDED( hr ) );

    is evil if SomeOperation can return errors that are recoverable.  Instead, the programmer should have had some error checking on hr to do better recovery.

    But you can’t make a claim like "asserts are evil" and then wonder why people react.  If your argument was that using asserts as a crutch is evil, you should have spelled that out rather than starting from the premise that asserts themselves are evil.

  16. MSDNArchive says:

    Sean: you’re exactly right.  Too often I’ve seen this pattern:

    pObj = CreateSomeObject();

    ASSERT(NULL != pObj);

    pObj->DoSomething();

    which I claim is also evil, even though it’s a different shade of grey.  This isn’t an internal contract violation.  It isn’t even an unknown state.  I failed to obtain the object I wanted to obtain.  The reasons can vary; depending on the object, it could simply be an out of memory situation.  Too many programmers say "Out of memory?  All bets are off."  I say, this is a failure that should be built into your failure model.  Figure out what you want your code to do in this situation.  Graceful degradation is certainly an option.

  17. Steve says:

    Asserts should only be used for conditions that can never happen by design.  They must never be used for errors that can be handled.  Never! This article implies that asserts can be used for detecting real error conditions.  That is wrong.

  18. MSDNArchive says:

    So we agree that crashing is better than not crashing in cases where not crashing can lead to data loss, security exposures, etc.  Here’s a novel idea: let’s share that information with customers!  A few extra words in the crash dialog can do wonders for the customer experience.  "We are sorry for the inconvenience, but in order to protect your data and keep you secure, we need to close this program."  I think customers can understand that!

  19. /dev/null says:

    There’s been something of a brouhaha amongst the Microsoft technical bloggers in the last few days about this post by Eric Brechner, the director of the Engineering Excellence group at Microsoft. The short version is that asserts (hah) that asserts..

  20. MSDNArchive says:

    /dev/null gets it.  He points out (in the post linked above) that even though your code may hit a condition it was not designed for (and that you assert), you must still handle it in production.  The key is to achieve two separate but equally important goals: (1) provide a tight, strong feedback loop to developers, and (2) provide a good & resilient customer experience.

    Now I’m going to go a step further than Eric.  (I’ll take that liberty because Eric’s post was inspired, in part, by a conversation I had with him.)  I’m going to claim (note how I deftly avoided saying “I’m going to assert”) that Asserts, as commonly implemented, should be phased out entirely in favor of lightweight retail tracing with phone-home capability.

    First, let me state my assumptions:

    1. That a product is built with two flavors, retail & debug (or fre & chk).  And customers get the retail flavor.

    2. That the Assert action in a debug build is DebugBreak (int 3) (halt).

    3. That the Assert action in a retail build is no-op (i.e., not only is no action taken, but the validity check doesn’t exist either).

    4. That Asserts are coded “correctly”, in other words, only to detect “should not happen” states (states of internal inconsistency).

    5. And, of course, the most interesting and controversial assumption: that lab testing can never recreate all the conditions and codepaths that your code will experience in the wild.

    Given these assumptions, here’s my logic.

    1. Because of assumption #5, your customers will encounter conditions that didn’t occur in lab testing.

    2. Because of assumptions #1, #3 and #4, the code in your customers’ hands is not checking for the conditions it wasn’t designed for (assert condition), nor is it taking action (assert action).

    3. Because of assumption #2, when you’re running your own code in the debugger, the debugger stops when it hits an assert condition.  This makes it difficult, at best, to correlate the assert condition with its downstream effects (unless you’re really good at continuing past the assert and systematically observing & recording behavior).

    As mentioned above, what I favor instead of the commonly implemented Assert is lightweight retail tracing with (opt-in) phone-home capability.

    1. This serves the key goal #1 above by providing a tight, strong feedback loop with how code actually behaves in the wild.

    2. It also serves the key goal #2 above because the retail code checks for the interesting conditions, which forces developers to think about how to handle them.

    3. I propose this for a third reason as well: it creates less of a schism between the behavior of the retail flavor of products and the debug flavor of products.

  21. alanpa says:

    ummm…didn’t I read every bit of this 15 years ago in Writing Solid Code?

    Let me recap.

    Use assert liberally in DEBUG code. When something goes wrong, crash now. This is defensive coding 101. Asserts don’t happen in retail code – duh.

    *Handle* the errors in retail code. This is duh #2. For example (bad ex, but it’s common and short):

    p = malloc(…);
    ASSERT(p);
    if (p)
    {
       *p = 1;
       // etc.
    }
    else
    {
       // oh look at me, I’m so CLEVER
       // I caught an error
       // I can haz cheezberger?
       LogError("memory allocation in function blah…");
    }

    The type of logging described in the above comment has been around for at least 15 years (length of my professional coding career) – I suppose the big difference today is that the data gets logged to the event viewer instead of a text file, but that’s certainly nothing new.

    What do MS developers do when they hit an assert? It sounds like they just give up. Come on – any decent developer should be able to determine the downstream effects. Are you saying that only the "really good" developers at MS can do this?

    Seriously – wow.

  22. MSDNArchive says:

    alanpa: your post proves that even smart people can miss the point.

    Asserting the result of a "malloc" is a good example of an over-used assert.  If you’re doing any automated runs of debug code, your code will halt before it reaches any of the recovery code, which also means that none of the *real* (inconsistent state) asserts will ever be reached during that test run.

  23. Arye Gittelman says:

    I see a number of comments noting that people throw exceptions when they are not sure of their state.

    The trick is to isolate the potential state change until you’re sure that the change can be made before committing. This does require significantly different design work, but is very achievable.

  24. alanpa says:

    I think you’ve made the point that I was most concerned about. I used the malloc example because it was concise – not necessarily good.

    Regardless of how good your error handling is, a well-written assert allows you to break at a place where you can step through the code. To me, this is fundamental. All right, this is a bad assert example (it’s a code coverage trap example), but the point is still valid. On DEBUG builds, I still believe "crash fast" is the answer. I also believe that if you don’t "crash fast" once on the debug build in order to understand how your error handling code should work, you will get it wrong.

    In a perfect world, you will write fantastic error handling code that does all the right things. In practice, I see a huge proportion of defects per line of code in error handling – because it’s never tested.

  25. MSDNArchive says:

    Asserts to break and step through code?  Dude, that’s what breakpoints are for.  Okay, you might do this on a developer’s workstation, but not in checked-in code.

    Halting asserts are not the way to increase test coverage of error handling.  Mock objects, failure injection, logs and traces are.

  26. Hi All,

    Although I am well known for entertaining the world through the past decades, my passion for software is not that well known. In fact, I have been using a kernel debugger and doing hard code debugging ever since the Jackson 5 days.

    It was a real OFF THE WALL POST.  I really think that the suggestions here, without proper context and a description of the applicability model, are very DANGEROUS.

    For all those who have raised the concerns with the sanity aspects of this post, I am going to really ROCK WITH YOU. Even BILLIE JEAN will agree that some of the suggestions here are really BAD.

    When I first read the post I wanted to find out WHO IS IT, that is so much of a moron? When I found out that it is the director of engineering excellence and his people in his team who are supposed to be increasing the quality of software and championing best practice, What a THRILLER!!.

    I know that people with not much else to do WANNA BE STARTIN’ SOMETHIN’, but making such comments and then saying that the intent of the post was to court controversy just makes you a SMOOTH CRIMINAL.

    To all of you out there, do not follow the suggestion of trying to run around when you are not in a safe, known state. Crash and make use of the RegisterApplicationRestart API to safely start in a known state. For people that want to hobble around claiming that they want to run around when they should be dead, tell them BEAT IT!!

  27. DaiQian Huang says:

    This is a very good post.

    Today we are in massive coordinated development mode, rather than individual development mode.  Asserts (halt) could be handy to you from your perspective, but they could be annoying or evil to me from my perspective. The question really is whether asserts are handy or not to the majority. I believe the answer is no. Plus, asserts are only handy *to me* during the development phase; once a feature is done and ready to check in, asserts have no value *to me*.

    I believe software should be resilient against failures from other components and attacks from hackers, for example, timeouts, invalid requests, etc. Asserting on failures from other components is bad.  I don’t believe software should be resilient against code defects (bugs). Logging and crashing are both good methods to find code defects.  In some cases, it is much easier to find the cause of a problem with a crash dump; in other cases, logging is better.

  28. MSDNArchive says:

    DaiQian brings an important perspective to the discussion. I’ve heard some legitimate arguments for why halting asserts are preferred during the development cycle:

    1. Halts at the point of the assert, enabling live debugging, and

    2. Halting asserts are painful. They block progress, forcing developer attention.

    These made sense at one time, but they are outweighed now:

    1. As DaiQian points out above, we (e.g. in Windows) are in a massive coordinated and *distributed* development cycle. It’s rare any longer that when Tom’s assert fires I can grab Tom down the hall and have him look at it. Such an assert ends up blocking far more people than it helps.

    2. In conjunction with #1, though, is the assumption that breaking into the debugger is the most efficient way to capture state. That is just not true any longer. With the ability to snap a dump at any time, plus advanced tools like Time Travel Tracing (http://research.microsoft.com/users/Manuvir/papers/instruction_level_tracing_VEE06.pdf?0sr=a) (or your own variety of instruction tracing), there are not only more efficient ways to capture state, but the results are more debuggable as well.

    Fundamentally though I make a counterintuitive argument: that halting asserts diminish the expert’s ability to diagnose software disease at the customer’s shop.  An expert diagnostician learns to correlate symptoms (manifestation) with disease (root cause) and to build that correlation through observation.  Thus this expert needs to be able to observe both the disease and the symptom as often as possible.  Logging asserts provide that ability far more than halting asserts.
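
    A logging assert of the kind contrasted with halting asserts here might look like the following sketch. The names `logging_assert` and `average` are hypothetical, and returning 0.0 for an empty input is just one possible fallback, not a recommendation from the discussion.

```python
import logging
import traceback

log = logging.getLogger("asserts")

def logging_assert(condition, message):
    """Record a failed expectation with its call stack, then keep running."""
    if not condition:
        # Capture where the expectation failed so it can be diagnosed later.
        stack = "".join(traceback.format_stack(limit=5))
        log.error("ASSERT FAILED: %s\n%s", message, stack)
    return bool(condition)

def average(values):
    if not logging_assert(len(values) > 0, "average() called with no values"):
        return 0.0  # assumed fallback; the process continues instead of halting
    return sum(values) / len(values)

empty_result = average([])
normal_result = average([2, 4])
```

    Each failure leaves a record of the broken expectation, while the process stays alive so that later symptoms can be observed alongside it.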

  29. jporkka says:

    I’m late to the party, but…

    If the intent of this article is to be provocative and get people talking, maybe that is a good thing.

    The content though, I think is pretty misleading at best.

    There is a huge distinction between SEH exceptions and C++ exceptions that has often been missed in the past, continues to be confused in .NET languages, and is also confused in this article and the "tradey" article that is referred to.

    Second, assertions: There are plenty of bad assertions in the world.

    Certainly though I think most would agree that asserting against run-time failures is a bad approach to error handling. Should you assert that malloc returns non-null, or that CreateFile() succeeded? Clearly, no.

    Asserts are all about logic errors. A way to say something very loudly and clearly about what the expected internal state of the program is. If an assertion fails, it means at the very least that something has happened that the programmer expects should not happen. By definition this means the program is in an ill-defined state.

    The ones that never fail are incredibly valuable too … they let programmers know what the expectations are for a given piece of code. For this they are vastly better than comments, since the programmer can have a level of confidence that the assertion speaks the truth and hasn’t been out of date since the first time the code was checked in.

    Perhaps the best approach is actually dependent on your environment. What’s the right way to recover from errors for MSWord isn’t likely to be the same as for Shuttle flight control systems, or the latest Xbox game.

    What should an assertion do? It should point out that a bug exists. How best to do that depends a lot on the environment that the software is running in.

  30. Swanny says:

    My opinion of Microsoft has gone up considerably since I found out they let this guy run around like this.

    And I’m surprised just how many people don’t seem to get it. Or maybe I shouldn’t be…

    I walked into a multi-million dollar project once (Aussie dollars which were much smaller than USA dollars back then but still an expensive project). It was a Java project, but could have equally been a C# project (except that C# hadn’t been invented yet). They had a try/catch around every method, I kid you not, every method. But the real issue was that all each catch ever did was write the exception to a log file and then return. That is, the program did not abort, nor did the user get any feedback that something had gone wrong, unless of course they then went and looked up the log file. Every method was written like this, but not one of the so-called experts already involved could see what was wrong with it.

    While trying to explain to them what was wrong with this approach I came up with the slogan "The minimum an exception handler must do is ‘Report and Abort’". That is, tell the user, operator or whoever needs to know that something went wrong, and don’t try to continue.

    But this was the "minimum" I expected. Dr Watson (is it still called that?) is therefore the minimum behaviour I would expect.

    I then tried to explain how you don’t need a try/catch on every method… well I guess I was feeling lucky that day.

    The point (or at least one of the points) Eric is trying to make is that we can do much better.

    For example, suppose we are processing a batch of 1,000,000 independent records, and during the processing of the 23rd record we get an exception we weren’t expecting. Why not report that record with the error and then move on to trying to process the next record? That seems like very sane handling of the exception to me, and I don’t have to pull the ejection handle and lose the jet just yet. Perhaps as a sanity check I allow up to 20 bad records, or 1% bad records, and if I exceed this limit then I abort. But I don’t just give up because I hit my first unexpected exception, and if they can be processed, the remaining records still get done.
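
    A rough sketch of that batch strategy follows, assuming the records really are independent. The names `process_batch` and `parse_amount` are made up for illustration; the limit of 20 bad records comes from the paragraph above.

```python
def process_batch(records, process, max_bad=20):
    """Process independent records; report failures, abort only past a limit."""
    done, bad = [], []
    for index, record in enumerate(records):
        try:
            done.append(process(record))
        except Exception as exc:
            # Report this record and move on; the others do not depend on it.
            bad.append((index, record, repr(exc)))
            if len(bad) > max_bad:
                raise RuntimeError(
                    f"aborting: more than {max_bad} bad records") from exc
    return done, bad

def parse_amount(text):
    # Stand-in for real per-record work; raises ValueError on bad input.
    return int(text)

done, bad = process_batch(["10", "20", "oops", "40"], parse_amount)
```

    The bad record is reported with its position, the remaining records still get processed, and the run only aborts once the error count passes the limit.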

    Let’s take a user interface example where "report and abort" is the poorer user experience. Suppose I had a bad day and forgot to set the maximum length of a text box to 20 characters, and unlucky me, it gets all the way through test to production, and along comes a user who just entered all the details for Mr Isiabanawantabemilleci, hits OK and somewhere in there an exception is thrown. Tough luck user, we are just reporting and aborting here. Perhaps Mr Isia… is on the phone to the user as they are entering the data. "I’m sorry Mr Isi…sir, I’ll just need to get all those details off you again please….thank you…oh no it crashed again". Even if the exception message is not informative, if the application were left so the user could still work with it, they could capture the screen or maybe cut and paste some or all of the field values to notepad so they don’t have to retype it all later. If, as I prefer, the application has good contract checking (not just on debug builds), then the user might get a friendly exception like: "Contract Violation: Surname must be from 1 to 20 characters". In that case the user could shorten the surname until something better could be done.
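
    A contract check that produces that kind of friendly message could be sketched as below. `ContractViolation`, `require_length`, and `save_customer` are hypothetical names; only the message text and the 1 to 20 character limit come from the example above.

```python
class ContractViolation(ValueError):
    """Raised when input breaks a stated contract; the message is user-facing."""

def require_length(field, value, low, high):
    # This check stays on in production builds, not just in debug builds.
    if not (low <= len(value) <= high):
        raise ContractViolation(
            f"Contract Violation: {field} must be from {low} to {high} characters")
    return value

def save_customer(surname):
    require_length("Surname", surname, 1, 20)
    return {"surname": surname}

try:
    save_customer("Isiabanawantabemilleci")
    message = None
except ContractViolation as exc:
    message = str(exc)  # shown to the user, who can shorten the surname and retry
```

    Because the form is still alive when the message appears, the user keeps everything already typed.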

    The real reason I’ve found that most people don’t handle exceptions in Win Forms apps is that it’s just too tedious writing a try/catch on every event that the user can launch (every button click, form load, on change event, etc).

    However, since .Net 2.0 we have had the Application.ThreadException event. This is a great place to put a central dialog that apologises to the user, explains that the current action could not be completed because of this exception message, and advises them on their options. Then you only need try/finally around any clean up stuff, and try/catch only where you need to catch a specific exception to do some specific handling.
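
    The same pattern can be sketched outside WinForms. The toy dispatcher below is an analogy in Python, not the Application.ThreadException API itself; `EventLoop`, `apologise`, and the event names are all invented for illustration.

```python
class EventLoop:
    """Toy event dispatcher with one central place for unhandled exceptions."""

    def __init__(self, on_error):
        self.on_error = on_error
        self.handlers = {}

    def on(self, event, handler):
        self.handlers[event] = handler

    def fire(self, event, *args):
        try:
            return self.handlers[event](*args)
        except Exception as exc:
            # The single catch-all, instead of a try/catch in every handler.
            self.on_error(event, exc)
            return None

shown = []

def apologise(event, exc):
    # Stand-in for the central dialog that explains what could not be done.
    shown.append(f"Sorry, '{event}' could not be completed: {exc}")

loop = EventLoop(apologise)
loop.on("save_click", lambda: 1 / 0)       # a handler with a bug in it
loop.on("load_click", lambda: "loaded")
loop.fire("save_click")
after = loop.fire("load_click")            # the application is still usable
```

    Individual handlers then need try/finally only for cleanup, and try/catch only where a specific exception needs specific handling.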

    This is not that hard to do.

    And of course there is the report, abort and let’s see what we can do to help the user now action, such as when Word tries to recover a file for me. I find this feature very nice in Word, though I admit I’d find Word not crashing an even nicer feature. They must have used human programmers to write that thing.

    I’m actually for leaving contract checking (or asserts) in the production code. This is probably a different point to what Eric was making, but the idea that I check for unusual events only in my debug builds seems absurd. So what if I waste a million CPU cycles every second in this checking? If I had a machine that measured its processing in megahertz I might be worried. And if you’re worried about your CPU cycles then why aren’t you programming in assembler (apologies to all the assembler coders out there)? To me, at least for business software, the ability to detect a problem as early as I can, and thus help conserve my limited resources such as developer/support time, is far more useful than my ability to save a tiny portion of a resource I have plenty of.

    Anyway, great job Eric. The point is still just as valid now as it was so many years ago. And a fun column too.

  31. russellh says:

    "What’s the point of telling me an operation failed if there’s no action I can take to fix it or prevent it from happening again? Why not just tell me to put an axe through the screen? If there is a constructive action I can take, why doesn’t the code just take it? And we have the audacity at times to think the customer is dumb? Unbelievable."

    If an application shows a user a graph that may be wrong because of an unanticipated exception, but for whatever reason the procedure generating the graph won’t stop throwing some unanticipated exception, don’t you want to tell the user there is something wrong with the graph? Maybe you don’t show the graph at all, but tell the user you couldn’t?

  32. Will Lees says:

    Way late to the party on this discussion. You will notice that the five R’s are successive levels of throwing away the dynamic in-memory scope/context you are at. Sort of like hitting your head with progressively larger hammers until you’ve forgotten enough to move forward again with your life. Do-over.

    The way I see it, this is a trade-off between continuous processing style and fast-restart style. It depends on who’s willing to pay the price to get messy. Take Windows for example. Vista/w2k8 are ‘current’, and it’s in demand to live-debug a running system to patch bugs and get the system to run continuously without restarts. Now consider nt4 or w2k. They are good enough the way they are (the way they became at the level of expectation at the time). People still want to use these versions, but they don’t pay for live debug anymore. To improve resilience in old code, the easiest way is to (re-)start, throwing away state, and re-image. That’s why VMs are used.

    It really depends on the hard-core-ness of the problem space. That’s the difference between applications and systems. For a web-app, it’s largely stateless and it’s easier for that level of dev audience to restart, rather than comprehend complex combinatorial state spaces. But for drivers say, the hardware/firmware and disk structures persist in a bad state – we can’t wish them away. That’s why we get paid the big bucks. There are certain levels of criticality in low level systems where restart is not an option.

    Sometimes I wonder why our job is never easy. We never get the easy problems. It’s always something about sharing memory – like combining ten formerly separate processes into one. We in the os/systems business can’t afford garbage collection. We share memory and manage memory and alias pointers and do all the hard tricks necessary to run in 2mb. It’s about costly separation. It’s easy and cheap-man-power to pull systems apart into separate transactions, separate threads, separate processes, separate os instances. But it’s wasteful in time and resources. But for mission critical systems on old hardware – it can’t be separated, it can’t context switch, and it can’t restart without losing everything.

    I’m just saying you have to look at the economics of stateful shared memory continuous systems. There are two kinds of platforms: small and singular, or large and separate. Think processes. If you can get away with a design which puts all parts in separate processes (good enough perf and cost), why wouldn’t you? BUT, there are these certain key scenarios, where it can pay off to invest in embedded, shared-memory, real-time. Like the first person to make win95 run in real-mode. We make our living and our magic in these tight persistent spaces. So we have to choose wisely when to create these fragile hardcore components.

    Today the hardware/memory is abundant enough that we can waste cycles on garbage collection and byte-codes and interpretation – wimping out on separation of memory because it’s rapid development, easier to get people, and it’s good enough. It’s like programming in managed code. The economics of managed code make this restart-thinking viable for more classes of problems. It’s foolish not to sell automatic transmission when it’s good enough. BUT, there are still segments of the industry, drag racing and older vehicles and farming and trucks, where shared memory and close quarters are required.

  33. You’ve seen the advice before: it’s not a good programming practice to catch System.Exception. Because

  34. “I heard a remark the other day that seemed stupid on the surface, but when I really thought about it

  35. Newbie says:

    Help me design this:

    Open a file on a network share

    Start reading/writing that file

    Oops – network glitch – ReadFile() returns error 64 (or similar)

    The File Handle is now hosed.

    How do I recover from this? Half the file is updated. Why should a small network glitch kill my File Handle (and my process?)
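
    One possible direction for the read side, offered as a sketch under assumptions rather than an answer: treat the handle as disposable, remember the last good offset, and reopen. `read_all_with_reopen` and `FlakyFile` are hypothetical, the glitch is simulated in memory, and nothing here addresses the half-written file.

```python
import io

def read_all_with_reopen(open_file, chunk_size=4, max_reopens=3):
    """Read a whole file, reopening at the last good offset after an OSError."""
    data = bytearray()
    reopens = 0
    handle = open_file()
    try:
        while True:
            try:
                chunk = handle.read(chunk_size)
            except OSError:
                reopens += 1
                if reopens > max_reopens:
                    raise              # give up and let the caller report it
                handle.close()         # the old handle is considered dead
                handle = open_file()
                handle.seek(len(data)) # resume where the last good read ended
                continue
            if not chunk:
                return bytes(data)
            data += chunk
    finally:
        handle.close()

class FlakyFile(io.BytesIO):
    """In-memory stand-in for a network file that fails once mid-read."""
    failures = 1                       # shared across reopened handles

    def read(self, size=-1):
        if FlakyFile.failures and self.tell() >= 4:
            FlakyFile.failures -= 1
            raise OSError("simulated network glitch")
        return super().read(size)

result = read_all_with_reopen(lambda: FlakyFile(b"hello world"))
```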

  36. Rüdiger Stevens says:

    So the solution is simple:

    – Move the code that often crashes to its own process and add a communication channel between the hosting process and the sub-process.

    Leave all the 1990s asserts and friends in your code and let it crash/dump/restart. It does not pull your host down!

    Separating into two processes gives you the guarantee that there are no more side effects from the inconsistent code once the sub-process has died. This is something you can nearly never guarantee for code running in the main process (it could have corrupted the heap, left handles open, left threads running code that does harm, …).

    In .NET you may use AppDomains for separation, but this could still leave uncontrollable threads running.

    So both parties are happy. The user wants the UI to keep running and doesn’t want to lose data: checked! On the other side, the sub-process can keep its asserts and follow the fail-fast principle: checked!
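
    A minimal sketch of this split, with the communication channel reduced to command-line arguments and standard output. The worker script and the name `run_isolated` are hypothetical; a real host would likely use a richer channel and restart policy.

```python
import subprocess
import sys

# Worker code that keeps its halting assert and fails fast.
WORKER = r"""
import sys
value = int(sys.argv[1])
assert value >= 0, "negative input"
print(value * 2)
"""

def run_isolated(arg, retries=1):
    """Run the worker in its own process; a crash there cannot corrupt the host."""
    for _attempt in range(retries + 1):
        proc = subprocess.run(
            [sys.executable, "-c", WORKER, str(arg)],
            capture_output=True, text=True, timeout=30)
        if proc.returncode == 0:
            return int(proc.stdout)
        # Non-zero exit: the worker died; its state died with it, so retry clean.
    return None  # the worker kept dying, but the host is alive to report it

ok = run_isolated(21)
crashed = run_isolated(-1)
```

    The host survives both calls: the first returns the worker's result, and the second returns None after the worker's assert kills the sub-process on every attempt.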