Careful with that axe, part two: What about exceptions?

(This is part two of a two-part series on the dangers of aborting a thread. Part one is here.)

Suppose you’re shutting down the worker thread we were talking about last time, and it throws an exception. What happens?

Badness, that’s what. What to do about it?

As in our previous discussion, it is better to not be in this situation in the first place: write the worker code so that it does not throw. If you cannot do that, then you have two choices: handle the exception, or don’t handle the exception.

Suppose you don’t handle the exception. As of CLR v2 (I think), an unhandled exception in a worker thread shuts down the whole application. The reason is that in the past you’d start up a bunch of worker threads, they’d all throw exceptions, and you’d end up with a running application with no worker threads left, doing no work, and not telling the user about it. It is better to force the author of the code to handle the situation where a worker thread goes down due to an exception; doing it the old way effectively hides bugs and makes it easy to write fragile applications.

Suppose you do handle the exception. Now what? Something on another thread threw an exception, which is by definition an unexpected, exceptionally bad error condition. You now have no clue whatsoever whether any of your data is consistent or whether any of your program invariants are maintained in any of your subsystems. So what are you going to do? There’s hardly anything safe you can do at this point.

The question is “what is best for the user in this unfortunate situation?” It depends on what the application is doing. It is entirely possible that the best thing to do at this point is to simply aggressively shut down and tell the user that something unexpected failed. That might be better than trying to muddle on and possibly making the situation worse, by, say, accidentally destroying user data while trying to clean up.

Or, it is entirely possible that the best thing to do is to make a good faith effort to preserve the user’s data, tidy up as much state as possible, and terminate as normally as possible.
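As a rough illustration of the "handle it, then apply a policy" option, here is a minimal sketch. Everything named in it (DoWork, TrySaveUserData, the workerFailure field) is hypothetical and stands in for whatever the real application does; the policy shown, a good-faith save followed by an exit, is only one of the choices discussed above.

```csharp
using System;
using System.Threading;

static class WorkerPolicyDemo
{
    // Hypothetical slot where the worker records what it failed with.
    private static Exception workerFailure;

    static void Main()
    {
        var worker = new Thread(() =>
        {
            try
            {
                DoWork(); // hypothetical worker body
            }
            catch (Exception ex)
            {
                // Record the failure rather than letting it escape the thread,
                // which would shut down the whole process.
                Volatile.Write(ref workerFailure, ex);
            }
        });

        worker.Start();
        worker.Join();

        Exception failure = Volatile.Read(ref workerFailure);
        if (failure != null)
        {
            // Assumed policy: tell the user, make a good-faith effort
            // to preserve their data, then terminate.
            Console.Error.WriteLine("Worker failed: " + failure.Message);
            TrySaveUserData();
            Environment.Exit(1);
        }
    }

    // Hypothetical work item that simulates an unexpected failure.
    static void DoWork()
    {
        throw new InvalidOperationException("simulated failure");
    }

    // Hypothetical cleanup step; a real one should write a recovery copy
    // and avoid overwriting the user's original data.
    static void TrySaveUserData()
    {
        try { /* write a recovery copy here */ }
        catch (Exception) { /* saving failed too; give up */ }
    }
}
```

If the aggressive-shutdown policy is the better fit, the body of the `if` would instead report the failure and exit without touching user data at all.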

Both today’s question and the one from last time are specific versions of the more general question “what do I do when my subsystems running on worker threads do not behave themselves?” If your subsystems are unreliable, either make them reliable, or have a policy for how you deal with an unreliable subsystem, and implement that policy. That’s a vague answer I know, but that’s because dealing with an unreliable subsystem is an inherently awful situation to be in. How you deal with it depends on the nature of its unreliability, and the consequences of that unreliability to the user’s valuable data. There are no easy one-size-fits-all answers here, unfortunately.

Comments (10)

  1. Tim Goodman says:

    Hey Eric, I’ve discovered your blog recently and I love it.  I do have a random question for you: How do you pronounce your last name? “LIP-pert”, “LYE-pert”, “li-PAIR” (like the chef Eric Rippert)?

    I ask because I’d like to be able to tell my friends and coworkers “You really should be reading Eric Lippert’s blog” without sounding like an idiot.

    Glad you like the blog, and thanks for asking. The first on your list is correct. The name is German in origin, which is not surprising considering that the part of Ontario I grew up in was largely settled by German immigrants. In my mother’s childhood it was still common to hear German spoken as a first language in people’s homes, though not so much these days. The other Eric Lipperts I’ve run into online over the years tend to be of German, Dutch or Scandinavian origin. — Eric

  2. Tim Goodman says:

    Eric Ripert apparently spells his name with just one p, so I already sound a *little* like an idiot.

  3. KooKiz says:

    Unfortunately? Hey, if there were a pre-made answer for everything, coding wouldn’t be as fun.

  4. Phil Nash says:

    @Tim, you are inherently unreliable and must be shut down 😉

    @Eric You didn’t mention a third case, which is where a thread might be doing something "off to the side" and it’s ok if it fails, or needs to be restarted – it doesn’t interfere with the rest of the system. So an unexpected exception can be handled without terminating. I admit that this situation occurs less frequently than many developers seem to think, but it remains a legitimate case, nonetheless.

  5. danielearwicker says:

    If you need to deal with potentially unreliable subsystems in some way (other than dying on your arse), then process isolation is a popular solution, and may not be as heavyweight as it sounds. IIS does it, Google Chrome does it. Of course it’s still not perfect: separate processes are obviously not entirely isolated or there wouldn’t be any point running them, so they can leave behind a shared mess when they terminate (e.g. temporary files).

    But an unexpected exception from any thread in the same process as you? Forget it. Log any information you can easily get without causing worse trouble (hopefully a stack trace at least), and give up the ghost as fast as possible.

    And as I never tire of publicising this excellent tip from the CLR team: if you already have one unexpected exception in play, you don’t want to trigger any more (or do persistent damage, or destroy evidence), so don’t let any finally blocks execute. How do you stop finally blocks running? They only run when an exception has actually been caught (and of course, only expected exceptions should ever be caught). So don’t catch all exceptions in Main (or anywhere). Instead, handle the AppDomain.UnhandledException event, do your logging, and then call Environment.FailFast before your out-of-control program gets any more creative on you.
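    The approach in this comment might be sketched roughly as follows. RunApplication is a hypothetical stand-in for the real program, and the sketch makes no promise about whether the finally block runs on a given runtime; it only shows the wiring of AppDomain.UnhandledException and Environment.FailFast.

    ```csharp
    using System;

    static class FailFastDemo
    {
        static void Main()
        {
            // Log and terminate immediately when an exception goes unhandled.
            AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
            {
                // ExceptionObject is typed as object; it is usually an Exception.
                var ex = e.ExceptionObject as Exception;
                string details = ex != null ? ex.ToString() : "Unknown unhandled exception";
                Console.Error.WriteLine(details);

                // FailFast ends the process without running further cleanup code.
                Environment.FailFast("Unhandled exception; terminating.", ex);
            };

            RunApplication(); // hypothetical entry point; note: no catch-all here
        }

        // Hypothetical application body that simulates an unexpected bug.
        static void RunApplication()
        {
            try
            {
                throw new InvalidOperationException("simulated bug");
            }
            finally
            {
                Console.WriteLine("finally ran");
            }
        }
    }
    ```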

  6. Adam says:

    @ danielearwicker

    I thought finally blocks ran regardless? The only way I know that a finally block won’t run is if you don’t step into its corresponding try to begin with.

    Of course, I stand to be corrected.

  7. arnshea says:

    Great post. I’m totally grateful for the change made in CLRv2; sometimes it’s much worse to cover a fatal error than to just crap out. And you can provide an unhandled thread exception handler if you really want to handle it.

    There’s some element of subjectivity here that I don’t know how you escape.  The line between defensive programming and dangerous masking of errors is not always bright.

  8. Matt Warren says:

    @Adam – Environment.FailFast makes sure that any active finalizers aren’t run, see the MSDN docs for more info

  9. danielearwicker says:

    @Adam – well, if control enters a try block with an associated finally block, a simple way to stop the finally block running is to pull out the power cable! 🙂 But that involves very careful timing and skill. Fortunately there are more useful ways also.

    I’m not sure about which CLR version this was established in (I think it may have been less reliable in the past) but the following seems to be solid in the current CLR 2.0 that ships with ‘.NET 3.5 SP1’.

    If an exception escapes uncaught from your Main method, the AppDomain.UnhandledException event is fired. Simultaneously (I think) the Windows Error Reporting system kicks in, allowing a debugger to attach and inspect the stack before it is unwound. But only after the UnhandledException event has been handled are the finally blocks on the stack executed.

    This is very different from what happens if the exception is caught: if so, the nested finally blocks run first – but only because a matching catch for the exception has been located on the call stack.

    So an uncaught exception provides a special opportunity to disallow finally blocks from running. We can capture a minidump, or a simple stack trace, wait for a debugger to attach, or simply just kill the process. Running of finally blocks is *optional*.

    Some people react negatively to this, because they believe that the whole value of finally blocks is that they offer a guaranteed way to get some code to run. But in fact they would be significantly *less* valuable if that were the case. They have positive value if they only run in situations you can recover from. But they actually become a *burden* if they also run in situations you cannot recover from – because how on earth can you write them correctly, if they have to somehow operate in the face of any kind of bug or corruption of program state?

    To make them only run when they ought to, you have to be disciplined about what you catch. In particular, avoid try { } catch (Exception e) { /*examine e and optionally… */ throw; }. Catching/rethrowing an exception is not the same as not catching it at all, because it triggers all the nested finally blocks at that point – even if the exception is NullReference and therefore indicative of a screw-up.

    Assuming it is practical to get all this right (depends how much control you have over 3rd party code in your process, of course), then you can design your finally blocks with less paranoia.

  10. Pavel Minaev [MSFT] says:

    As a side note, I’ve found it much easier to deal with background thread abort when writing in F# – simply because the thread is doing a completely encapsulated computation that does not affect the state of the system in any way until fully computed (and then the result is published via a single atomic reference field assignment).

    Of course, the same thing is perfectly doable in C#, it’s just that in F# you’re always in that "no side effects" mental mode by default, the language making it easiest. That it also makes it easy to discard a thread without caring about invariants and such was an unintended side-effect that was discovered long after the code was written.
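    For readers who want to see the C# version of that pattern, here is a minimal sketch. The names (published, the squares computation) are made up for illustration; the point is only that the worker touches nothing shared until the single reference assignment at the end.

    ```csharp
    using System;
    using System.Threading;

    static class AtomicPublishDemo
    {
        // Hypothetical shared slot; stays null until the worker publishes.
        private static int[] published;

        static void Main()
        {
            var worker = new Thread(() =>
            {
                // All intermediate state is local to this thread, so discarding
                // the thread midway leaves no shared invariant broken.
                var local = new int[5];
                for (int i = 0; i < local.Length; i++)
                {
                    local[i] = (i + 1) * (i + 1);
                }

                // Publish the finished result with one atomic reference write.
                Volatile.Write(ref published, local);
            });

            worker.IsBackground = true;
            worker.Start();
            worker.Join();

            int[] result = Volatile.Read(ref published);
            Console.WriteLine(result == null
                ? "No result was published."
                : "Published " + result.Length + " values.");
        }
    }
    ```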