Asynchrony in C# 5, Part Eight: More Exceptions

(In this post I'll be talking about exogenous, vexing, boneheaded and fatal exceptions. See this post for a definition of those terms.)

If your process experiences an unhandled exception then clearly something bad and unanticipated has happened. If its a fatal exception then you're already in no position to save the process; it is going down. You might as well leave it unhandled, or just log it and rethrow it. If it had been anticipated because it's a vexing or exogenous exception then there would be a handler in place for it. An unhandled vexing/exogenous exception is a bug, but probably one which does not actually indicate a logic problem in the program's algorithms, it's just an oversight.

But if you have an unhandled boneheaded exception then that is evidence that your program has a very serious bug indeed, a bug so bad that its operation cannot continue. The boneheaded exception should never have been thrown in the first place; you never handle them, you make for darn sure they cannot possibly happen. If a boneheaded exception is thrown then you have no idea whatsoever what locks were released early, what internal state is now corrupt or inconsistent, and so on. You can't do anything with confidence, and often the best thing to do in that case is to aggressively shut down the process before things get any worse.

We cannot easily tell the difference between bugs which are missing handlers for vexing/exogenous exceptions, and which are bugs that have caused a program crash because something is broken in the implementation. The safest thing to do is to assume that every unhandled exception is either a fatal exception or an unhandled boneheaded exception. In both cases, the right thing to do is to take down the process immediately.

This philosophy underlies the implementation of unhandled exceptions in the CLR. Way back in the CLR v1.0 days the policy was that an unhandled exception on the "main" thread took down the process aggressively, but an unhandled exception on a "worker" thread simply killed the thread and left the main thread running. (And an exception on the finalizer thread was ignored and finalizers kept running.) This turned out to be a poor choice; the scenario it leads to is that a server assigns a buggy subsystem to do some work on a bunch of worker threads; all the worker threads go down silently, and the user is stuck with a server that is sitting there waiting patiently for results that will never come because all the threads that produce results have disappeared. It is very difficult for the user to diagnose such a problem; a server that is working furiously on a hard problem and a server that is doing nothing because all its workers are dead look pretty much the same from the outside. The policy was therefore changed in CLR v2.0 such that an unhandled exception on a worker thread also takes down the process by default. You want to be noisy about your failures, not silent.

I am of the philosophical school that says that sudden, catastrophic failure of a software device is, of course, unfortunate, but in many cases it is preferable that the software call attention to the problem so that it can be fixed, rather than trying to muddle along in a bad state, possibly introducing a security hole or corrupting user data along the way. Software that terminates itself upon encountering unexpected exceptions is software that is less vulnerable to attackers taking advantage of its flaws. As Ripley said, when things go wrong you should  take off and nuke the entire site from orbit; it's the only way to be sure. But does this awesome philosophy serve the async scenario well?

Last time I mentioned two interesting scenarios: (1) what happens if a task-returning async method does a WhenAll or WhenAny on multiple tasks, several of which throw exceptions? and (2) what if a void-returning async method awaits a task which completes abnormally? What happens to that exception?

Let's consider the first case first.

WhenAll collects all the exceptions from its completed sub-tasks and stuffs them into an aggregating exception. When all its sub-tasks complete, it completes its task abnormally with the aggregated exception. A slightly bizarre fact, however, is that by default, the EndAwait only re-throws the first of those exceptions; it does not re-throw the entire aggregating exception. The more common scenario is for any try-catch surrounding an "await" to be catching some set of specific exceptions; making you always write code that goes and unpacks the aggregating exception seems onerous. This may seem slightly odd; for more details on why this is a reasonable idea see Jon Skeet's recent posts on the topic. 

The WhenAny case is similar. Suppose the first sub-task completes, either normally or abnormally. That completes the WhenAny task, either normally or abnormally. Suppose one of the additional sub-tasks completes abnormally; what happens to its exception? The WhenAny is done: it has already completed and called its continuation, which is now scheduled to run on some work queue if it hasn't already.

In both the WhenAll and WhenAny cases we have a situation where there could be an exception that goes "unobserved" by the creator of the WhenAll or WhenAny task. That is to say, in both these cases there could be an exception that is thrown, automatically caught, cached and never thrown again which in the equivalent synchronous code would have brought down the process.

This seems potentially bad. Should an unobserved exception from a task that was asynchronously awaited take down the process, as the equivalent synchronous code would have?

Suppose we decide that yes, an unobserved exception should take down the process. When does that happen? That is, when do we definitively know that the exception actually was not re-thrown? We only know that if the task object is finalized without its result ever being observed. After all, a "living" task object that has completed abnormally could have its continuation executed at any time in the future; it cannot know when that continuation is going to be scheduled. There could be any number of queued-up tasks on this thread that get to run between the time this task completed abnormally and its result is requested. As long as the task object is alive then its exception could be observed.

OK, so, great, if a task is finalized, and it completed abnormally then we... what? Throw the exception on the finalizer thread? Sure! That will take down the process, right? In CLR v2.0 and above, unhandled exceptions on any thread take down the process. But let's take a step back. Remind me, why do we want an unobserved exception to take down the process? The philosophical reason is: we cannot tell whether this was a boneheaded exception that indicates a potentially horrible, security-impacting situation that needs to be dealt with by immediate termination, or simply the result of a missing handler for an unanticipated exogenous exception. The safe thing to do is to say that it was a boneheaded exception with a security impact and immediately take the process down. Which is precisely what we are not doing! We are waiting for the task to be collected by the garbage collector and then trying to take the process down in the finalizer thread. But in the gap between the exception being recorded in the task and the finalizer observing the exception, we've potentially kept right on running dozens more tasks, any of which could be using the inconsistent state caused by the boneheaded exception.

Furthermore, we anticipate that most async tasks that throw exceptions in realistic code will in fact be throwing exogenous exceptions like "the password for this web service is wrong" or "you don't have permission to read this file", or "this operation timed out", rather than boneheaded exceptions like "you dereferenced null" or "you tried to pop an empty stack". In these realistic cases it seems much more plausible to say that if for some reason a task completes abnormally and no one bothers to observe its result, it's because some asynchronous unit of work was abandoned; any of its sub-tasks that ran into problems connecting to web servers (or whatever) can safely be ignored.

In short, an unobserved exception from a finalized task is one that no one cares about, is probably harmless, and if it was harmful, then we've already delayed taking action too long to prevent more harm. Either way, we might as well just ignore it.

This does illustrate that asynchronous programming introduces a new flavour of security vulnerability. If there is a security vulnerability caused by a bug that would normally take down the process, and if that code is rewritten to be asynchronous, and if the buggy task is abandoned without observation of its exception, then the bug might not result in an aggressive destruction of the now-vulnerable process. And even if the exception is eventually observed, there might be a window in time between when the bug introduces the vulnerability and the exception is observed. That window might be large enough for an attacker to succeed. That sounds like a tortuous chain of things that have to go wrong - because it is - but attackers will take whatever they can get. They are crafty, they have all the time in the world, and they only have to succeed once.

I never did say what happens to a void-returning method that awaits a task; you can think of this as a "fire and forget" sort of method. Perhaps a void-returning button-click event handler awaits fetching some data asynchronously and then updating the user interface; there's no "caller" of the event handler that cares to hold on to a task, and will never observe its result. So what happens if the data-fetching task completes abnormally?

In that case, when the void-returning method (which registered itself as a continuation, remember) starts up again, it checks to see if the task completed abnormally. If it did, then it immediately re-throws the exception to its caller, which is, of course, probably some message loop. I believe the plan of action here is to be consistent with the behaviour described above; in that scenario the message loop will discard the exception, assuming that the fire-and-forget asynchronous method failed in some benign way.

Having been an advocate of the "nuke from orbit" philosophy of unhandled exceptions for many years, emotionally this does not sit well with me, but I'm unable to marshal a convincing argument against this strategy for dealing with exceptions in task-based asynchrony. Readers: What do you think? What is in your opinion the right thing to do in scenarios where exceptions of tasks go unobserved?

And on that somewhat ominous note, I'm going to take a break from talking about the new Task Asynchrony Pattern for now. Please download the CTP, keep sending us your feedback and questions, and start thinking about what sorts of things that will work well or work poorly with this new feature. Next time: we'll pick up with more fabulous adventures after American Thanksgiving; I'm cooking turkey for 19 this year, which should be quite the adventure in of itself.