Careful with that axe, part one: Should I specify a timeout?

(This is part one of a two-part series on the dangers of aborting a thread. Part two is here.)

The other day, six years ago, I was talking a bit about how to decide whether to keep waiting for a bus or to give up and walk. It led to quite an interesting discussion on the old JoS forum. But what if the choice isn’t “wait for a bit, then give up”, but rather “wait for a bit, and then take an axe to the thread”? A pattern I occasionally see is something like this: I’ve got a worker thread that I started up; I ask it to shut down, and then I wait for it to do so. If it doesn’t shut down soon, take an axe to it:

this.running = false;              // politely ask the worker to stop
if (!workerThread.Join(timeout))   // wait a while for it to do so
    workerThread.Abort();          // and take an axe to it if it does not

Is this a good idea?

It depends on just how badly the worker thread behaves and what it is doing when it is misbehaving.

If you can guarantee that the work is short in duration, for whatever ‘short’ means to you, then you don’t need a timeout. If you cannot guarantee that, then I would suggest first rewriting the code so that you can guarantee that; life becomes much easier if you know that the code will terminate quickly when you ask it to.
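For example, a worker that does its work in short, bounded units and polls a shared flag between units can always shut down promptly when asked. Here is a minimal sketch of that idea; the names and the brief sleeps are just for illustration, not from the snippet above:

```csharp
using System;
using System.Threading;

class Program
{
    // volatile ensures the worker promptly sees the write from the main thread.
    static volatile bool running = true;

    static void Worker()
    {
        while (running)
        {
            // ... one short, bounded unit of work goes here ...
            Thread.Sleep(1); // stand-in for a short unit of work
        }
        // Each unit is short, so the worker notices the flag quickly
        // and exits cleanly; no Join timeout is required.
    }

    static void Main()
    {
        var worker = new Thread(Worker);
        worker.Start();
        Thread.Sleep(10);  // let the worker run briefly
        running = false;   // politely ask it to stop
        worker.Join();     // returns promptly; no axe needed
        Console.WriteLine("worker exited cleanly");
    }
}
```

Because the worker checks the flag at least once per short unit of work, the Join never needs a timeout and the axe stays on the wall.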

If you cannot, then what’s the right thing to do? The assumption of this scenario is that the worker is ill-behaved and does not terminate in a timely manner when asked to. So now we’ve got to ask ourselves “is the scenario that the worker is slow by design, buggy, or hostile?”

In the first scenario, the worker is simply doing something that takes a long time and, for whatever reason, cannot be interrupted. What’s the right thing to do here? I have no idea. This is a terrible situation to be in. Presumably the worker is not shutting down quickly because doing so is dangerous or impossible. In that case, what are you going to do when the timeout times out? You’ve got something that is dangerous or impossible to shut down, and it’s not shutting down in a timely manner. Your choices seem to be

(1) do nothing
(2) wait longer
(3) do something impossible. Preferably before breakfast.
(4) do something dangerous

Choice one is identical to not waiting at all; if that’s what you’re going to do then why wait in the first place? Choice two just changes the timeout to a different value; this is question begging. By assumption we’re not waiting forever. Choice three is impossible. That leaves “do something dangerous”. Which seems… dangerous.

What the right thing to do is, in order to minimize harm to user data, depends upon the exact circumstances that are causing the danger; analyze them carefully, understand all the scenarios, and figure out the right thing to do. There’s no slam-dunk easy solution here; it will depend entirely on the real code that is running.

Now suppose the worker is supposed to be able to shut down quickly, but does not because it has a bug. Obviously, if you can, fix the bug. If you cannot fix the bug — perhaps it is in code you do not own — then again, you are in a terrible fix. You have to understand what the consequences are of not waiting for already-buggy-and-therefore-unpredictable code to finish before disposing of the resources that you know it is using right now on another thread. And you have to know what the consequences are of terminating a thread while a buggy worker thread is still busy doing heaven only knows what to operating system state.

If the code is hostile and is actively resisting being shut down then you have already lost. You cannot halt the thread by normal means, and you cannot even reliably thread abort it. There is no guarantee whatsoever that aborting a hostile thread actually terminates it; the owner of the hostile code that you have foolishly started running in your process could be doing all of its work in a finally block or other constrained region which prevents thread abort exceptions.

The best thing to do is to never get into this situation in the first place; if you have code that you think is hostile, either do not run it at all, or run it in its own process, and terminate the process, not the thread, when things go badly.
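A sketch of that last approach; the name untrusted.exe is hypothetical, standing in for whatever code you would otherwise have run on a thread inside your own process:

```csharp
using System;
using System.Diagnostics;

class Sandbox
{
    static void Main()
    {
        // Hypothetical name: "untrusted.exe" stands in for the code
        // you do not trust enough to run in your own process.
        Process process = Process.Start("untrusted.exe");

        // Give it a bounded amount of time to finish cleanly...
        if (!process.WaitForExit(5000))
        {
            // ...and if it does not, kill the *process*. Unlike
            // Thread.Abort, this cannot be resisted by a finally block,
            // and the OS reclaims all of the process's resources.
            process.Kill();
        }
    }
}
```

Killing a process is still an axe, but it is an axe the operating system knows how to swing safely: the hostile code gets no opportunity to catch anything, and its memory, handles, and locks are all reclaimed.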

In short, there’s no good answer to the question “what do I do if it takes too long?” You are in a terrible situation if that happens and there is no easy answer. Best to work hard to ensure you don’t get into it in the first place; only run cooperative, benign, safe code that always shuts itself down cleanly and rapidly when asked. Careful with that axe, Eugene.

Next time, what about exceptions?


Comments (19)

  1. Chris B says:

    I never realized that a finally block prohibited thread abort exceptions. I think I understand the reasoning behind it (don’t die without cleaning up), but is malicious C# really as easy as

    try
    {
        throw new Exception("MWAH HA HA!!");
    }
    finally
    {
        MaliciousFunction();
    }

    If so, what mechanisms can terminate a malicious .Net process that is executing in a finally block?  What would happen in this situation if MaliciousFunction() ran unmanaged code?

  2. Robert Davis says:

    @Chris B

    Pretty sure Environment.FailFast() will kill a finally block, as will running the possibly malicious code in a separate process and using Process.Kill().
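    For example, a minimal sketch of the FailFast half of that, reusing Chris’s evil finally block:

```csharp
using System;

class Demo
{
    static void Main()
    {
        try
        {
            // FailFast terminates the process immediately; no finally
            // blocks, finalizers, or exit handlers run on the way down.
            Environment.FailFast("Shutting down a process that will not cooperate.");
        }
        finally
        {
            // Never executes, even though it is a finally block.
            Console.WriteLine("MWAH HA HA!!");
        }
    }
}
```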

  3. Gabe says:

    Once I nearly resorted to spawning a "taskkill myprocess.exe" to kill a rogue thread that sometimes would cause my process not to exit. This rogue thread was in a buggy old IE plug-in that wasn’t supported by the vendor anymore, so I just had to use whatever means necessary to work around its quirks.

  4. JP Bochi says:

    I would add one extra option to that list:

    (5) ask another entity (a human user, perhaps)

  5. Claudio says:

    To add safety, you can also isolate threads that you do not fully control (due to their needing to use components you do not own/fully control) into their own app domains, hence limiting their potential for damage.  Of course, that limits your ability to interact with them too.

    Ideally, in threads that run your own code, you should design a shared status flag into them (e.g. bool shouldContinue), giving you a clean means of your own design to cancel a thread’s operation, and use that.  If you do that and the thread still does not shut down, you may as well hard-shut it down – you are lost in that scenario anyway.

  6. Erik Schulz says:

    Some commandments of multi-threading.

    All threads shall terminate quickly when asked to.

    Thou shall only use one primitive for multi-threading.

    Thou shall treat every thread as if it is a feature.

    Thou shall not share data incautiously.

    Thou shall not thrash the L2 cache.

  7. Bob says:

    Eric,

    If a failure to shut down when told to do so is dangerous, I’d hate to think of the consequences of the failure of a component that is a single point of failure in such a system.

  8. Kevin says:

    I really hate this situation. This is often the case when working with someone else’s interface. You call into some interface whose implementation you have no control over, and it decides to never return. Ugh.

    I wish some developers would be honest with their naming. Instead of calling it FetchValue(…), they should call it FetchValue_OrHangForeverWithoutGivingTheDeveloperAnyWayToCancelThisStupidCall(…).

  9. Mike Dunn says:

    Was that a Hitch-hiker’s reference in #3?

  10. Stu says:

    Mike: no, Alice in Wonderland.

  11. Chris Nahr says:

    "Presumably the worker is not shutting down quickly because doing so is dangerous or impossible."

    That’s an unwarranted assumption.  The worker might not be shutting down simply because it’s doing some long-running computation, and the worker code wasn’t peppered with frequent checks for cancellation requests.  Moreover, doing so makes the worker code harder to read, is rather laborious, and possibly very difficult since you need to balance performance with quick cancellation.  Thread.Abort is the only practical option here, and it’s also safe if the code is working only on managed memory with no outside dependencies.

    However, I hear F# (or was it the new parallel library in .NET 4?) has a nice trick where parallelized loops do automatic cancellation checks at each iteration, so you don’t have to manually add those checks to the worker code.  That should take care of some cases that are amenable to this kind of parallelization.
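    A minimal sketch of what that looks like, assuming the .NET 4 parallel library’s Parallel.For with a ParallelOptions.CancellationToken (here the token is cancelled up front just to demonstrate the effect):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Demo
{
    static void Main()
    {
        var cts = new CancellationTokenSource();
        var options = new ParallelOptions { CancellationToken = cts.Token };
        cts.Cancel(); // request cancellation up front, just for the demo

        try
        {
            // Parallel.For itself checks the token between iterations,
            // so the loop body needs no hand-written cancellation checks.
            Parallel.For(0, 1000, options, i =>
            {
                // long-running per-item work would go here
            });
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("loop cancelled cleanly");
        }
    }
}
```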

  12. Dave Sexton says:

    Potentially hostile code (such as plug-ins) is often executed in a new AppDomain.  For those unaware, this approach offers the use of CAS to limit the code’s capabilities (i.e., sandbox), makes communication with the main application quite easy (e.g., MarshalByRefObject), and also provides a way to terminate the code and unload all of its assemblies without affecting the main AppDomain.

    However, AppDomain cannot solve the "aborted thread" problem as you’ve described for potentially harmful code, in all cases, due to CERs.  Whenever CannotUnloadAppDomainException is thrown by a call to AppDomain.Unload, calling Environment.FailFast will ignore all finally blocks, except for CERs.  I believe we can prevent plug-ins from using CERs by denying them unmanaged code privileges via CAS.  But of course, if plug-ins need unmanaged code execution rights, then we have to make the difficult choice of running them out-of-proc (possibly sacrificing CAS altogether, and increasing the difficulty of communicating with the main app), or running them in-proc and leaving it up to the user to "Fail Fast" manually by terminating the process in Task Manager.

  13. Eamon says:

    The "have cancellation checks" alternative to Thread.Abort is a terribly poor solution to what (in other contexts) is a solved problem.  It’s not always easy to predict which component needs cancellation checks where; large, very valuable bodies of external code you’ll be using don’t include these new cancellation points (but potentially *do* include finally blocks – go figure), and adding cancellation points in complex computations may be tricky and obscure the code.

    If an operation reads a bunch of memory and writes to a particular segment of memory, then it’s in principle OK to stop the operation half-way, in whatever invalid state, so long as you discard the contents of the output (or consider them nonsensical).  This approach maps much, much more naturally to how encapsulation in general works; it plays particularly nicely with functional code, and, via finally blocks or some other mechanism, it can still permit an unlimited amount of cleanup after an asynchronous abort if that’s absolutely critical (which it really shouldn’t be).

    Now, we *could* stick that abortable code in a separate process.  However, the .NET framework in particular doesn’t make that easy: the new process lives in a new memory space, meaning that even read-only access to critical data structures isn’t possible (or at least not easily), and even if it were, you suddenly lose the benefits of the GC, since the GC doesn’t track cross-process references.  And, obviously, you can’t use a whole host of standard features for no particularly good reason (e.g. things like Parallel.ForEach).  Finally, it’s also a much less stable solution: whereas previously a synchronization error might corrupt one process and leave it with a few thousand threads, using processes like this by default breaks containment; killing one does not necessarily kill the rest – so bugs can more easily undermine the whole system’s stability.  If processes were as cheap as threads, and the instantiation library would instruct the OS to consider new processes as sub-processes to be terminated on their parent’s exit, and the GC were multi-process aware and accessing other processes’ objects (at least in a read-only fashion) were transparently possible, sure, *then* processes would be a decent option.

    None of this is to say that Thread.Abort might not be potentially problematic *right now*.  However, in the long run, it’s a much more fundamentally attractive proposition than manually sprinkling code with "stopping points".  The OS world switched from cooperative multitasking to preemptive multitasking, and for similar reasons thread termination should not *require* manual handholding – it’s just too bug-prone, and it breaks encapsulation.

    If .NET aims to be a useful platform for multi-threaded development, it should aim to make such development easy and reliable.  Manually sprinkling code with stopping points is anything but easy and reliable.

  14. MarcelDevG says:

    Funny, just got back from The Australian Pink Floyd. They didn’t play that one, but it was very good!!

    What about threads which themselves use timeouts?

    I use worker threads to monitor devices over the internet. The worker threads are sending commands with timeouts over sockets.

    When stopping the worker thread I wait the longest timeout, but sometimes the thread does not respond in a timely fashion.

    So I’m using the axe. Didn’t find another option.

  15. Anon says:

    > The "have cancellation checks" alternative to thread.abort is a terribly poor solution to what (in other contexts) is a solved problem.

    @Eamon, so what’s the solution then to the ‘solved problem’?

    I’d agree that predicting is hard without profiling, but typically a worker thread is going away to do something specific, and for me that usually means something like ‘go and process the next n requests’ and at the end of every n, the thread checks for cancellation.
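    That "check at the end of every n" pattern might look something like this (names invented for the sketch):

```csharp
using System;
using System.Collections.Generic;

class BatchWorker
{
    // Shared flag; volatile so the worker sees a cancel from another thread.
    static volatile bool cancelRequested = false;

    static int ProcessAll(IEnumerable<int> requests, int batchSize)
    {
        int processed = 0;
        foreach (var request in requests)
        {
            // ... handle one request here ...
            processed += 1;

            // Check for cancellation once per batch of n requests,
            // rather than scattering checks through the work itself.
            if (processed % batchSize == 0 && cancelRequested)
                break;
        }
        return processed;
    }

    static void Main()
    {
        cancelRequested = true; // cancel immediately, just for the demo
        int done = ProcessAll(new List<int> { 1, 2, 3, 4, 5 }, batchSize: 2);
        Console.WriteLine(done); // stops at the end of the first batch
    }
}
```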

    I’ve written a few windows services, and these checks are built into the base class that each worker runs. We try to keep sys admins happy and avoid "The service failed to end in a timely fashion" messages.

    So, if you’re creating a background process, or anything unattended you have to raise the bar, and you should always raise the question ‘Is it cancellable, and how long do I let it run before terminating?’

    So as long as you know the worst case behaviour and communicate that out, then the code that spawns the process knows that it should on average take x, so if, after requesting cancel, the time is > x + some fudge factor, then go ahead and take the axe to it.

    The axe coming out really points to a defect in the worker, and your code should log that. If you’re calling a 3rd party component that’s run away on you, well at least you have an axe to grind.

  16. Pop.Catalin says:

    Eric, I agree, "careful" is the key word here:

    A few months ago, I ran into an issue where a mobile application wasn’t shutting down under some special conditions (losing wi-fi connectivity when shutting down a 3rd party component while sending data). We reported the bug to the vendor along with code that could repro the bug fairly consistently on our test devices; unfortunately the bug was closed with "no repro" after several iterations of knowledge exchange. So I had to pull out the axe. The code for the app was specifically written to be resilient in case of errors both on the client side and the server side, so the only outcome could have been some minor data loss, but we did a small redesign to deal with such an issue, by storing all unprocessed data locally.

    Our profession is no less dangerous (from a failure perspective) than most: we are faced with human errors, hardware errors and software errors, and the possibility of data loss or exceptional conditions can creep up at any time, so you have to be prepared to pull out the axe, and know how to use it without hurting yourself or others. This is a fact: you can’t avoid extreme measures indefinitely; situations will arise when the axe is needed, even if only as a temporary solution or in extreme or unforeseen cases.

    My opinion is that you shouldn’t use the axe if other options are available, but if you have to, then make sure you know how to use it, and take all the necessary safeguards.

    P.S. We can’t make perfect software just yet, sometimes the axe will be needed despite everything else, and no matter how much we dislike taking that path, it comes with being a professional.

  17. Eamon says:

    @Anon: the other context referred to was processes on the OS.

    ThreadAbort as-is is problematic because you don’t know the state of the overall process (or appdomain) when the thread aborts: perhaps the thread was doing something critical to global state and leaves it behind corrupted.

    Now, there are (at least) two solutions to this problem:

    (A) manually mark safe stopping points.  This is a huge effort; it’s likely to be buggy at least some of the time.  Even if you do pick "safe" stopping points, that’s not a cure-all: you’ll be terminating via a code path not commonly travelled in the single-threaded scenario, so you’ll still need to manually inspect each clean-up case, or trust to automatic mechanisms such as try…finally.

    (B) partition and encapsulate updatable regions of your memory.  If a thread goes down asynchronously, assume any partition the thread is currently touching is simply trash; use an external tool to enforce the release of resources held by that thread.  This is essentially what the OS does when processes go down.

    (B) isn’t a magic bullet – resources need to be owned by a thread to be released (often a stack-unwinding release à la using clauses will work here, but a more robust method is conceivable).  Also, you do need to partition *write* access to memory.  In practice, manageable multithreaded programs do that anyhow; you don’t go around writing everywhere unless you like headaches.

    (A), however, is simply not sustainable.  It’s virtually untestable, unverifiable, and asking for exotic race conditions – did you check all possible exits?  Also, problems with this approach are *not* limited to thread abort exceptions.  In practice, any other unexpected exception might halt execution in an invalid state – so for code to be robust, it needs to be abortable *anyhow* just to be pragmatically exception-safe (which is why even in a single-threaded world things like RAII and using clauses are needed).  Marking manual stopping points is much, *much* more work and makes code less readable.  It’s also a waste of time if your code is actually largely a side-effect-free functional body of code; and that seems to be the direction multi-threaded code is heading anyhow.  If your code doesn’t have any side effects – that is, if the only consistency-critical effect on external state is the final "return" value – then it doesn’t matter *when* you shut down the thread *anyhow*.

    We want side-effect free threading anyhow; this notion of "safe" stopping points is a treacherous bog I don’t want to enter if I can at all avoid it.  "using" blocks can deal with external resources the thread really needs to release, and the rest of the state is trash for the GC.

  18. Gabe says:

    The great thing about managed code is that you can safely take the axe to any thread that doesn’t manipulate shared state. Back in the olden days, terminating a thread leaked whatever memory it had allocated and left files locked, sockets open, and so forth, meaning you couldn’t rely on axing threads as a strategy, so you had to pepper threads with stopping points. Managed code lets you avoid that, so it is reasonable to use thread termination as a way of dealing with things like users canceling a long-running calculation.

  19. Bruno Brant says:

    I have a similar situation here at work. I’m designing a system where we make a "cross-environment method call", which is, of course, asynchronous. The caller, A, delegates this task to my process orchestrator, B, telling B that a process § should be completed (or, B(§)). Now, since the process § is parameter-based and can call as many different programs as the user wants, I have no clue how long § should take to complete. In fact, any program P can enter an infinite loop.

    Now, since A is making a cross-environment call, it deals with atomicity using compensation methods. So if the call to B(§) fails or times out, compensation should be done for every step up to B(§), but not for B(§) itself, since it failed. But let’s say that what really takes long in handling B(§) is a database commit. The commit is the last step B executes, and B can no longer expire, even if the timeout has already occurred. By the time B finishes committing, A has already begun compensating other steps in its own process.

    This is a liability and, right now, I can only think of dealing with it by fine-tuning the timeouts… which is no good at all.