The SuspendThread function suspends a thread, but it does so asynchronously


Prologue: Why you should never suspend a thread.

Okay, so a colleague decided to ignore that advice because he was running some experiments with thread safety and interlocked operations, and suspending a thread was a convenient way to open up race windows.

While running these experiments, he observed some strange behavior.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

LONG lValue;

DWORD CALLBACK IncrementerThread(void *)
{
 while (1) {
  InterlockedIncrement(&lValue);
 }
 return 0;
}

// This is just a test app, so we will abort() if anything
// happens we don't like.

int __cdecl main(int, char **)
{
 DWORD id;
 HANDLE thread = CreateThread(NULL, 0, IncrementerThread, NULL, 0, &id);
 if (thread == NULL) abort();

 while (1) {
  if (SuspendThread(thread) == (DWORD)-1) abort();

  // InterlockedOr with zero is an atomic read. If the incrementer
  // thread is really suspended, two consecutive reads must agree.
  if (InterlockedOr(&lValue, 0) != InterlockedOr(&lValue, 0)) {
   printf("Huh? The variable lValue was modified by a suspended thread?\n");
  }

  ResumeThread(thread);
 }
 return 0;
}

The strange thing is that the "Huh?" message was being printed. How can a suspended thread modify a variable? Is there some way that InterlockedIncrement can start incrementing a variable, then get suspended, and somehow finish the increment later?

The answer is simpler than that. The SuspendThread function tells the scheduler to suspend the thread but does not wait for an acknowledgment from the scheduler that the suspension has actually occurred. This is sort of alluded to in the documentation for SuspendThread, which says:

This function is primarily designed for use by debuggers. It is not intended to be used for thread synchronization.

You are not supposed to use SuspendThread to synchronize two threads because there is no actual synchronization guarantee. What is happening is that SuspendThread signals the scheduler to suspend the thread and then returns immediately. If the scheduler is busy doing something else, it may not be able to handle the suspend request right away, so the thread being suspended gets to run on borrowed time until the scheduler gets around to processing the suspend request, at which point the thread actually gets suspended.

If you want to make sure the thread really is suspended, you need to perform a synchronous operation that depends on the thread actually being suspended. This forces the suspend request to be processed, since it is a prerequisite for your operation; and since your operation is synchronous, you know that by the time it returns, the suspension has definitely occurred.

The traditional way of doing this is to call GetThreadContext, since this requires the kernel to read from the context of the suspended thread, which has as a prerequisite that the context be saved in the first place, which has as a prerequisite that the thread be suspended.
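
For example, a "suspend and wait until it sticks" helper might look something like the sketch below. The helper name is mine, not part of the API, and error handling is kept to a minimum.

// Sketch: suspend a thread and do not return until the suspension has
// actually taken effect. SuspendThreadSynchronous is an invented name.
BOOL SuspendThreadSynchronous(HANDLE thread)
{
 if (SuspendThread(thread) == (DWORD)-1) return FALSE;

 // GetThreadContext cannot succeed until the thread is actually
 // suspended and its context has been captured, so a successful call
 // means the suspension has completed.
 CONTEXT context;
 context.ContextFlags = CONTEXT_CONTROL;
 if (!GetThreadContext(thread, &context)) {
  ResumeThread(thread); // undo the suspend request on failure
  return FALSE;
 }
 return TRUE;
}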

Comments (22)
  1. mark says:

    Until the last paragraph I wondered how the CLR team would ever be able to make the GC work with an asynchronous SuspendThread. They need to know that a particular thread is stopped and then even change its CONTEXT.

  2. Sven2 says:

    Ian: Couldn't you implement that as a simple queue of events? You just create the event and wait for it. Events are always set by the respective previous owner of the FIFO CS. Management of the event list would have to be in a regular Critical Section of course.
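
    (For illustration, a rough sketch of this event-queue idea; the type and function names are invented, the waiter list is a fixed-size array with no overflow check for brevity, and error handling is omitted.)

       // Each waiter appends an auto-reset event to a FIFO list and waits on
       // it; the owner hands the lock to the oldest waiter by signaling its
       // event. The list itself is protected by an ordinary critical section.
       #include <windows.h>
       #include <string.h>

       typedef struct FifoLock {
        CRITICAL_SECTION cs;  // protects the fields below
        HANDLE waiters[64];   // events of waiting threads, in arrival order
        int count;
        BOOL held;
       } FifoLock;

       void FifoLockInit(FifoLock *lock)
       {
        InitializeCriticalSection(&lock->cs);
        lock->count = 0;
        lock->held = FALSE;
       }

       void FifoLockEnter(FifoLock *lock)
       {
        HANDLE event = NULL;
        EnterCriticalSection(&lock->cs);
        if (!lock->held) {
         lock->held = TRUE;                              // lock was free
        } else {
         event = CreateEvent(NULL, FALSE, FALSE, NULL);  // auto-reset
         lock->waiters[lock->count++] = event;           // join the queue
        }
        LeaveCriticalSection(&lock->cs);

        if (event) {
         WaitForSingleObject(event, INFINITE); // previous owner signals us
         CloseHandle(event);
        }
       }

       void FifoLockLeave(FifoLock *lock)
       {
        HANDLE next = NULL;
        EnterCriticalSection(&lock->cs);
        if (lock->count > 0) {
         next = lock->waiters[0];             // oldest waiter becomes owner
         memmove(&lock->waiters[0], &lock->waiters[1],
                 (--lock->count) * sizeof(HANDLE));
        } else {
         lock->held = FALSE;
        }
        LeaveCriticalSection(&lock->cs);
        if (next) SetEvent(next);  // ownership passes directly to the waiter
       }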

  3. Cesar says:

    @Ian: The reason you didn't see it happen in practice was probably because the thread was suspending itself, and some implementation detail in the scheduler made it act synchronously on the request if the thread making the request was the same as the target.

  4. Henri Hein says:

    I see the advantage to Ian's approach.  To use events, you would need one for each thread, so the savings in event objects can be substantial if there are a lot of workers.  What I don't like about it is that I would then have to restart each thread in order to stop that queue (say, during program exit or some kind of app-wide reset).

  5. Cesar says:

    Thinking on how it could have been implemented: SuspendThread sets a "please suspend yourself" flag on the target and fires an inter-processor interrupt to the core which is running the thread. When returning from any system call or interrupt, the "please suspend yourself" flag is checked, and if it's set, it jumps into the scheduler instead of returning.

    If that's how it's implemented, calling SuspendThread on yourself will always find the flag set on return from the SuspendThread system call, and so jump into the scheduler instead of returning. Call SuspendThread on a thread running on a different core, and the thread can still run for a while before that core receives the inter-processor interrupt and returns from it.

  6. documentation issue says:

    Seems like the MSDN documentation should just be explicit upfront about the suspension happening asynchronously. As written, there is little to suggest the asynchronicity:

       Remarks

       If the function succeeds, execution of the specified thread is suspended and the thread's suspend count is incremented

    The comment on "not intended to be used for thread synchronization" doesn't really help in this regard, as it is immediately followed up by an explanation of the deadlock situation, aka "Why you should never suspend a thread".  So it reads more like a prologue to the deadlock situation than a warning that the suspension itself happens asynchronously.

  7. alegr1 says:

    @Ian:

    A few problems with that approach:

    1. If you insert a thread context structure to a list, it needs to have a critical section to protect it. That critical section doesn't have a guarantee of FIFO ordering, so you lose order here.

    2. You have to call SuspendThread on yourself while outside any critical section, and the owner will call ResumeThread. But the problem here is that, because of a race condition, these calls may get reversed, and then the thread will remain suspended (ResumeThread on a running thread has no effect).

    A simple auto-reset event has a soft FIFO order behavior. Most of the time the ordering will be honored. If you're not running asynchronous I/O, the possibility of falling out of order will be even lower.

    And there is no concept or guarantee of ordering on a multi-processor system, anyway. Two threads may arrive at the EnterMyCriticalSection call in some order, but there is no guarantee that they will arrive at the locking instruction in the same order.

  8. Zach says:

    Since we're on the subject of SuspendThread, I would be very curious to know why when writing a debugger, it is not enough for a thread to be stopped via WaitForDebugEvent in order to call GetThreadContext.  In other words, WaitForDebugEvent() returns and says some thread N received an event.  If you call GetThreadContext() on a handle for N, the results are indeterminate.  You still must first call SuspendThread().

    This is confusing, because if you look at the documentation for ContinueDebugEvent, it says this:

    "After the ContinueDebugEvent function succeeds, the specified thread continues".

    This suggests that the thread is suspended.  Furthermore, on the MSDN page titled "Writing the Debugger's Main Loop" (msdn.microsoft.com/…/ms681675(v=vs.85).aspx) it says this:

    "When the debugging event occurs, the system suspends all threads in the process being debugged and notifies the debugger of the event."

    But my own testing suggests that GetThreadContext() is still not safe to call until you've called SuspendThread.  So something else is at play here.
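
    (In code, the defensive pattern being described looks roughly like the sketch below; error handling is omitted, and whether the extra SuspendThread is strictly necessary is exactly the open question.)

       // After WaitForDebugEvent reports an event, suspend the reported
       // thread explicitly before reading its context, then resume it
       // before continuing the debuggee.
       DEBUG_EVENT ev;
       while (WaitForDebugEvent(&ev, INFINITE)) {
        HANDLE thread = OpenThread(THREAD_GET_CONTEXT | THREAD_SUSPEND_RESUME,
                                   FALSE, ev.dwThreadId);
        if (thread != NULL) {
         SuspendThread(thread);            // belt and suspenders
         CONTEXT context;
         context.ContextFlags = CONTEXT_FULL;
         GetThreadContext(thread, &context);
         // ... inspect context ...
         ResumeThread(thread);
         CloseHandle(thread);
        }
        ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, DBG_CONTINUE);
       }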

  9. Ian says:

    A while ago I wrote what I think is one valid use of SuspendThread() outside a debugger. (Yes, I'm a special snowflake :-)). I needed a strict FIFO critical section (and yes, I know about the convoy problem) and I implemented it by having the thread wanting to enter the critical section call SuspendThread() *on itself*. When the current owner of the critical section leaves, it calls ResumeThread() on the waiting thread at the front of the queue. There were a couple of added complications to avoid problems with a thread being pre-empted between asking to enter the critical section and actually calling SuspendThread(). But the crucial point is that a thread only suspends itself, so you know it isn't holding a lock.

    However, I'm now wondering if it's possible for a thread to exit from SuspendThread(hSelf) immediately and then be suspended a few instructions later. I've never seen this happen in practice.

  10. 12BitSlab says:

    @ JM — SuspendThread() was supported in .Net through version 1.1.  It was deprecated in 2.0.

  11. acq says:

    Ian's implementation sounds like a ticking time bomb. FIFO means "a queue." Ian, I'm quite sure you didn't need "a strict FIFO critical section" as such, but rather needed the work to be processed in a certain order. So you probably needed a queue, and that is independent of your decision to use "SuspendThread".

  12. Ian says:

    @acq Every man and his dog will tell you "you don't want to do that" when he doesn't have a better way to do "that".

    Essentially what you seem to be saying is that I don't want to use threads plus a critical section to protect the resource; instead I should put the work in a queue and process it serially (so the resource doesn't need protecting). It's a fair point, but there are reasons why I can't do that in this particular case. But if I was starting again from scratch that is indeed how I would do it. Mucking about with thread scheduling is a mug's game.

    As to whether it is a ticking time bomb, it all depends on whether SuspendThread(hSelf) is indeed asynchronous or not…

  13. JM says:

    12BitSlab: don't you mean Thread.Suspend()? That's not related to how managed thread suspension can be made to work for purposes of GC (which is internal to the CLR). Thread.Suspend() is obsolete for the same reason SuspendThread() is a bad idea, but my point was that the CLR doesn't need SuspendThread() (or its managed equivalent) as long as it's willing to compromise on when it can do garbage collection (which seems a reasonable price to pay).

  14. 12BitSlab says:

    @ JM — Yes, you are correct.  Sometimes I switch too often between managed and non-managed code.

  15. IanBoyd says:

    I recently upgraded our development tools. In the intervening years they had deprecated their

       Thread.Suspend();

    and

       Thread.Resume();

    Saying that they are only meant to be used for debugging purposes, and not as a general mechanism to suspend and resume threads.

    They replaced the methods with only:

       Thread.Start();

    which is used to start a thread that was created with CREATE_SUSPENDED.

    Which has caused some changes to no longer depend on the ability to Suspend a thread.

  16. JM says:

    @mark: I haven't consulted the CLR source for this, but they wouldn't *need* SuspendThread() to make this work in the managed world, because, well, it's managed. That is to say, if a thread is running managed code, they can use whatever mechanism they like for cooperative suspension (an APC sounds like a good fit), and if it's running unmanaged code, they leave it alone (or rather, defer the suspend until we've returned to the managed world).

  17. Ian says:

    @alegr1 (2) was the complication I mentioned. ResumeThread() returns 0 if the thread is not yet suspended, so we have to spin (SwitchToThread/YieldProcessor) until ResumeThread() returns 1 to handle the race.
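
    (Presumably something along these lines on the owner's side; hWaiter is a placeholder for the waiting thread's handle, and the snippet ignores ResumeThread failing outright.)

       // Keep calling ResumeThread until it reports a previous suspend
       // count of 1, i.e. the waiter really had suspended itself.
       while (ResumeThread(hWaiter) == 0) {
        SwitchToThread();  // give the waiter a chance to finish suspending
       }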

    I overstated the strict FIFO requirement though, and thanks for pointing out that my solution isn't 100% strict. I'm much more concerned that under conditions of high load a thread shouldn't be repeatedly starved by a 'convoy' coming through. It's more important that *every* thread gets to run at roughly equal time intervals than that the overall throughput is maximised. On a multiprocessor system I guess it's very unlikely to be an issue anyway, but this code usually runs on single processor systems.

    @Cesar Very interesting, but unfortunately it's all conjecture. I'd love to know what the real answer is.

  18. Duke of New York says:

    "Don't suspend threads"

    "Why"

    "Because it would be bad"

    "Define bad"

    "Try to imagine all execution as you know it stopping instantaneously and your synchronization contracts exploding at the speed of light.

    "OK, right. That's bad"

  19. voo says:

    @Mark: I can't speak for the CLR, but I do know how Java HotSpot does it (usually): At certain well defined points in the code (before returning from functions, backedges on loops,..) there's code that reads from a specific page in memory. When the JVM wants to stop all threads it removes the read rights from the given page in memory.

    This causes all threads to sooner or later fault, which causes the fault handler to be executed which is where the JVM takes over control.

    This apparently has lower overhead than reading a flag and then jumping to some code, although the latter is/was done by the AArch64 JVM.
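
    (A very rough Win32-flavored sketch of that polling-page trick, purely for illustration; the real mechanism lives inside the JVM, every name below is invented, and the bookkeeping that waits for all threads to actually reach a safepoint is omitted.)

       #include <windows.h>

       static char *g_pollPage;      // worker threads read this at safepoints
       static HANDLE g_resumeEvent;  // manual-reset: set = "world may run"

       static LONG CALLBACK PollPageHandler(EXCEPTION_POINTERS *info)
       {
        EXCEPTION_RECORD *rec = info->ExceptionRecord;
        if (rec->ExceptionCode == EXCEPTION_ACCESS_VIOLATION &&
            (char *)rec->ExceptionInformation[1] == g_pollPage) {
         // The poll faulted because the page was protected: park here until
         // the controller reopens the page, then retry the read.
         WaitForSingleObject(g_resumeEvent, INFINITE);
         return EXCEPTION_CONTINUE_EXECUTION;
        }
        return EXCEPTION_CONTINUE_SEARCH;
       }

       void SafepointInit(void)
       {
        g_pollPage = (char *)VirtualAlloc(NULL, 4096,
                                          MEM_RESERVE | MEM_COMMIT,
                                          PAGE_READONLY);
        g_resumeEvent = CreateEvent(NULL, TRUE, TRUE, NULL);
        AddVectoredExceptionHandler(1, PollPageHandler);
       }

       void SafepointPoll(void)   // emitted at returns, loop back edges, ...
       {
        volatile char probe = *g_pollPage;
        (void)probe;
       }

       void StopTheWorld(void)
       {
        DWORD old;
        ResetEvent(g_resumeEvent);
        VirtualProtect(g_pollPage, 4096, PAGE_NOACCESS, &old);
       }

       void ResumeTheWorld(void)
       {
        DWORD old;
        VirtualProtect(g_pollPage, 4096, PAGE_READONLY, &old);
        SetEvent(g_resumeEvent);
       }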

  20. voo says:

    @JM: It's not even a compromise really. If you want to do a GC you have to know for every thread exactly where object pointers are (on the stack, in registers,..) – if you had to store that information for every single assembly instruction this would be incredibly large or you'd have to recompute it somehow from fixed points. While theoretically doable it just complicates everything for no good reason.

    In Java you also have the problem of native code working with Java objects (so no GC while any thread doing native code). In .NET I guess unsafe code segments (pointers) might be similarly problematic.

  21. ZLB says:

    @Ian: Could this be a case for Fibers because you need to decide yourself which thread runs next?

  22. Joshua says:

    @voo: For .NET if you're going to manipulate a managed object from native code, it is necessary to pin it from managed code first. This adds a reference (yes a true counted reference) and the GC knows to not even so much as move an object with reference > 0.

Comments are closed.
