Psychic debugging: The first step in diagnosing a deadlock is a simple matter of following the money

Somebody asked our team for help because they believed they hit a deadlock in their program's UI. (It's unclear why they asked our team, but I guess since our team uses the window manager, and their program uses the window manager, we're all in the same boat. You'd think they'd ask the window manager team for help.)

But it turns out that solving the problem required no special expertise. In fact, you probably know enough to solve it, too.

Here are the interesting threads:

  0  Id: 980.d30 Suspend: 1 Teb: 7ffdf000 Unfrozen
ChildEBP RetAddr  
0023dc90 7745dd8c ntdll!KiFastSystemCallRet 
0023dc94 774619e0 ntdll!ZwWaitForSingleObject+0xc 
0023dcf8 774618fb ntdll!RtlpWaitOnCriticalSection+0x154 
0023dd20 00cd03f2 ntdll!RtlEnterCriticalSection+0x152 
0023dd38 00cd0635 myapp!LogMsg+0x15 
0023dd58 00cd0c6a myapp!LogRawIndirect+0x27 
0023fcb8 00cb64a7 myapp!Log+0x62 
0023fce8 00cd7598 myapp!SimpleClientConfiguration::Cleanup+0x17 
0023fcf8 00cd8ffe myapp!MsgProc+0x1a9 
0023fd10 00cda1a9 myapp!Close+0x43 
0023fd24 761636d2 myapp!WndProc+0x62 
0023fd50 7616330c USER32!InternalCallWinProc+0x23 
0023fdc8 76164030 USER32!UserCallWinProcCheckWow+0x14b 
0023fe2c 76164088 USER32!DispatchMessageWorker+0x322 
0023fe3c 00cda3ba USER32!DispatchMessageW+0xf 
0023fe9c 00cd0273 myapp!GuiMain+0xe8 
0023feb4 00ccdeca myapp!wWinMain+0x87 
0023ff48 7735c6fc myapp!__wmainCRTStartup+0x150 
0023ff54 7742e33f kernel32!BaseThreadInitThunk+0xe 
0023ff94 00000000 ntdll!_RtlUserThreadStart+0x23 
   1  Id: 980.ce8 Suspend: 1 Teb: 7ffdd000 Unfrozen
ChildEBP RetAddr  
00f8d550 76162f81 ntdll!KiFastSystemCallRet 
00f8d554 76162fc4 USER32!NtUserSetWindowLong+0xc 
00f8d578 76162fe5 USER32!_SetWindowLong+0x131 
00f8d590 74aa5c2b USER32!SetWindowLongW+0x15 
00f8d5a4 74aa5b65 comctl32_74a70000!ClearWindowStyle+0x23 
00f8d5cc 74ca568f comctl32_74a70000!CCSetScrollInfo+0x103 
00f8d618 76164ea2 uxtheme!ThemeSetScrollInfoProc+0x10e 
00f8d660 00cdd913 USER32!SetScrollInfo+0x57 
00f8d694 00cdf0a4 myapp!SetScrollRange+0x3b 
00f8d6d4 00cdd777 myapp!TextOutputStringColor+0x134 
00f8d93c 00cd04c4 myapp!TextLogMsgProc+0x3db 
00f8d960 00cd0635 myapp!LogMsg+0xe7 
00f8d980 00cd0c6a myapp!LogRawIndirect+0x27 
00f8f8e0 00cd6367 myapp!Log+0x62 
00f8faf0 7735c6fc myapp!remote_ext::ServerListenerThread+0x45c 
00f8fafc 7742e33f kernel32!BaseThreadInitThunk+0xe 
00f8fb3c 00000000 ntdll!_RtlUserThreadStart+0x23 

The thing about debugging deadlocks is that you usually don't need to understand what's going on. The diagnosis is largely mechanical once you get your foot in the door. (Though sometimes it's hard to get your initial footing.)

Let's look at thread 0. It is waiting for a critical section. The owner of that critical section is thread 1. How do I know that? Well, I could've debugged it, or I could've used my psychic powers to say, "Gosh, that function is called LogMsg, and look there's another thread that is inside the function LogMsg. I bet that function is using a critical section to ensure that only one thread uses it at a time."

Okay, so thread 0 is waiting for thread 1. What is thread 1 doing? Well, it entered the critical section back in the LogMsg function, and then it did some text processing and, oh look, it's doing a SetScrollInfo. The SetScrollInfo went into comctl32 and ultimately resulted in a SetWindowLong. The window that the application passed to SetScrollInfo is owned by thread 0. How do I know that? Well, I could've debugged it, or I could've used my psychic powers to say, "Gosh, the change in the scroll info has led to a change in window styles, and the thread is trying to notify the window of the change in style. The window clearly belongs to another thread; otherwise we wouldn't be stuck in the first place, and given that we see only two threads, there isn't much choice as to what other thread it could be!"

At this point, I think you see the deadlock. Thread 0 is waiting for thread 1 to exit the critical section, but thread 1 is waiting for thread 0 to process the style change message.

What happened here is that the program sent a message while holding a critical section. Since message handling can trigger hooks and cross-thread activity, you cannot hold any resources when you send a message because the hook or the message recipient might want to acquire that resource that you own, resulting in a deadlock.

Comments (26)
  1. Kip says:

    “But it turns out that solving the problem required no special expertise. In fact, you probably know enough to solve it, too.

    At this point, I think you see the deadlock.”

    You have a lot more confidence in me than I do.  I didn’t really follow any of that.  But then again I don’t do any Windows programming…

    [One more comment like that and I’m bringing back the nitpicker’s corner. You should know the prerequisites for this blog by now. If you don’t meet the prerequisites, then don’t complain that the subject matter is too hard. -Raymond]
  2. Nitpicker says:

    The first rule of nitpicking: do not talk about nitpicking.  The second rule of nitpicking: DO NOT TALK ABOUT NITPICKING.

  3. Kip says:

    Jeez I didn’t realize that counted as nitpicking.. it was meant as a joke (and a compliment, in a way).

  4. Peter Ritchie says:

    Other than holding resources during a message send and that sending a message to a window could block the current thread while another is “awoken”; is the fact that the “window clearly belongs to another thread” significant?

    [It’s not just significant – it’s the root cause! Maybe you need to read the article again, make sure you understand that. -Raymond]
  5. J says:

    "is the fact that the "window clearly belongs to another thread" significant?"

    I guess the code must behave differently when you call SetScrollInfo() for a window owned by the same thread.  The function clearly couldn’t wait for its own thread to process the style change message in that case, because it’d never happen until at least SetScrollInfo() returned.

  6. Andrew Cook says:

    I think I can see what the original app was doing that caused the wedge.

    Thread 0 is the UI thread for FooApp, an app that as part of normal operation displays a logging pane to the user.

    Thread 1 is a background listener thread in FooApp that normally waits for external requests, logging messages as it goes.

    LogMsg’s job is, in addition to — or instead of — writing an entry to a log file, to display a logged message in the logging pane and scroll to it for the user. Many programs I’ve seen follow this pattern.

    In the course of doing its job, the background thread finds a need to log a message ("New FTP connection to /warez/") and calls into LogMsg to report this to the user.

    Meanwhile, the user clicks the close button in the UI, prompting a message to be written to the log ("Shutting down the application…").

    Hence the deadlock.

    Hm… actually, I think I even know which app we’re talking about here.

  7. Andrew Cook says:

    Oh, forgot to mention: The reason a critical section was probably used is to ensure in-order delivery of log messages, and that log messages would not overprint each other. As well as to make sure that scrolling to the bottom actually scrolls to the bottom. And that LogMsg was written either for a single-threaded environment or for one where SendMessage == PostMessage.

  8. Peter Ritchie says:

    Okay, maybe I should be more clear.  The root cause is a thread that owns a critical section is blocked waiting for another thread, that other thread is trying to also own that lock–a deadlock.  You can have that happen without a window being owned by another thread and without the critical section (or any other locking mechanism) there wouldn’t be this deadlock.  Maybe the fact that communicating to a window owned by another thread is a cross-thread call is overly apparent to me…  My question is, over and above the fact that communicating to a window owned by another thread is a cross-thread call, whether a window clearly belongs to another thread is otherwise significant.

    [You already gave the answer in your question: “Communicating to a window owned by another thread is a cross-thread call.” I must be misunderstanding the question. Maybe somebody else can answer it. -Raymond]
  9. John says:

    Your question doesn’t really make sense.  It’s like asking "Aside from the fact that x is significant, is x significant?"  I guess that technically the answer is "no", but it’s not really a useful question to ask.

  10. Peter Ritchie says:

    I’ll take that to mean there is no other significance to the fact that the window is owned by another thread…  i.e. communicating with a window owned by another thread (in the same process) will not introduce other problems when synchronization is not used.

  11. Peter Ritchie says:

    The question was meant not to be “Aside from the fact that x is significant, is x significant” but to be more like “aside from resulting in a cross-thread call and the resulting side-effects of two threads vying for the same lock, is the fact that ‘the window clearly belongs to another thread’ significant”

    The fact that calling SetScrollInfo from a second thread results in a cross-thread call (to the owning thread) was already very obvious to me, “the window clearly belongs to a another thread” seemed extraneous and seemed (to me) to imply something more…

    [It’s obvious. That’s why I said “clearly” and then just for good measure, elaborated as to why the statement is obvious. -Raymond]
  12. Erik Funkenbusch says:

    Psychic Debugging… LOL

    I’ve never quite known what to call my method of debugging.  I’m good at it, but it defies all logic.  I just seem to infer and make illogical jumps and successfully find the problem.

    Now I know what to call it ;)

  13. steveshe says:

    Funkenbush: No, Raymond’s Psychic Debugging is based on purely logial leaps based on knowledge and experience. Fixing things using illogical leaps has another name: Guessing.

    If you’re good at fixing things that way it just means the problems are really straightforward or your are very lucky…

  14. Dave says:

    This has been a very helpful post for me. I am going to name my rock band "Unfrozen Child".

  15. cmov says:

    Nice post, good WTF. Thank you.

    I can think of one way in which the WTF would be even worse; If thread 0 uses MWMO_ALERTABLE to alertably wait on the critical section and thread 1 performs APC on thread 0 before it returns the section. I’ve never tried any such thing before and probably never will ;)

  16. Richard Taylor says:

    I’ve always followed the practice of never touching anything UI related from a thread other than the thread that created the UI element. So if I had been writing this app I would have had my LogMsg function post a message to the window that owns the UI element so it could do any UI updating.

    In .NET 2.0 a "cross thread call" exception will be thrown if you try to do anything to a UI element from a thead that didn’t create the element. You have to use an Invoke to run the UI code on the correct thread.

  17. Jeff says:

    It’s unclear why they asked our team

    Because your team has Raymond Chen on it, and he’s got a big ‘ol economy-size can of debugging whoop-ass!

    Back on topic, this isn’t a Windows-only thing. Lots of people can’t get their heads around the fact "thou shalt not make $GUI_TOOLKIT calls in another thread" in GTK, Tk/TCL, or whatever.

    I even see people saying "but the thread belongs to the same program, why can’t it do that?" and they just don’t understand events, messages, and/or GUI "main" loops at all.

    I even had one person ask why two threads couldn’t be in a critical section at the same time >:-{

  18. xiaoguo ge says:

    I wonder what application the other team was developing. If they needed outside help on a problem like this, I am really worried about the quality of their product.

  19. Dataland says:

    When working within synchronized code, you should only do what needs to be synchronized. From msdn:

     "The SetScrollInfo function sets the parameters of a scroll bar, including the minimum and maximum scrolling positions, the page size, and the position of the scroll box (thumb). The function also redraws the scroll bar, if requested."

    I think that the ISV application should *not* have been fiddling with scroll info/position inside of LogMsg.  Instead, UI interaction logic should have been performed either before or after the code that is sync’d by critical section.

    IMHO the bigger problem here is poorly designed code.

  20. Rune Moberg says:

    "If they needed outside help on a problem like this,"

    Well… Sometimes the obvious solution could be staring you straight in the face, but you won’t grasp the solution until you get outside help. Even just talking with someone (without any verbal input from the other party) can sometimes dislodge the solution…

    Truth be told, many programmers are not comfortable around debuggers. Here, AFAICT, the schools focus on writing code, not debugging. It is assumed that if something compiles without warnings, then everything is honky-dorey. (And java is their main language/platform from the start)

    Actually, the majority of applications are total crap. So you should worry a lot. :)

    Personally I started out with the debugger before I wrote much code. I fixed a range check issue with an application someone else wrote, with only the executable to go by (I didn’t even know, back then, that it had been compiled with TP). At that point (ca 1989) my programming experience was very limited. I doubt many of today’s kids interact much with debuggers.

  21. Krunch says:

    Back on topic, this isn’t a Windows-only thing.

    Lots of people can’t get their heads around the fact

    "thou shalt not make $GUI_TOOLKIT calls in another

    thread" in GTK, Tk/TCL, or whatever.

    This isn’t even a GUI-only thing. While writing threaded networking code I learnt the hard way that

    there is no good way to take more than one lock in arbitrary order. That’s basic CS knowledge (does the dining philosophers problem ring any bell ?) but nobody seems to remember these stuff.

    Quoted from a linked article:

    Hangs and deadlocks are tricky to debug because

    there is no unhandled exception that says, "Look

    at me, I’m a bug!" Instead, the program just grinds

    to a halt and you have to go spelunking to figure

    out why.

    I was just thinking the opposite this afternoon: deadlocks are easier to debug because you can examine the bug live for as long as you want. You don’t need a test case to reproduce it. When your program crashes, you only have a dump (if you are not too unlucky) which doesn’t necessarily contains the information you need.

    Of course this only applies if the bug happens on the developer’s computer. If the deadlock happens on a production system on which you are not supposed to install debugging tools I whish you good luck but it shouldn’t have happened anyway :)

  22. Ian Boyd says:

    It all made sense once you got to the punch line, “…thread 1 is waiting for thread 0 to process the style change message.”  But then looking again at the stack trace: shouldn’t one of the last functions in the stack of thread 1 been a SendMessage?

    Looking at the help for SetWindowLong  i see no mention that it uses SendMessage, or that it requires the window to be pumping messages.

    Otherwise this turns into one of those things that we have to watch out for because it said so on that blog, rather than in the docs.

    The only exception might be if you’re using SetWindowLong(GWL_WNDPROC) to subclass a window, but even then it would only affect the next message, not the handling of any current one.

    Without the tidbit from The Old New Thing blog that SetWindowLong requires the owner to be pumping messages i would never have found the cause of the deadlock. i’d be sitting at SetScrollInfo() in a debugger and asking, “How can SetScrollInfo possibly cause a deadlock?”

    [If the thread isn’t pumping messages, how can it receive the style change message? -Raymond]
  23. Ian Boyd says:

    [If the thread isn’t pumping messages, how can it receive the style change message? -Raymond]

    Ahh, the WM_STYLECHANGED message. "The WM_STYLECHANGED message is sent to a window after the SetWindowLong function has changed one or more of the window’s styles."

    Having a mention of such a message in SetWindowLong documentation would have been very helpful to the original guys – and me if i were faced with the same deadlock.

    Added mention to the community content section of SetWindowLong()

  24. Jonathan says:

    Is it better to avoid using SendMessage from a worker thread to a window in general (and use PostMessage instead), or is this only true when a resource is being held by the worker thread?

  25. BryanK says:

    Jonathan: I’m not sure about "in general", but the very fact that the worker thread hasn’t exited is sometimes a resource.  I.e., say the window’s thread is trying to shut down, and has set an event that all the worker threads are supposed to wait on, and exit if it gets set.  The window’s thread wants to ensure all the worker threads are actually gone (so that the process doesn’t stay running), so it joins each of them after setting the event.

    Now if one of those worker threads got some other notification just before the event was set, and is trying to SendMessage to the window’s thread, but the window’s thread is trying to join the worker thread, that’s the same type of deadlock.  PostMessage, OTOH, would work fine (although the notification would be lost).

    (I’ve run into this several times using .net’s Invoke() from worker threads.  The solution was to use BeginInvoke() (mostly equivalent to PostMessage) and ensure that the worker thread didn’t need to know what the window’s thread did with the notification (so it didn’t need to wait for its processing to finish).  If you did need to wait, you could use the WaitHandle exposed by .net’s IAsyncResult (wait on both it and the shutdown event), but IMO it’s better to make the notification really be a notification, if possible.)

Comments are closed.