Wait Chain Traversal in WER (For Hangs)

Windows Error Reporting (WER) for hung applications uses Wait Chain Traversal (WCT), a technology introduced in Windows Vista, to help diagnose hangs.  WER uses WCT to determine which processes contributed to a hang, and therefore which dumps to collect.  Developers need these additional dumps if they hope to fix the root cause of the hang.

To determine which processes were involved in a hang, WCT generates a data structure called a wait chain.  The wait chain is similar to a linked list in that it is an ordered collection of nodes.  Traversing the wait chain enumerates all threads involved in the hang.  If the final node in the wait chain points back to a previous node, the chain describes a deadlock.

The nodes in the wait chain alternate between a thread and a synchronization primitive.  Clients of WCT use this alternation to determine which thread holds a particular synchronization primitive, and whether that thread belongs to another process.

To illustrate, the following diagram shows a sample wait chain involving 3 threads and 2 synchronization primitives (a critical section and a mutex).  It is not a deadlock, because the final node is a thread that is busy doing work inside a fictitious function, ChangeInternalState().

Sample wait chain
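
For readers who prefer code to pictures, here is a minimal sketch of a chain with the same shape.  It is a toy model written for illustration, not the WCT API: the Node class, the is_deadlock() and threads_in() helpers, the thread labels, and the process IDs are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    kind: str                  # "thread", or a primitive such as "critical section"
    label: str                 # illustrative name only
    pid: Optional[int] = None  # owning process; only set for thread nodes


def is_deadlock(chain: List[Node]) -> bool:
    """True when the final node points back to an earlier node in the chain."""
    return len(chain) > 1 and chain[-1] in chain[:-1]


def threads_in(chain: List[Node]) -> List[Node]:
    """Enumerate the distinct thread nodes in chain order."""
    threads: List[Node] = []
    for node in chain:
        if node.kind == "thread" and node not in threads:
            threads.append(node)
    return threads


# Hypothetical chain: 3 threads and 2 primitives. The last thread is
# running (inside ChangeInternalState()), not waiting, so the chain ends there.
sample = [
    Node("thread", "Thread A", pid=100),
    Node("critical section", "cs"),
    Node("thread", "Thread B", pid=100),
    Node("mutex", "m"),
    Node("thread", "Thread C", pid=200),
]
```

Calling is_deadlock(sample) returns False because the last node is a new thread.  Appending another primitive followed by Thread A again would make the final node point back to the first one, and the same check would then report a deadlock.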

Of course, not all hangs involve multiple processes or even multiple threads.  For example, if a thread is stuck in an infinite loop or a Sleep() call, WCT will not report any other threads involved in the hang.  In order for WCT to detect multiple threads in the wait chain, there has to be a clear indication that another thread is responsible for blocking the thread being investigated.  In other words, the mechanism by which a thread becomes blocked has to have a concept of an owner.

As of Windows 7, only the following synchronization primitives can be traced to an owner thread by WCT:

  • ALPC (Advanced Local Procedure Calls)

  • COM

  • Critical Sections

  • Mutexes (via WaitForSingleObject calls)

  • SendMessage

  • Process handles (via WaitForSingleObject calls)

  • Thread handles (via WaitForSingleObject calls)

Not every synchronization primitive available in Windows is supported by WCT.  For example, Win32 Events do not appear on this list because they have no clear owner: Which thread in the system is going to fire a particular Event?  If a thread is stuck on a WaitForSingleObject() call with an Event handle, WCT will not be able to determine any further nodes in the wait chain.

Similarly, a call to WaitForMultipleObjects() has no clear owner, even if all handles passed into it do have clear owners: Which of the handles passed in should be tracked?
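
The ownership rule can be sketched as a small traversal.  This is an illustrative model only, under the assumption that we already know what each thread is waiting on and who owns each object; build_wait_chain(), OWNED_KINDS, and the thread and object labels are hypothetical names, not part of WCT.

```python
from typing import Dict, List, Tuple

# Kinds of waits that can be traced to an owner, following the list above.
# The string labels themselves are made up for this sketch.
OWNED_KINDS = {
    "alpc", "com", "critical section", "mutex",
    "sendmessage", "process handle", "thread handle",
}


def build_wait_chain(
    start_thread: str,
    waiting_on: Dict[str, Tuple[str, str]],  # thread -> (kind, object label)
    owner_of: Dict[str, str],                # object label -> owning thread
) -> List[str]:
    """Follow 'waits on' and 'owned by' links until the chain cannot be extended."""
    chain = [start_thread]
    current = start_thread
    while current in waiting_on:
        kind, obj = waiting_on[current]
        chain.append(f"{kind}:{obj}")
        # A primitive without a traceable owner (for example an event) ends the chain.
        owner = owner_of.get(obj) if kind in OWNED_KINDS else None
        if owner is None:
            break
        chain.append(owner)
        if owner in chain[:-1]:
            break  # the final node points back to an earlier one: deadlock
        current = owner
    return chain
```

With a thread blocked on an event, the chain stops after two nodes no matter what else is happening in the system.  With a thread blocked on a critical section, the traversal continues into the owning thread and whatever that thread is waiting on in turn.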

Armed with the knowledge of which threads are involved in a hang, WER can diagnose hangs that span multiple processes.  In this case, a developer viewing a dump generated by WER for the hang will likely be very interested in a dump of the other processes, in order to determine the root cause.
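
As a rough sketch of that idea, the following reduces a wait chain to the set of processes worth dumping.  It assumes the chain has already been flattened to (thread ID, process ID) pairs; processes_to_dump() and the IDs in the usage note are hypothetical, and this is not a description of how WER itself is implemented.

```python
from typing import List, Tuple


def processes_to_dump(chain: List[Tuple[int, int]], hung_pid: int) -> List[int]:
    """Return the distinct process IDs seen in the chain, hung process first.

    chain holds (thread_id, process_id) pairs in wait-chain order.
    """
    pids = [hung_pid]
    for _thread_id, pid in chain:
        if pid not in pids:
            pids.append(pid)
    return pids
```

For a chain such as [(1, 100), (2, 100), (7, 200)] with hung process 100, the result contains both 100 and 200, which signals that the hang spans two processes and that a dump of the second one is needed.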

Unfortunately, many developers who view WER data fall into the trap of only fixing the root cause of the hang.  There are actually two issues that need to be fixed: [1] the root cause of the hang in some other thread (or process) and [2] the block on the UI thread.  It is important to fix the root cause of the hang, but the UI thread should not have been in a position to become blocked on it in the first place.  The user needs to be kept in control of their application!
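
One way to address the second issue is to make sure the UI thread never waits indefinitely on something another thread owns.  The sketch below shows the general pattern with a non-blocking lock attempt; it is only one possible approach, and try_update_ui() and the on_busy callback are hypothetical names chosen for this example.

```python
import threading
from typing import Callable


def try_update_ui(shared_lock: threading.Lock, on_busy: Callable[[], None]) -> bool:
    """Attempt the protected work without ever blocking the calling (UI) thread."""
    if shared_lock.acquire(blocking=False):
        try:
            # The work that needs the lock would run here.
            return True
        finally:
            shared_lock.release()
    # The lock is held elsewhere: tell the user and retry on a later tick
    # instead of waiting.
    on_busy()
    return False
```

If the owner of the lock is itself stuck, the UI thread in this pattern keeps pumping messages and can offer the user a way to cancel, rather than becoming the first node in a wait chain.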
