Swamping the thread pool with work faster than it can drain


This scenarios is real, but details have been changed to protect the guilty.

Consider a window showing the top of a multi-page document. The developers found that when the user clicks the down-arrow button on the scroll bar, the program locks up for 45 seconds, over a dozen threads are created, and then suddenly everything clears up and the window displays the final paragraph of the document (i.e., it scrolled all the way to the bottom).

The problem was traced to queueing tasks to the thread pool faster than they can drain.

The document is an object which, unlike a window, has no thread affinity. (Naturally, there are critical sections around the various document methods so you don't have corruption if two threads try to modify the document at the same time.) The way to display a different portion of the document is to call a method which changes the viewport location.

When the user clicks the down-arrow button on the scroll bar, the main thread creates a periodic timer at four fifths of the double-click speed, and each time the timer elapses, it does the equivalent of document.ScrollDown(). The code cancels the timer once the user releases the mouse button.

The catch is that the document was so complicated that it takes a long time to change the view top and redraw the new view. (I don't remember exactly how long, but let's say it was 700ms. The important thing is that it takes longer than 400ms.)

Given that set-up, you can see what happens when the user clicks the scroll down-arrow. The initial scroll is initiated, and before it can complete, another scroll is queued to the thread pool. The document view keeps trying to update its position, but the periodic timer generates scroll requests faster than the document view can keep up.

If that description was too terse, here's a longer version.

The code for scrolling went something like this:

OnBeginScrollDown()
{
 // Start a timer to do the scrolling
 CreateTimerQueueTimer(&htimer, NULL, ScrollAgain, NULL,
     0, GetDoubleClickTime() * 4 / 5, WT_EXECUTEDEFAULT);
}

OnEndScrollDown()
{
 if (htimer != NULL) {
  DeleteTimerQueueTimer(NULL, htimer, INVALID_HANDLE_VALUE);
  htimer = NULL;
 }
}

ScrollAgain(...)
{
 document.ScrollDown();
}

(In reality, the program didn't use the CreateTimerQueueTimer function—it had a custom timer queue and a custom thread pool—but the effect is the same.)

At time T = 0, the user clicks on the scroll bar down-arrow. The UI thread starts the timer with an initial delay of zero and a period of 400ms. The timer fires immediately, and a thread pool thread is asked to run ScrollAgain. The ScrollAgain function calls ScrollDown, which begins the process of scrolling the document.

At time T = 400ms, the periodic timer fires, and a new thread pool thread is created to service it. Pool thread 2 calls ScrollDown() and blocks.

At time T = 700ms, the ScrollDown call on pool thread 1 completes, and now pool thread 2 can begin its call to ScrollDown().

At time T = 800ms, the periodic timer fires again, and pool thread 1 (now idle) is asked to handle it. Pool thread 1 calls ScrollDown() and blocks.

At time T = 1200ms, the periodic timer fires yet again. This time, there are no idle threads in the thread pool, so the thread pool manager creates yet another thread to service the timer. Pool thread 3 calls ScrollDown() and blocks.

At time T = 1400ms, the ScrollDown() call issued by pool thread 2 completes. Pool thread 2 now returns to idle. Now the call to ScrollDown() from pool thread 1 (issued at time T = 800ms) can start.

At time T = 1600ms, the periodic timer fires again, and pool thread 2 is chosen to service it. Pool thread 2 calls ScrollDown() and blocks.

At time T = 2000ms, the periodic timer fires again, and a new pool thread is created to service it. Pool thread 4 calls ScrollDown() and blocks.

You can see where this is going, I hope. Work is being generated by the periodic timer at a rate of one work item per 400ms, but it takes 700ms to carry out each work item, and the tasks are serialized on the document. It's like Lucy in the chocolate factory. The document is frantically trying to carry out all the work, and it never manages to catch up. Eventually, the document scrolls all the way to the bottom, and the mass of pent-up calls to ScrollDown() all return immediately since there is no more scrolling possible.

Now that the document is idle, it can paint, and that's where the user finally sees the document, scrolled all the way to the bottom.

There are a number of possible solutions here.

One way is not to queue up another scroll while an old one is still running. Instead, just wait for it to finish, and then issue a new scroll that accumulates all the scrolling that had taken place while you were waiting for the first to complete. This results in jerky scrolling, however, and it creates a lag of up to 700ms between the user releasing the mouse button and scrolling actually stopping.

Another approach is to disable repainting the entire document when you detect that you are in the document is too complex to scroll quickly case and just scroll the scrollbar thumb. When the user stops scrolling, re-enable painting and boom the document appears at the user's chosen location. This preserves responsiveness, but you lose the ability to see the document as you scroll it.

I don't know what solution the customer finally went with. I was just there to help with the debugging.

Bonus example: Larry Osterman describes another situation with the same underlying cause.

Hidden take-away: Observe that both of these examples illustrate one of the subtle consequences of a design which moves all processing off the UI thread.

Update: Note that Set­Timer wouldn't have helped here.

case WM_TIMER:
  if (wParam == SCROLLTIMER) {
    QueueUserWorkItem(ScrollAgain, NULL, WT_EXECUTEDEFAULT);
  }
  ...

Since the processing has been moved off the UI thread, the WM_TIMER messages are free to keep flowing in and queue up work faster than the background thread can keep up.

Comments (9)
  1. Alex Grigoriev says:

    Wait, so the callbacks on the same timer object are NOT serialized? That a single reason not to use this abomination. Use old good SetTimer on your own message loop thread instead.

    And why does one need to repaint on every timer? Repaint as you can keep up, but use a new position during WM_PAINT. Also, ScrollWindow(Ex) is there for a reason. There is no need to invalidate and repaint the whole client region. Somebody PLEASE tell WinDbg developpers about this function FAST.

    [(I thought I mentioned in the article that it was a custom timer queue, not a standard one.) And "good old SetTimer" runs on the UI thread, and the goal was to move processing off the UI thread. As I recall, painting was fast. It was all the bookkeeping related to moving the viewport (layout) that was slow. Mind you, this story is from over a decade ago, so I don't remember the details. Of course, moving the viewport shouldn't trigger layout, but if that bugs you, just change the scenario: The scrollbar controls the magnification of a Mandelbrot set. Which is better, leaving some aspects of the original scenario intact (for realism) or changing the fundamentals of the scenario (so people won't get distracted by side issues)? -Raymond]
  2. David Walker says:

    Hmm….  When I am connected to another computer using any of the various VNC programs, I often find that a single click in the middle of the right-hand scroll bar of ANY document (where I am trying to page down by one page) results in multiple scroll-one-page-down commands getting executed.  I wonder if that's what is happening here.

    Most of the VNC programs have some code to rate-limit mouse events, but I don't think multiple mouse clicks are the cause of what I am seeing.  What is described here might explain what I am seeing.

  3. Alex Grigoriev says:

    @DW:

    VNC-style remote access programs don't usually work like RDP. They mirror the screen state; a program's (re)drawing speed is limited by the program's host.

    RDP, though, is sending actual drawing commands over the wire.

    Also, Shift+Click on the scroll bar position will get you there at once.

  4. Alan B says:

    We used to see the same problem when using .NET's System.Threading.Timer.  But, now in our code reviews we treat periodic timers as unsafe until proven otherwise.  Using a one-shot timer and making the callback be responsible for rescheduling if and when appropriate forces the developer to think about what solution they want to the problem.

  5. Henning Makholm says:

    I must be missing something here. If the heavy work is happening off-thread, then why doesn't this happen:

    "At T=200 ms, the mouse button is released, so the UI thread cancels the periodic timer, and the whole avalanche is prevented."

    [I never did figure out that part. My guess is the attempt to cancel the periodic timer itself requires the document critical section. -Raymond]
  6. The best part of the article was the video. Had a good laugh. Nice analogy.

  7. Faster code please says:

    This is again a "can anyone please fix my sloppy code". Please tell them to write faster code. Even a mandelbrot can render in sub 400ms time if doing it right.

  8. Mike D says:

    @Faster code please

    Okay, so now you're showing a 500MB document that's being dynamically streamed from a server on a different continent. Because the file is so huge, you only download the section of the document that you're currently showing, but because of all the embedded fonts and images and stuff, it still takes 500ms to download a single page.

    Some things are slow. Dealing with it is a real engineering problem.

  9. Cheong says:

    @Mike D: Agreed. And let's not take a document that may not contain page label as an example.

    Writing a good datagrid that can open a table containing millions of row, yet can scroll smoothly is a REAL engineering problem.

Comments are closed.

Skip to main content