Ready… cancel… wait for it! (part 1)


One of the cardinal rules of the OVERLAPPED structure is the OVERLAPPED structure must remain valid until the I/O completes. The reason is that the OVERLAPPED structure is manipulated by address rather than by value.

The word complete here has a specific technical meaning. It doesn’t mean “must remain valid until you are no longer interested in the result of the I/O.” It means that the structure must remain valid until the I/O subsystem has signaled that the I/O operation is finally over, that there is nothing left to do, it has passed on: You have an ex-I/O operation.

Note that an I/O operation can complete successfully, or it can complete unsuccessfully. Completion is not the same as success.

A common mistake when performing overlapped I/O is issuing a cancel and immediately freeing the OVERLAPPED structure. For example:

// this code is wrong
 HANDLE h = ...; // handle to file opened as FILE_FLAG_OVERLAPPED
 OVERLAPPED o;
 BYTE buffer[1024];
 InitializeOverlapped(&o); // creates the event etc
 if (ReadFile(h, buffer, sizeof(buffer), NULL, &o) ||
     GetLastError() == ERROR_IO_PENDING) {
  if (WaitForSingleObject(o.hEvent, 1000) != WAIT_OBJECT_0) {
   // took longer than 1 second - cancel it and give up
   CancelIo(h);
   return WAIT_TIMEOUT;
  }
  ... use the results ...
 }
 ...

The bug here is that after calling Cancel­Io, the function returns without waiting for the Read­File to complete. Returning from the function implicitly frees the automatic variable o. When the Read­File finally completes, the I/O system is now writing to stack memory that has been freed and is probably being reused by another function. The result is impossible to debug: First of all, it’s a race condition between your code and the I/O subsystem, and breaking into the debugger doesn’t stop the I/O subsystem. If you step through the code, you don’t see the corruption, because the I/O completes while you’re broken into the debugger.

Here’s what happens when the program is run outside the debugger:

ReadFile I/O begins
WaitForSingleObject I/O still in progress
WaitForSingleObject times out
CancelIo I/O cancellation submitted to device driver
return
Device driver was busy reading from the hard drive
Device driver receives the cancellation
Device driver abandons the rest of the read operation
Device driver reports that I/O has been canceled
I/O subsystem writes STATUS_CANCELED to OVERLAPPED structure
I/O subsystem queues the completion function (if applicable)
I/O subsystem signals the completion event (if applicable)
I/O operation is now complete

When the I/O subsystem receives word from the device driver that the cancellation has completed, it performs the usual operations when an I/O operation completes: It updates the OVERLAPPED structure with the results of the I/O operation, and notifies whoever wanted to be notified that the I/O is finished.

Notice that when it updates the OVERLAPPED structure, it’s updating memory that has already been freed back to the stack, which means that it’s corrupting the stack of whatever function happens to be running right now. (It’s even worse if you happened to catch it while it was in the process of updating the buffer!) Since the precise timing of I/O is unpredictable, the program crashes with memory corruption that keeps changing each time it happens.

If you try to debug the program, you get this:

ReadFile I/O begins
WaitForSingleObject I/O still in progress
WaitForSingleObject times out
Breakpoint hit on Cancel­Io statement
Stops in debugger
Hit F10 to step over the CancelIo call I/O cancellation submitted to device driver
Breakpoint hit on return statement
Stops in debugger
Device driver was busy reading from the hard drive
Device driver receives the cancellation
Device driver abandons the rest of the read operation
Device driver reports that I/O has been canceled
I/O subsystem writes STATUS_CANCELED to OVERLAPPED structure
I/O subsystem queues the completion function (if applicable)
I/O subsystem signals the completion event (if applicable)
I/O operation is now complete
Look at the OVERLAPPED structure in the debugger
It says STATUS_CANCELED
Hit F5 to resume execution
No memory corruption

Breaking into the debugger changed the timing of the I/O operation relative to program execution. Now, the I/O completes before the function returns, and consequently there is no memory corruption. You look at the OVERLAPPED structure and say, “See? Immediately on return from the Cancel­Io function, the OVERLAPPED structure has been updated with the result, and the buffer contents are not being written to. It’s safe to free them both now. Therefore, this can’t be the source of my memory corruption bug.”

Except, of course, that it is.

This is even more crazily insidious because the OVERLAPPED structure and the buffer are updated by the I/O subsystem, which means that it happens from kernel mode. This means that write breakpoints set by your debugger won’t fire. Even if you manage to narrow down the corruption to “it happens somewhere in this function”, your breakpoints will never see it as it happens. You’re going to see that the value was good, then a little while later, the value was bad, and yet your write breakpoint never fired. You’re then going to declare that the world has gone mad and seriously consider a different line of work.

To fix this race condition, you have to delay freeing the OVERLAPPED structure and the associated buffer until the I/O is complete and anything else that’s using them has also given up their claim to it.

   // took longer than 1 second - cancel it and give up
   CancelIo(h);
   WaitForSingleObject(o.hEvent, INFINITE); // added
   // Alternatively: GetOverlappedResult(h, &o, TRUE);
   return WAIT_TIMEOUT;

The Wait­For­Single­Object after the Cancel­Io waits for the I/O to complete before finally returning (and implicitly freeing the OVERLAPPED structure and the buffer on the stack). Better would be to use GetOverlapped­Result with bWait = TRUE, because that also handles the case where the hEvent member of the OVERLAPPED structure is NULL.

Exercise: If you retrieve the completion status after canceling the I/O (either by looking at the OVERLAPPED structure directly or by using GetOverlapped­Result) there’s a chance that the overlapped result will be something other than STATUS_CANCELED (or ERROR_CANCELLED if you prefer Win32 error codes). Explain.

Exercise: If this example had used Read­File­Ex, the proposed fix would be incomplete. Explain and provide a fix. Answer to come next time, and then we’ll look at another version of this same principle.

Comments (31)
  1. pcooper says:

    My guess for Exercise 1: the IO could have completed successfully between the time that your wait-for-1-second expired and the time you called the CancelIO function, so the status reflects that it actually all worked.

  2. Nathan_works says:

    It's still a race condition waiting for the underlying cancel to complete. Until the event is signaled, not much in the struct can be considered valid..

  3. Adam says:

    For the second one from the documentation "The ReadFileEx function ignores the OVERLAPPED structure's hEvent member. An application is free to use that member for its own purposes…" which tells me that "WaitForSingleObject(o.hEvent, INFINITE);" would be a bad idea.

    I'd think one way to solve that is to allocate the OVERLAPPED structure on the heap, and delete it in the completion callback.

  4. Kyle says:

    For the second exercise (ReadFileEx()), the proposed fix would be incomplete because ReadFileEx() doesn't signal hEvent when the I/O is complete; it just calls (or queues) the I/O completion routine.  In this circumstance, the addition to the fix provided above would be to signal the hEvent member manually in the completion routine.  This way, the OVERLAPPED local variable stays valid since the function still won't return until hEvent has been signaled.

    As an aside, it seems based on fix that I provide above that ReadFile() is essentially doing what I proposed behind the scenes.  Is that actually the case?

    [Assume that the rest of the code has also been adjusted for ReadFileEx. (E.g., WaitForSingleObject changes to a wait on an event set by the completion function.) -Raymond]
  5. Ian Kelly says:

    In addition to the above fix, you have to wait in an Alertable state. So assuming you signal the hEvent in your completion routine, you would need to change the WaitForSingleObject call to WaitForSingleObjectEx(o.hEvent, INFINITE, TRUE).

    [That only fixes part of the problem. Keep looking, or wait for the answer tomorrow. -Raymond]
  6. Kyle says:

    Ian –

    Good point…waiting in an alertable state is absolutely necessary.  It would still be necessary, however, to check the completion state after the WaitForSingleObjectEx() call because other APC's theoretically could have been queued to the thread.  If it's not complete, perform another alertable wait.

    Raymond –

    The MSDN documentation doesn't indicate whether or not GetOverlappedResult() will wait in an alertable state.  I assume it probably doesn't and WaitForSingleObjectEx() must be used?

  7. Maurits says:

    FWIW, experiments seem to indicate that WaitForSingleObject(NULL, INFINITE) returns WAIT_FAILED and calls SetLastError(ERROR_INVALID_HANDLE), which is good.

  8. David Levine says:

    FWIW this illustrates one of the dangers of sharing memory between code, especially across user-kernel boundaries. If an API was used that did not share memory the corruption problem would be avoided. It also shows how incredibly complex canceling IO operations can be – there are lots of use cases and race conditions that have to be accounted for.

    [I assume you're talking about the control buffer. (Copying the I/O buffer can get really expensive for large I/O.) The alternative to sharing the control buffer is having ReadFile return a handle that you then pass to GetOverlappedResult, and then which you have to close. It also means that you need to find some other place to store your reference data. -Raymond]
  9. Adrian says:

    "You're then going to declare that the world has gone mad and seriously consider a different line of work."

    Literally laughed out loud because it's so true.  Thanks, Raymond.

  10. waleri says:

    I think alertable wait should be in a loop.

    Also, I suspect that actual error code after CancelIo() would depend on the driver underneath.

    Correct me if I am wrong, but from my observations, no internal wait is ever alertable.

    I guess it is because the waiting thread should be notified about APC event. Why should it be notified is a whole another story.

  11. Jonathan says:

    Well thanks, now you made me read your article with the voice of John Cleese in my head.

  12. Phil W says:

    I had no idea that Raymond might be a Monty Python afficionado…

    "it has passed on: You have an ex-I/O operation…"

  13. Vilx- says:

    A very interesting topic! This reminds me of another edge case: what about if you call CancelIO and then end your process? I remembered this because of a somewhat related observation that one of the few ways how you can create a truly unkillable zombie process in Windows is to make it hang at some IO. This can get very annoying if the process has locked some other resource (such as a file) that you would really like to access. But until the IO completes (which can take ages in some cases, like a scratched CD/DVD, or a network problem), you're stuck. The only way to reset things is to reboot.

  14. Morten says:

    "You're then going to declare that the world has gone mad and seriously consider a different line of work."

    Oh, many products can do that to you. Now that I think about it, they all can. Remember: down, not across. :-)

  15. Alex Grigoriev says:

    @Vilx-:

    what about if you call CancelIO and then end your process

    When a thread terminates, all IO associated with that thread are canceled. This means the equivalent of CancelIo is called for all of them; thus it's not really necessary to call CancelIo in your test case. A bad driver that doesn't support IO cancellation can cause a thread or a process to get stuck on its unwind. A "bad driver" is also storport.sys that hosts most HBA miniports. It only relies on ULP timeout, which may be very big in some cases. And if you issue SCSI_PASSTHROUGh requests, you specify your own ULP timeout. Go figure.

    @Raymond:

    Part of the confusion is because the SDK documentation doesn't say explicitly that the IO is not completed as soon CancelIo returns; and that you still need to wait for it. The docs vaguely say "The thread can use the GetOverlappedResult function to determine when the I/O operations themselves have been completed." Which, depending on the reader, may or may not be interpreted as requirement to determine when the IO is completed.

  16. Maurits says:

    The only way to reset things is to reboot.

    Would restarting the driver in question also work (that is, can the driver cancel/fail the IRP from below in response to a shutdown request?)

  17. Steve says:

    The second cardinal rule of the OVERLAPPED structure is "You don't talk about the OVERLAPPED structure".

  18. Ben says:

    [The alternative to sharing the control buffer is having ReadFile return a handle that you then pass to GetOverlappedResult, and then which you have to close. It also means that you need to find some other place to store your reference data. -Raymond]

    It seems to me that this would have been the better solution (if the crystal ball had been operating that day). A potential leak is infinitely preferable to memory corruption that cannot be debugged. When is that time-machine due to be ready?

  19. janm says:

    In the ReadFileEx case, when waiting on the event set in the completion function, the return value from "WaitForSingleObjectEx(h, INFINITE, TRUE)" is important.

    If it returns with WAIT_IO_COMPLETION, there is no guarantee that I/O that completed is the I/O this function cares about, so something like "while (WaitForSingleObjectEx(h, INFINITE, TRUE) != WAIT_IO_COMPLETION) ;" would avoid the problem and ensure that forward progress is only made after the completion function executes.

    On the first exercise, the answer is clearly a race condition.

  20. Moi says:

    Raymond – I really enjoy your coding posts. I loved your scrollbar series, the context menu stuff, and all the random one-of or two-of posts. Please never stop!

  21. Andreas says:

    Thanks Raymond for writing these very important articles. Your blog has meant more for me improving my Windows programming skills than any book has. As Moi says, please never stop!

  22. K says:

    That is evil. How can I find something like that, except by being more experienced and smarter than I am?

  23. Bob says:

    Seems to me that in general that it is a bad idea to allocate the OVERLAPPED structure as local to such a low-level leaf function. In actual code, you'd be doing something important while the I/O progresses instead of just immediately waiting for it. (Otherwise, you would have used a synchronous I/O call.)  If that "something important" raised an exception or otherwise exited prematurely, you could have the same corrupt memory problem. Seems that this structure should either be allocated from the heap (as another commenter suggested) or at least be local to a back-stop procedure which catches any internally handle-able exceptions.

  24. Rangoric says:

    @K

    Actually you are experienced and smart enough to find out about it. You are reading this blog.

  25. And says:

    Second exercice

    I think that if you change to WaitForSingleObjectEx, you still have to waith for the overlapped completion routine to finish, so you would have to alloc OVERLAPPED on heap and release it inside the completion routine.

  26. Zarat says:

    Where does the event handle actually get closed?

  27. Alex Grigoriev says:

    @Mauritis:

    A PNP driver doesn't get a remove request until all the IRPs on the stack are completed. It does get query_remove, though, but that doesn't mean to complete all requests. Usually, it's a job of the client app to close the handle on the remove device notification.

  28. Alex Grigoriev says:

    @Zarat:

    The event handle is only closed when the app explicitly closes it (or terminates).

  29. Cheong says:

    [You're then going to declare that the world has gone mad and seriously consider a different line of work.]

    Not exactly.

    If I were able to trace down the problem this close, I'd search on the web and by chance I'll find someone posting explanation like this series.

  30. Josh says:

    Four years too late. I ran into this *exact* problem half a year after I joined Microsoft. I didn't write the code, but I got asked to solve a bug that was caused by this same issue. It was particularly confusing because it only repro-ed on prod builds, and as you noted, didn't trigger and on-write breakpoints in WinDbg. And when it did hit, it overwrote the return pointer on the stack, so the program tried to return to STATUS_CANCELLED. Quite confusing. I spent the better part of a month trying new approaches (we didn't have a good repro initially, so even examining the stack to find the bad return value was out of the question).

    I did eventually solve it, so no, my particularly OS app doesn't suffer from the problem anymore, but damn was it annoying to figure out.

  31. Ben Voigt [Visual C++ MVP] says:

    Please, let's not forget that in addition to keeping the OVERLAPPED structure available for use, the data buffer must remain available too.  So the various proposals to not use the OVERLAPPED structure to store the status haven't actually solved anything.

Comments are closed.