What does it mean when GetQueuedCompletionStatus returns ERROR_SEM_TIMEOUT?

A customer asked for assistance interpreting a failure of the Get­Queued­Completion­Status function.

We are observing that Get­Queued­Completion­Status is intermittently behaving as follows:

  • The handle is a SOCKET.
  • The function returns FALSE.
  • lpOverlapped != NULL.
  • Get­Last­Error reports ERROR_SEM_TIMEOUT: "The semaphore timeout period has expired."

That's all the information we have in our log files. We don't know the value of number­Of­Bytes or completion­Key, sorry.

We realize that this is a rather vague question, but when this problem hits our machines, it causes our internal logic to go into a reset state since it doesn't know what the error means or how to recover. Resetting is expensive, and we would prefer to handle this error in a less drastic manner, if only we knew what it meant.

The error code ERROR_SEM_TIMEOUT is a rather bad translation of the underlying status code STATUS_IO_TIMEOUT, which is much more meaningful. It means that the I/O operation timed out.
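
To make the lossy translation concrete, here is a minimal sketch of a status-to-error lookup in C. It is not the system's actual table (Windows performs this conversion internally, for example through RtlNtStatusToDosError). The function name nt_status_to_win32_sketch, the SKETCH_ prefixes, and the -1 "no mapping" sentinel are hypothetical. The numeric values are the documented ones for these constants, redefined locally so the sketch compiles without the SDK headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Documented values, redefined under SKETCH_ names so they do not
   clash with the real SDK headers. */
#define SKETCH_STATUS_SUCCESS       0x00000000u
#define SKETCH_STATUS_ACCESS_DENIED 0xC0000022u
#define SKETCH_STATUS_IO_TIMEOUT    0xC00000B5u

#define SKETCH_ERROR_SUCCESS        0L
#define SKETCH_ERROR_ACCESS_DENIED  5L
#define SKETCH_ERROR_SEM_TIMEOUT    121L

/* One row of the (hypothetical) translation table. */
struct status_map_entry {
    uint32_t nt_status;
    long     win32_error;
};

static const struct status_map_entry k_status_map[] = {
    { SKETCH_STATUS_SUCCESS,       SKETCH_ERROR_SUCCESS },
    { SKETCH_STATUS_ACCESS_DENIED, SKETCH_ERROR_ACCESS_DENIED },
    /* No semaphore is involved: "I/O timed out" simply has to be
       reported as some Win32 code, and this is the one it maps to. */
    { SKETCH_STATUS_IO_TIMEOUT,    SKETCH_ERROR_SEM_TIMEOUT },
};

/* Returns the Win32 error for a status code, or -1 (a sentinel made up
   for this sketch) when the table has no entry. */
long nt_status_to_win32_sketch(uint32_t nt_status)
{
    size_t i;
    for (i = 0; i < sizeof(k_status_map) / sizeof(k_status_map[0]); i++) {
        if (k_status_map[i].nt_status == nt_status) {
            return k_status_map[i].win32_error;
        }
    }
    return -1;
}
```

The point of the sketch is only that the caller sees the right-hand column, so the meaning of the left-hand column has to be inferred.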

Colleagues of mine from the networking team chimed in with additional information:

A common source of this error with TCP sockets is that the maximum retransmission count and timeout have been reached on a bad (or broken) link.

If you know that the handle is a socket, then you can use WSA­Get­Overlapped­Result on the lpOverlapped that got returned. Winsock will convert the status code to something more Winsocky. In this case, it would have given you WSA­ETIMED­OUT, which makes it clearer what happened.
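
A minimal sketch of that suggestion in C, assuming the completion came from a socket and that you still have the SOCKET on which the overlapped operation was issued. The names completion_kind, classify_wsa_error, and classify_failed_completion are hypothetical. WSAGetOverlappedResult and WSAGetLastError are the real Winsock calls. The non-Windows fallback definition is there only so the classification helper can be compiled and tested elsewhere.

```c
#ifdef _WIN32
#include <winsock2.h>   /* link with Ws2_32.lib */
#include <windows.h>
#else
/* Fallback so the helper below compiles off Windows; this is the
   documented value of WSAETIMEDOUT. */
#define WSAETIMEDOUT 10060
#endif

/* Hypothetical classification used by this sketch only. */
typedef enum {
    COMPLETION_OK,
    COMPLETION_TIMED_OUT,
    COMPLETION_OTHER_ERROR
} completion_kind;

/* Classify a Winsock error code; 0 means the operation succeeded. */
completion_kind classify_wsa_error(int wsa_error)
{
    if (wsa_error == 0) {
        return COMPLETION_OK;
    }
    if (wsa_error == WSAETIMEDOUT) {
        return COMPLETION_TIMED_OUT;
    }
    return COMPLETION_OTHER_ERROR;
}

#ifdef _WIN32
/* Call this after GetQueuedCompletionStatus has returned FALSE with a
   non-NULL lpOverlapped for an operation issued on socket s. */
completion_kind classify_failed_completion(SOCKET s, LPWSAOVERLAPPED overlapped)
{
    DWORD bytes = 0;
    DWORD flags = 0;

    /* fWait is FALSE because the operation has already completed. */
    if (WSAGetOverlappedResult(s, overlapped, &bytes, FALSE, &flags)) {
        return classify_wsa_error(0);
    }
    /* Winsock reports its own code here, e.g. WSAETIMEDOUT rather
       than ERROR_SEM_TIMEOUT. */
    return classify_wsa_error(WSAGetLastError());
}
#endif
```

What to do on COMPLETION_TIMED_OUT is up to the application. Closing just the affected connection instead of resetting everything is one option, but that is an assumption about the customer's design, not something the answer above prescribes.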

Comments (9)
  1. Joshua says:

    Well I could say use the select() model (winsock implements it) but it's just as hard on people who don't know it. Getting write() errors only when calling the next function, which is often read() is mega-confusing.

  2. Brian_EE says:

    It was nice that the customer was honest (and more importantly aware) that they were giving a question with vague details. Sounds like they were trying to do the right thing and just needed a nudge in the correct direction.

  3. Antonio Rodríguez says:

    Some customers understand that time is a valuable thing, and not to be wasted. They save you time and, as a bonus, they usually get their problem solved in a shorter time. On the other hand, most customers think that the way to go is giving vague information ("my computer is broken!") and trying to intimidate you so you get their problem fixed faster. Sadly, they achieve exactly the opposite.

  4. icabod says:

    I was half-expecting the response to be "Read the error – The timeout period on your semaphore has expired" until I re-read and saw it was completion ports.  Kind of glad to know it's not that obvious.

    So does this mean that internally a semaphore is used, and the API doesn't realise to translate it to something more useable (would WSAGetLastError do it?)?

  5. icabod says:

    I've re-read the post again and see that Winsock will indeed convert the error for you.  Please disregard that part of my previous comment, but I'm still curious to know if it's actually semaphores used internally (knowing that this would be implementation detail and liable to change).

    [Sorry I didn't make it clear in the article. There was never any semaphore. This was an error in translation. The actual error was "the I/O operation timed out", but the translation function converts the STATUS_IO_TIMEOUT code to ERROR_SEM_TIMEOUT (because you have to convert it to something). -Raymond]
  6. Gabe says:

    One would think that when the mechanism was first created, somebody could have made a Win32 error that means "IO timeout" rather than picking something that just has a similar name.

    A quick search for "ERROR_SEMAPHORE_TIMEOUT" reveals a whole lot of people who are having trouble copying files and such. I doubt I would even consider a semaphore timeout to even be an actual error.

  7. Haegar says:

    I encountered this error quite often in our server software on Win2003 servers. We just retried the operation, and it then failed with a well-known error code, which is what we expected.

    Since Windows Server 2008 the problem is gone. I suspect that's because the TCP/IP-stack was partly rewritten with Longhorn?

    Raymond, you didn't mention what OS the customer used.

  8. Jon says:

    I suspect that the customer will still need to use their reset state, though, rather than mess with default Windows tcp timeout settings.

  9. Marc says:

    How does GQCSEx behave when an I/O operation has timed out for a socket? The docs say if GQCSEx returns FALSE, then no I/O operation was dequeued. I assume this means that ulNumEntriesRemoved will be zero so there is no overlapped structure to inspect.
