If I issue a second overlapped I/O operation without waiting for the first one to complete, are they still guaranteed to complete in order?

A customer had a question about the order in which overlapped I/O will complete.

WriteFile(hFile, buffer1, buffer1Length, ..., &overlapped1);
WriteFile(hFile, buffer2, buffer2Length, ..., &overlapped2);

Assume that the hFile handle is opened with FILE_FLAG_OVERLAPPED. Is it guaranteed that buffer1 will be written to the file before buffer2?

The file system team replied that there is no such guarantee. Such a guarantee would defeat the point of opening the file with FILE_FLAG_OVERLAPPED.

The point of overlapped operations is that you can have multiple operations in flight, and they will be performed in whatever order the I/O subsystem chooses. The second write may be performed before the first if the I/O subsystem thinks that would be faster. For example, maybe the second write can be coalesced with another write to the same sector. Or the disk head happens to be positioned such that seeking to the first buffer position causes it to pass the second buffer position, so the drive figures, "Well, while I'm here, I may as well write out this data, so I don't have to seek back later."

The customer clarified. "Our application uses overlapped I/O for performance purposes. Our actual scenario is more complicated than what we wrote, but the basic idea is that we are receiving data and writing it to a file. We require that the data be written to the file in the order issued. Do we have to wait for the first I/O to complete before we issue the second one?"

Yes. If the order in which the operations are performed is important, then you need to serialize them yourself. But assuming that the two writes are to non-overlapping ranges in the file, why do you care in what order they are performed? At the end of the day, buffer1 will be written to the location specified by overlapped1, and buffer2 will be written to the location specified by overlapped2.

The customer explained some more: "In both calls to WriteFile, the offset is set to 0xFFFFFFFF`FFFFFFFF, which means that the writes append to the file. Does this change the answer?"

Sort of, but not in a good way. The two I/O operations race into the I/O subsystem, and there's no guarantee that the first one will reach the I/O subsystem first. That part hasn't changed. On the other hand, since both operations are writing to the end of the file, the operations will be serialized once they reach the file system, so you are getting the worst of both worlds: Not only are the results unpredictable, you lose parallelism.

Note also that the completion callbacks may be called in an order different from the order in which the operations actually completed. In other words, it's possible that operation 1 completes before operation 2, but your completion callback for the second operation is called before the completion callback for the first operation. There is no serialization of completion callbacks. They race out of the I/O subsystem the same way that they race in!

Curiously, the customer says that they are using overlapped operations for performance, but then they end up not wanting all the benefits that overlapped operations offer in the first place, namely letting the I/O subsystem reorder operations to improve performance. It's possible that they read somewhere that overlapped operations offer higher performance, but didn't understand what that meant. "We pass this flag because the flag means GO FASTER."

Comments (17)
  1. skSdnW says:

    If they insist on using async I/O they should switch from the -1 offset trick to just storing the offset themselves and use InterlockedCompareExchange to grab free blocks in the file. I bet that they don’t actually log that much data that fast and could just write it out synchronously on a background thread or something like that.

  2. Vilx- says:

Perhaps the performance gains they are hoping for aren’t from executing two IO requests in parallel, but rather from having an IO request execute in parallel with their own code.

    1. That’s a sign that they’ve misunderstood the programming model – I/O requests go to/from cache that’s flushed out in the background, not directly to/from the device. Writes should only block if you’re issuing enough of them that you’re overwhelming the device that you’re writing to – in which case, you need a more sophisticated I/O strategy anyway.

      Overlapped I/O is useful when you can issue multiple independent requests, and don’t care about ordering – you might as well pass on the information about requests to a lower level, and let it care if it’s got reason to care about ordering.

  3. Darran Rowe says:

The only time I ever used -1 in an OVERLAPPED structure was when I wanted to write to a file but not rely on the current file pointer (note: the file wasn’t even opened with FILE_FLAG_OVERLAPPED). But to this day, I still feel that if I had more time, I would have come up with a better solution.

  4. Nico says:

    I laughed out loud at the snippet for this entry on the index page: “Of course not. That’s why it’s called “overlapped.”” That pretty much sums it up! :)

  5. alegr1 says:

A WriteFile that extends the file’s length is synchronous anyway and is serialized against other such operations, so it’s completely pointless to use OVERLAPPED I/O for appends. Moreover, if the file is cached, WriteFile completes synchronously in most cases anyway.

    1. cheong00 says:

Agreed. If the writes are not that big, just write them all to a buffer and let the SATA controller do the reordering.

  6. asdf says:

Has there been any discussion of the IO system exposing an API similar to a CPU’s memory barriers (or cache coherence protocol in general)? And by barriers I mean control over the order in which the device commits data to permanent storage (as opposed to an intermediate RAM buffer).

Fsync is often too much, and you have devices pretending to fsync to gain speed at the expense of correctness, but something that controls the order of reads or writes is often adequate and probably won’t encourage device vendors to play such stupid tricks.

    1. Barriers don’t really exist at this level. Taking SCSI (which I know more about), each command gets a tag (which can be up to 64 bits in size, depending on the implementation – parallel SCSI was 8 bits), and you’ve got three relevant command groups:

      * Normal reads and writes; these just transfer data, and have no special behaviour. It’s legal for such a command to complete when it has put data in the cache.
      * Cache flushes; these ensure that the cache and the disk are in sync.
      * Forced Unit Access writes; these commands are special because they cannot complete until the data is on the disk.

      You then have two relevant queueing modes:

      * Normal command queueing – device is free to reorder, as long as it eventually executes the command.
      * Ordered command queueing – all commands queued before an ordered command must complete before it does, all commands queued after an ordered command must wait for the ordered command to complete.

      Within the rules provided by SCSI, a drive can do as it sees fit; so, for example, if I send 100 normal writes, then a single ordered write, then 100 normal reads, I can guarantee that the reads will see the data as written by the preceding 101 writes, but the data may not be on disk. If I send 100 FUA writes, then a single ordered FUA write, then 100 reads, I can guarantee that all 101 writes are on disk before any of the reads start. If I send 100 normal writes, an ordered cache flush, 100 normal reads, then one FUA write, I can guarantee that the writes before the flush are on disk before the reads, and that the FUA write either never completes, or hits disk.

      1. Of course, this doesn’t help with cheating – FUA costs time, as does cache flushing, and it’s cheaper to NOP cache flushes and turn FUAs into normal accesses than it is to do the right thing, and vendors will try and get away with as much as they can.

At least one vendor (no longer extant) had their SCSI HDDs treat FUA as “complete command when in cache, schedule cache write-back to take place immediately”. If you had power loss shortly after the FUA completed, the drive hadn’t written to disk, and the “safe” data was lost; the correct implementation would be to delay the “complete command” action until after the cache write-back completed, but the vendor didn’t want to do that because their implementation had no way to wait for a part of the cache to be written back.

  7. Killer{R} says:

What you’re telling contradicts what MSDN says about overlapped sockets:
“and the send functions can be invoked several times to queue multiple buffers to send. While the application can rely upon a series of overlapped send buffers being sent in the order supplied, the corresponding completion indications might occur in a different order”
… and though sockets are not the filesystem, stream sockets mostly use the same I/O subsystem. So the actual I/O operations happen in the same order as requested, but the completion routine can be called in any order. That sounds logical.

    1. Darran Rowe says:

Still, you can’t infer what should happen with disk I/O from the documentation for socket I/O.
In fact, the lack of documentation on what happens for disk files is itself important, because it means the behavior is undefined. So yes, the documentation does allow for reordering of operations.
One final thing to remember is that just because sockets and files both have handles that can be passed to WriteFile, the objects backing them can be vastly different, and they eventually get serviced by totally different drivers, each with different capabilities and requirements.

    2. Darran Rowe says:

      As a bit of an addendum, the closest I have even gotten to documentation on this is in the WDK. On the page https://msdn.microsoft.com/en-us/library/windows/hardware/ff565534(v=vs.85).aspx there is the statement:
      “By default, the I/O manager does not maintain a current file-position pointer. This default provides efficiency—because correctly maintaining the current file position requires the I/O manager to synchronize every read and write operation on the file object.”
This pretty much says that synchronization and ordering are only guaranteed when the file pointer is being maintained (synchronous I/O). For asynchronous I/O, the WDK doesn’t even properly document whether the operation will be synchronized at all.
So the only thing we can get from this is inferred documentation that the ordering of disk I/O is documented as undocumented. This is specific to file access, too.

      1. Killer{R} says:

The notes about the file pointer and locking also sound logical from a common-sense point of view. OK, let’s consider the question from the other side. Suppose I’m implementing some disk driver that does its job by responding to SCSI commands like SCSIOP_READ/SCSIOP_READ16/SCSIOP_WRITE/SCSIOP_WRITE16. The question from the other side is: should my SCSI driver perform the actual I/O in the same order as requested by the kernel, or does it not matter because the kernel has already taken care of ordering before sending the requests to me?

        1. Killer{R} says:

And if the answer is ‘you may execute requests in any order’, why in that case are there special parameters in SCSI_REQUEST_BLOCK for handling I/O request queueing (if the device supports it)?

          1. Darran Rowe says:

For me at least, the most logical and common-sense answer would be to offload the synchronization onto the hardware; otherwise the kernel would have to do it in software with some kind of synchronization object. So those parameters exist for efficiency gains when dealing with synchronous access.

  8. Chris Chilvers says:

Though there could be two meanings of performance here: either faster I/O or non-blocking asynchronous operations on the GUI. I could see a case for wanting single-threaded asynchronous writes if the writes are being issued from a GUI thread. If that’s the case, I guess they were hoping the kernel was providing some sort of FIFO queue for a simple producer/consumer scenario.

Comments are closed.
