Why does CreateFile take a long time on a volume handle?


A customer reported that on Windows XP, a call to Create­File was taking a really, really long time if it was performed immediately after a large file copy. They were kind enough to include a demonstration program:

#include <windows.h>

int main(int argc, char **argv)
{
 // Open a handle to the D: volume itself (note the \\.\ prefix),
 // not to a file on it.
 HANDLE h = CreateFile("\\\\.\\D:",
                       GENERIC_READ | GENERIC_WRITE,
                       FILE_SHARE_WRITE | FILE_SHARE_READ,
                       NULL,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL);
 // Hold the handle open long enough to observe the behavior.
 Sleep(5000);
 if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
 return 0;
}

If this program is run on its own, the Create­File completes quickly. But if you copy 1.7GB of data immediately before running the program, then Create­File takes longer. The customer would like to know the reason for this issue and whether there is a way to avoid it.

The reason is that you just copied a lot of data, so there is a lot of dirty data in the disk cache that is waiting to get flushed out. And when you create the volume handle, Windows needs to flush out all that data so that the volume handle sees a consistent view of the volume. Flushing out 1.7GB of data can take a while.

There is no way to avoid this problem because the speed of data transfer to the drive is limited by the drive hardware. It will take N seconds to transfer 1.7GB of data, so the time between the start of the file copy operation and the successful opening of the volume handle will be N seconds. If you want the Create­File to go faster, you could do a Flush­File­Buffers on the file being copied so that the cost of writing the data gets charged to the copy operation rather than the Create­File, but that's just creative accounting. You didn't actually make any money; you just moved it around.
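
For concreteness, here is a minimal sketch of that creative accounting; the source and destination paths are hypothetical, and error handling is minimal:

#include <windows.h>

// A sketch only: flush the freshly-copied file so the cost of writing
// the data is charged to the copy, not to a later volume open.
int CopyWithEagerFlush(void)
{
 if (!CopyFile("C:\\source.bin", "D:\\copy.bin", FALSE)) return 1;

 HANDLE h = CreateFile("D:\\copy.bin", GENERIC_WRITE, 0, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
 if (h == INVALID_HANDLE_VALUE) return 1;

 // Push the file's dirty cache pages out to the device now, so a
 // subsequent CreateFile("\\\\.\\D:") has that much less to flush.
 BOOL ok = FlushFileBuffers(h);
 CloseHandle(h);
 return ok ? 0 : 1;
}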

Now, a lot of programs open a volume handle but don't actually read from it or write to it, such as the sample program above. Therefore, newer versions of Windows (I don't recall whether the change came in Windows Vista or Windows 7) defer the flush until somebody actually tries to use the handle for reading or writing. So at least for the sample program above, the Create­File will complete quickly. However, the first read or write operation will be slow.
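
A timing sketch (an illustration only, not part of the customer's report) makes the shifted cost visible; it assumes a D: volume and uses GetTickCount64, which requires Windows Vista or later:

#include <windows.h>
#include <stdio.h>

int main(void)
{
 ULONGLONG t0 = GetTickCount64();
 HANDLE h = CreateFile("\\\\.\\D:", GENERIC_READ,
                       FILE_SHARE_READ | FILE_SHARE_WRITE,
                       NULL, OPEN_EXISTING, 0, NULL);
 ULONGLONG t1 = GetTickCount64();
 if (h == INVALID_HANDLE_VALUE) return 1;

 // Volume reads must be sector-aligned; VirtualAlloc returns
 // page-aligned memory, which satisfies any common sector size.
 void *buf = VirtualAlloc(NULL, 4096, MEM_COMMIT, PAGE_READWRITE);
 if (!buf) { CloseHandle(h); return 1; }

 DWORD got;
 ReadFile(h, buf, 4096, &got, NULL); // the first I/O pays for the flush
 ULONGLONG t2 = GetTickCount64();

 printf("CreateFile: %llu ms, first ReadFile: %llu ms\n",
        (unsigned long long)(t1 - t0),
        (unsigned long long)(t2 - t1));
 VirtualFree(buf, 0, MEM_RELEASE);
 CloseHandle(h);
 return 0;
}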

Again, the total time doesn't change. All that changes is where the cost of the flush is incurred.

Comments (32)
  1. Joshua says:

    So why would a lot of programs open a volume handle but not read or write it? The mind boggles.

  2. Zarat says:

    @Joshua: Maybe to test if the volume actually exists?

  3. Karellen says:

    So why does Windows need to flush out all that data so that the volume handle sees a consistent view of the volume? Why doesn't the kernel just provide a consistent view of the volume anyway, and flush out the dirty data in the background?

    [This means that every call to read from the volume first has to call the file system driver to say "Hey, somebody is reading from sector 5. Were you planning on flushing data to sector 5? If so, tell me what you would have written to sector 5, so I can return that." But the file system can't answer the question "Were you planning on flushing data to sector 5?" until it processes its unflushed actions and gets around to assigning sectors to every pending action, in order to see if 5 was on the list – so you're still doing all the bookkeeping of flushing (but without the flushing). And then you have to remember the answer you gave since you are committed to honoring it. This is a lot of complexity to add to a file system driver for something that is not a common scenario. -Raymond]
  4. Mike Diack says:

    I have to say, I kind of agree with Karellen. I thought the Windows caching system meant that applications saw a consistent view of the filesystem whether or not the physical disc sector writes had yet all been flushed?

    [The application opened the volume, not the file system. If the application accessed the file system, then the caching would be just fine. But it's bypassing the file system and going straight to the volume. -Raymond]
  5. JM says:

    @Karellen: because that's even more expensive (both in development and system resources) than making some programs wait. You're basically asking for the kernel to create a snapshot every time someone opens a volume handle. Even though you could update all the snapshots (and mark them as "up to date") the next time flushes catch up, that's still quite expensive. I'm guessing volume handles aren't opened that often to make this worthwhile.

    [Aha, I see the problem now. The consistent view that apps expect is consistent with the data I just wrote. Sure, the kernel could take a snapshot of the volume at the point the volume is opened, but that's not what apps want. Apps want to see all the data that got written to the volume. Otherwise, your defragmenter will create data loss. -Raymond]
  6. Someone says:

    "Apps want to see all the data that got written to the volume." Thats surely not the expectation. Even defragmenters would happely operate ABOVE tbe block cache. Even for them its very important to operate at the logical-written data (in the böock cache) and not at the physical (in the sectors).

  7. Matt says:

    Opening a handle to a volume is an extremely uncommon and specialist scenario. It's not surprising that Windows hasn't been micro-optimized to speed it up at the expense of adding extra complexity and slowness to NtCreateFile/NtWriteFile.

  8. JM says:

    @Raymond: no, I'm talking eventual consistency. The snapshot doesn't need to be app-exclusive. Once the big file flush has happened, poof, the snapshot is suddenly up to date as well (and all other extant snapshots as well). Of course, when an app wants to write, we must block, because we don't want the file flush to fail just because we got in a write earlier (I mean, we could do that, but that's probably not what people want).

    How could an app complain it's not "seeing all the data that got written to the volume" given that we haven't written it yet? That's what the flush is doing, after all… All the snapshot achieves is that you can start reading a consistent volume sooner.

    [So an app opens the volume, reads some information (from a possibly stale snapshot), does some calculations based on that information, and updates that information. That write will wait for the flush, and now the volume is inconsistent (because the app wrote data based on stale information). -Raymond]
  9. Wait, so does this mean that if a program has a volume handle open, the kernel is forced to flush all writes on that volume whenever a write operation occurs, to guarantee consistency? Or at least, that this was the case with Windows XP and earlier?

  10. Bob says:

    @Joshua: DeviceIoControl (e.g., to query volume metadata)
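
    For illustration, a minimal sketch of Bob's scenario: querying drive geometry through a volume handle opened with an access mask of zero (metadata queries via DeviceIoControl don't require GENERIC_READ or GENERIC_WRITE, so nothing needs to be read through the handle):

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
      // Access mask of 0 is enough for DeviceIoControl metadata queries.
      HANDLE h = CreateFile("\\\\.\\D:", 0,
                            FILE_SHARE_READ | FILE_SHARE_WRITE,
                            NULL, OPEN_EXISTING, 0, NULL);
      if (h == INVALID_HANDLE_VALUE) return 1;

      DISK_GEOMETRY geo;
      DWORD bytes;
      if (DeviceIoControl(h, IOCTL_DISK_GET_DRIVE_GEOMETRY,
                          NULL, 0, &geo, sizeof geo, &bytes, NULL)) {
        printf("Bytes per sector: %lu\n", geo.BytesPerSector);
      }
      CloseHandle(h);
      return 0;
    }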

  11. Joshua says:

    [But it's bypassing the file system and going straight to the volume. -Raymond]

    And the old UNIX world doesn't have this problem because we put the caching at the volume level instead (most likely so that all filesystems could share the same cache code). The new UNIX world of dynamic filesystems just decided we don't care. Don't open device nodes of mounted filesystems. Online defrag by writing to device nodes never worked anyway. If you want online defrag, you need to implement it in-kernel with ioctl().

  12. jader3rd says:

    If the data was being written with cache write-through (i.e., bypassing the cache and writing directly to disk), would it still be a problem here?
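
    For reference, a minimal sketch of what "cache write-through" means here (the path parameter is hypothetical): opening the destination with FILE_FLAG_WRITE_THROUGH makes each WriteFile reach the device before completing, so little dirty data accumulates in the cache:

    #include <windows.h>

    // Sketch: open a destination file for write-through I/O.
    HANDLE OpenWriteThrough(const char *path)
    {
      return CreateFile(path, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                        FILE_ATTRIBUTE_NORMAL | FILE_FLAG_WRITE_THROUGH,
                        NULL);
    }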

  13. alegr1 says:

    The actual question is: why is the memory manager + cache manager so eager to give some fleeting file a lot of cache memory by robbing the working sets of the running applications, causing a lot of page thrashing afterwards?

    Oh well, at least it's not as bad as it was for a while in Vista, where you could not cancel a big copy to a slow USB device quickly, because the file data had to be flushed.

  14. JM says:

    @Raymond: yes, that's the idea. How is that prevented today? The read I just did can already be outdated, unless I issue a lock to prevent that. And if I do, obviously, our hypothetical snapshot implementation again has to block until the data is actually stable, to preserve the same semantics.

    [Presumably, the operating system could delay the flush until the app takes an exclusive volume lock (thereby preventing other writes), but I suspect it does it on first I/O because a lot of apps forgot to take exclusive locks. -Raymond]
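
    The exclusive volume lock mentioned here is presumably taken with FSCTL_LOCK_VOLUME; a minimal sketch, assuming hVolume was opened with GENERIC_READ | GENERIC_WRITE:

    #include <windows.h>
    #include <winioctl.h>

    // Sketch: request exclusive access to the volume. The call fails if
    // other handles to files on the volume are still open, so utilities
    // typically retry or give up.
    BOOL LockVolume(HANDLE hVolume)
    {
      DWORD bytes;
      return DeviceIoControl(hVolume, FSCTL_LOCK_VOLUME,
                             NULL, 0, NULL, 0, &bytes, NULL);
    }
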
  15. JM says:

    Note that (because, again, we are not willing to declare future writes invalid based on what we do now), we'd have to wait for consistency if the application takes any lock whatsoever, because we don't know in advance which sectors are going to be affected. The snapshot implementation would do exactly nothing to eliminate waiting for applications that need atomic updates.

    At this point I feel obliged to remind people that I never argued for this feature to be implemented in the first place. :-)

  16. Dan Bugglin says:

    @Joshua: "Windows takes programs as they are, not as we'd want them to be." – Nick Fury (sort of)

  17. xor88 says:

    Why flush at all? The volume can immediately become inconsistent again when the next write happens. Apps can never rely on seeing a consistent volume anyway; in practice, that assumption is false essentially all the time.

    [In practice, the app will most likely unmount the file system before opening the volume. That ensures nobody else is writing to the volume. -Raymond]
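
    For completeness, a hedged sketch of that unmount-first pattern, via FSCTL_LOCK_VOLUME followed by FSCTL_DISMOUNT_VOLUME (again assuming a read/write volume handle):

    #include <windows.h>
    #include <winioctl.h>

    // Sketch: lock the volume, then dismount it so the file system can
    // no longer write behind us. Dismounting invalidates all open file
    // handles on the volume and flushes the file system's state.
    BOOL PrepareForRawAccess(HANDLE hVolume)
    {
      DWORD bytes;
      return DeviceIoControl(hVolume, FSCTL_LOCK_VOLUME,
                             NULL, 0, NULL, 0, &bytes, NULL)
          && DeviceIoControl(hVolume, FSCTL_DISMOUNT_VOLUME,
                             NULL, 0, NULL, 0, &bytes, NULL);
    }
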
  18. Myria says:

    I just wish that there were a CreateFile3 that allowed you to put the creation itself onto an I/O completion port for asynchronous operation.  Creation is one of the last few file operations to remain synchronous.  Maybe it could work like AcceptEx, where you would create an unassociated file handle first, then call CreateFile3.

  19. Joshua says:

    @xor88: I've seen valid use cases. They all involved removable media, which means the filesystem had better be consistent by the time the last close() returns and you could basically rely on the user not to screw it up.

  20. Random832 says:

    Why can't it just read back from the cache to provide a consistent view?

    [My guess is that (1) the data may not yet be in the cache in a usable form [see previous comment], (2) the intent is to provide direct volume access with no caching. -Raymond]
  21. alegr1 says:

    @Random832:

    >Why can't it just read back from the cache to provide a consistent view?

    Because you can't easily/quickly translate from the volume LBA to an open file view offset.

  22. alegr1 says:

    A cache behavior that would make sense for removable devices: limit the total amount of dirty data so that none of it is ever more than about 1 second old. Though this could be difficult for writeable mapped sections open on it.

  23. Karellen says:

    "The application opened the volume, not the file system."

    Doh! Of course. Sorry, I totally missed that.

    Now I'm not sure why it's necessary that the volume be consistent with the filesystem. Why isn't it OK for the volume to just represent the current on-disk state, before any pending flushes?

    I'm not sure why people are talking about snapshots either. In a pre-emptive multitasking OS (so, uh, all of them in the last 20 years) you can basically never rely on any answers you get back about the state of a filesystem. Between the time a query returns and your *very next line of code* which examines the answer, files could have been added, deleted, renamed, resized, changed ownership, changed access permissions, or possibly had something else happen to them that I've forgotten. You need to be able to handle the filesystem changing under you.

    About the only thing that can save you from this (as JM has mentioned) are filesystem-based locks. But given that a volume, and a file on a filesystem in a volume, are two different objects, you can't use a lock to synchronise access between them anyway. So, uh, if you're not able to handle the volume and the FS on the volume being inconsistent, you're probably in trouble anyway, aren't you?

  24. Paul says:

    Does the NCQ functionality in modern HDDs/SSDs have an effect on this, or is it too close to the hardware to know whether the command could be prioritised?

    My understanding is limited, but I am under the impression it allows commands which are quick to perform (such as whatever CreateFile turns into at the SATA level) to be moved up in the queue ahead of slower commands, such as the writing of data to disk, in order to appear more responsive to the user/application.

  25. @Paul says:

    You are thinking way too low in the stack. The volume is managed by the OS, at least by a Volume Manager and maybe by several levels of drivers (HD, USB, network, encryption, software RAID, Volume Shadow Copy Service, etc.). An app that opens a volume is still operating on a logical view provided by the OS.

    Because of this, all this "the OS has to flush something first" is nonsensical. As long as the app does not have exclusive access to the volume, there is no consistent data to read, because other processes and the OS itself are still changing things at any time, regardless of caching.

    An app operating at the volume level needs exclusive access, or the volume (and in turn, the filesystem on the volume, if there is any) must be mounted read-only. (A read-only view can also be achieved through Volume Shadow Copy.)

  26. Someone says:

    To expand on the previous post (@Paul): For example, chkdsk.exe is able to check the filesystem inside a Truecrypt volume, even if the TC container file is provided by a network share. chkdsk is operating at the volume level, but does not access sectors directly. It goes through the Volume Manager. And it needs exclusive access to the volume (via the /f parameter) to do anything meaningful.

    If some app is opening a volume with GENERIC_WRITE, I would demand that the OS fail the call as long as the volume is opened for WRITE by any other part of the system.

    If some app is opening a volume only for reading, I would expect the OS to let the call succeed but leave any cache as it is, because flushing the cache at this point does not provide a stable, consistent view anyway.

    A filesystem-level cache must be flushed at unmount, or when the mount is switched to read-only, but not when some app opens a volume handle.

  27. sense says:

    I believe that there is a very good reason why the system is designed this way. (I trust that whoever designed it knew the matters discussed in these comments.) But I cannot quite grasp it.

    When someone opens a r/w shared handle while the drive is mounted/active, is it possible to provide a consistent view at all? If yes, exactly how? If not, what's the reason behind the wait? (Is it a historical liability?)

    If the filesystem is unmounted – which is the only way I can imagine a consistent view being achieved – then the wait has already happened while unmounting.

  28. @sense says:

    "When someone is opening a r/w shared handle, and when the drive is mounted/active, is it possible to provide a consistent view at all?"

    No, because just opening a handle does not stop other processes (like the indexing service, or any other process) from changing file data, which in turn causes the OS to write the changed blocks to the volume at unpredictable times.

    The example provided in this post makes no sense to me: when the filesystem is mounted (the user has just copied 1.7GB of file data), this CreateFile with GENERIC_WRITE must fail, because the filesystem has exclusive access to the volume (it has not used FILE_SHARE_WRITE).

  29. Dave Bacher says:

    Not disk utilities — page file.

    On Linux, you have a dedicated swap volume.  Resizing that volume impacts adjacent volumes, and in a worst case, can be painful as the volumes are resized (mostly full volumes, etc.).

    An alternative to that is allocating the bytes through the file system, then locking that range, and asking the file system what addresses on the disk are in use (see the sketch at the end of this comment). You then bypass the file system, and just write straight to the drive. These daemons don't require anything else in the stack — they just require exclusive access to the sectors that they own.

    Those ranges don't need to be backed up, they don't need transactions — and the only security they need is "my precious, don't touch my precious."  So potentially — if you have the file system blocking all other access to those bytes, then you can just go straight to DASD.

    There would be other similar use cases where individual apps might not want the file system involved in their dealings with a file; I suspect that's why the volume is opened with share-write.
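
    For illustration, a hedged sketch of the "ask the file system what addresses are in use" step, using FSCTL_GET_RETRIEVAL_POINTERS (the loop over multiple extents and most error handling are omitted; hFile is assumed to be open with GENERIC_READ):

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    // Sketch: ask the file system where a file's clusters live on the volume.
    void PrintFirstExtent(HANDLE hFile)
    {
      STARTING_VCN_INPUT_BUFFER in = {0}; // start the query at VCN 0
      union {
        RETRIEVAL_POINTERS_BUFFER rp;
        BYTE raw[1024];                   // room for a few extra extents
      } out;
      DWORD bytes;
      if (DeviceIoControl(hFile, FSCTL_GET_RETRIEVAL_POINTERS,
                          &in, sizeof in, &out, sizeof out, &bytes, NULL)
          && out.rp.ExtentCount > 0) {
        printf("First extent begins at LCN %lld\n",
               (long long)out.rp.Extents[0].Lcn.QuadPart);
      }
    }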

  30. I guess the tricky part of this post was the "\\.\D:" path. It is "\\.\", not "\\?\". But maybe I am not making any sense?

  31. Paul says:

    Thank you for the explanations; I had missed that in this scenario the program is asking for a handle to the entire volume rather than just one file, so it makes sense that anything else asking for a handle on that volume would be blocked.

    Is it still too low-level if the handles are for two different files on the same disk? I would assume, based on your replies, that it then comes down to the specific driver implementations used in the stack to get to the raw drive, i.e. the SSD driver and the NTFS file system driver may be able to carry the context down to the drive for prioritisation, or indeed perform prioritisation themselves before passing anything lower down (and they must, otherwise all our computers would be unusable from the amount of drive-access contention and blocking going on). But as you said, in another scenario it may not, such as a USB key that doesn't implement NCQ anyway, or a poorly written driver which doesn't implement any form of prioritisation of its workload at all.

  32. @paul says:

    CreateFile, WriteFile, ReadFile, etc. process and change blocks of volume data in a filesystem-specific way (NTFS, FAT, and so on). There can be caching at any layer: at the drive, in a RAID card or RAID driver, at the volume level, and (for data structures like free-space bitmaps, inodes/MFT entries, directory items, whatever) at the filesystem level.

    An app accessing a volume directly (if that is possible for a mounted volume) would bypass the filesystem (and any cache at that level), but not any caching below the volume (like NCQ or RAID or drive-internal caching).

Comments are closed.
