The case of the asynchronous copy and delete


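A customer reported some strange behavior in the CopyFile and DeleteFile functions. They were able to reduce the problem to a simple test program, which went roughly like this:

#include <windows.h>
#include <stdio.h>

// Assume "a" is a large file, say, 1MB.
// CopyFileA and DeleteFileA are the ANSI forms of CopyFile and DeleteFile.

int main(void)
{
  for (;;)
  {
    // Try twice to copy the file
    if (!CopyFileA("a", "b", FALSE)) {
      Sleep(1000);
      if (!CopyFileA("a", "b", FALSE)) {
        fprintf(stderr, "copy failed, error %lu\n", GetLastError());
        return 1;
      }
    }

    // Try twice to delete the file
    if (!DeleteFileA("b")) {
      Sleep(1000);
      if (!DeleteFileA("b")) {
        fprintf(stderr, "delete failed, error %lu\n", GetLastError());
        return 1;
      }
    }
  }
}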

When they ran the program, they found that sometimes the copy failed on the first try with error 5 (ERROR_ACCESS_DENIED), but if they waited a second and tried again, it succeeded. Similarly, sometimes the delete failed on the first try but succeeded on the second try after a short wait.

What's going on here? It looks like the Copy­File is returning before the file copy is complete, causing the Delete­File to fail because the copy is still in progress. Conversely, it looks like the Delete­File returns before the file is deleted, causing the Copy­File to fail because the destination exists.

The operations Copy­File and Delete­File are synchronous. However, the NT model for file deletion is that a file is deleted when the last open handle is closed.¹ If Delete­File returns and the file still exists, then it means that somebody else still has an open handle to the file.
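The deletion rule described above can be illustrated with a toy model. This is only a sketch: `model_file`, `model_open`, `model_close`, and `model_delete` are hypothetical names invented for illustration, not Windows APIs, and the model assumes that any other opener allowed delete sharing (otherwise the delete request itself would fail with a sharing violation).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a single file. This is not the NT implementation, only the
   rule "marked for deletion now, actually removed when the last handle closes". */
struct model_file {
    bool exists;
    bool delete_pending;
    int  open_handles;
};

/* Opening fails if the file is gone or a delete is already pending. */
bool model_open(struct model_file *f)
{
    if (!f->exists || f->delete_pending) return false;
    f->open_handles++;
    return true;
}

/* Closing the last handle of a delete-pending file removes the file. */
void model_close(struct model_file *f)
{
    assert(f->open_handles > 0);
    f->open_handles--;
    if (f->open_handles == 0 && f->delete_pending) f->exists = false;
}

/* Models a synchronous delete: open, mark for deletion, close our own handle.
   It returns success even if another handle keeps the file alive for a while. */
bool model_delete(struct model_file *f)
{
    if (!model_open(f)) return false;
    f->delete_pending = true;
    model_close(f);
    return true;
}
```

In this model, if a second party holds a handle when `model_delete` runs, the call reports success but `exists` stays true, and further `model_open` calls fail until that party calls `model_close`. That matches the symptom of a copy failing right after a delete that appeared to succeed.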

So who has the open handle? The file was freshly created, so there can't be any pre-existing handles to the file, and we never open it between the copy and the delete.

My psychic powers said, "The offending component is your anti-virus software."

I can think of two types of software that go around snooping on recently-created files. One of them is an indexing tool, but those tend not to be very aggressive about accessing files the moment they are created. They tend to wait until the computer is idle to do their work. Anti-virus software, however, runs in real-time mode, where it checks every file as it is created. And that's more likely to be the software that snuck in and opened the file after the copy completed so it could perform a scan on it, and that open is the extra handle that is preventing the deletion from completing.

But wait, isn't anti-virus software supposed to be using oplocks so that it can close its handle and get out of the way if somebody wants to delete the file?

Well, um, yes, but "what they should do" and "what they actually do" are often not the same.
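If a short-lived scanner handle really is the cause, one common mitigation is a bounded retry with growing delays, which is what the test program already does in miniature. The following is a minimal sketch, not a recommendation from the original investigation. `retry_with_backoff`, `attempt_fn`, and `wait_fn` are hypothetical names. The `flaky_attempt` and `record_wait` helpers are demo stand-ins so the sketch can be exercised without touching the file system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback types: op performs one attempt and returns true on
   success; wait pauses between attempts (on Windows it could call Sleep). */
typedef bool (*attempt_fn)(void *ctx);
typedef void (*wait_fn)(unsigned ms);

/* Runs op up to max_attempts times, doubling the delay after each failure.
   Returns the 1-based number of the attempt that succeeded, or 0 if none did. */
int retry_with_backoff(attempt_fn op, void *ctx, int max_attempts,
                       unsigned first_delay_ms, wait_fn wait)
{
    assert(op != NULL && wait != NULL && max_attempts > 0);
    unsigned delay = first_delay_ms;
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (op(ctx)) return attempt;
        if (attempt < max_attempts) {   /* no point waiting after the last try */
            wait(delay);
            delay *= 2;
        }
    }
    return 0;
}

/* Demo stand-in: an operation that fails a fixed number of times first. */
struct flaky { int failures_left; };

bool flaky_attempt(void *ctx)
{
    struct flaky *f = ctx;
    if (f->failures_left > 0) { f->failures_left--; return false; }
    return true;
}

/* Demo stand-in: records the total requested delay instead of sleeping. */
unsigned total_waited_ms;

void record_wait(unsigned ms) { total_waited_ms += ms; }
```

A real caller would pass a callback that wraps the delete or copy call and a wait function that actually sleeps. Note that retrying only papers over the race; it does not identify which component is holding the handle.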

We never did hear back from the customer whether the guess was correct, which could mean one of several things:

  1. They confirmed the diagnosis and didn't feel the need to reply.

  2. They determined that the diagnosis was incorrect but didn't bother coming back for more help, because "those Windows guys don't know what they're talking about."

  3. They didn't test the theory at all, so had nothing to report.

We may never know what the answer is.

Note

¹Every so often, the NT file system folks dream of changing the deletion model to be more Unix-like, but then they wonder if that would end up breaking more things than it fixes.

Comments (29)
  1. Rick C says:

    How often, if at all, does Microsoft attempt to contact these people and determine if there was a resolution?  If you do it, does it ever succeed?

    [We have no way of contacting the customer. We don't even know who the customer is! Only the customer liaison knows. -Raymond]
  2. laonianren says:

    The old Indexing Service (NT4 through XP) had a similar bug.  If you quickly created and deleted a directory the Indexing Service would retain a handle to it, leaving a zombie directory that you couldn't do anything with.  Killing the service killed the zombie.

  3. dave says:

    > Every so often, the NT file system folks dream of changing the deletion model to be more Unix-like, but then they wonder if that would end up breaking more things than it fixes.

    Do you know the history of 'why' the current model was chosen?  It seems to be more complicated to use, and I assume more complicated to implement. So therefore I suppose it must have been chosen deliberately, but why?

    Aside: fans of arcana may care to read [MS-FSA].pdf for the gory details.

  4. Joshua says:

    Well, I know about interference being blamed on antivirus. We have had similar incidents for which we gave the same explanation by the same twists of reasoning, but when it's our customers that hit it, they are unwilling to remove the antivirus to test it (merely disabling it doesn't actually work with most antivirus software when you're trying to probe for the problems it causes), so it never gets truly tested.

    [¹Every so often, the NT file system folks dream of changing the deletion model to be more Unix-like, but then they wonder if that would end up breaking more things than it fixes.]

    In my testing, 99% of programs that don't pass FILE_SHARE_DELETE to CreateFile are safe to do so, and 99% of these don't pass it because they create via fopen() which doesn't know. fopen() probably should pass FILE_SHARE_DELETE because it originated on UNIX where that was the behavior.

    Do that little bit, that is, find a way to change the default behavior to FILE_SHARE_DELETE asserted, and I can test for the rest, as to whether or not such a program will actually break. A new flag to MoveFileEx and CreateFileEx to cross-assert FILE_SHARE_DELETE would work just as well.

  5. Interference from AntiVirus and other "grabby" software is the reason why I am paranoid nowadays about closing a file I am soon going to re-open. We've seen cases where we close a temporary file we created and no one else should care about, almost immediately try to re-open it, and get a sharing violation. Something "grabbed" it as soon as we let go of it. Not very nice. It was backup software, I think.

  6. KeyJ says:

    Anti-Virus software might be the number one offender in such a situation, but the Shell's video file analyzer is certainly a close second. More often than not, I'm not able to delete video files because Explorer does stuff like compute the duration of the file (which I'm not interested in) or generate thumbnails (which I generally don't use). In Windows XP, there was a simple workaround for this (regsvr32 /u shmedia.dll), but in Windows 7, the only way to stop the madness is to remove HKCR\.<filetype>\ShellEx\{3D1975AF-0FC3-463d-8965-4DC6B5A840F4} for every possible video file type. Even worse, these entries might reappear after updates :(

    [Explorer uses oplocks to detect that somebody is trying to delete the file and it tells the previewer to get out of the way. I guess some previewers are stubborn. -Raymond]
  7. Adrian says:

    We commonly see this while doing builds.  The linker creates a new .dll or .exe and a corresponding .pdb, and it takes a while for the anti-malware (and other IT-mandated surveillance software) to scan the new files, especially since the build is largely I/O bound anyway.  Meanwhile, in a subsequent build step, the manifest tool tries to update the manifest information in the newly created binary, but it fails because the aforementioned software still has the files locked for the duration of the scans.

    After upgrading the toolchain, we've seen that the manifest tool now backs off and retries a couple times before giving up.  Most of the time, that's enough to make the build successful, but not always.

  8. I've hit this a few times with various backup/sync tools – try building in a synced or backed up directory, you'll occasionally find you can't replace foo.exe because it's being synced at the time.

    I am less irritated by this minor misdemeanor after finding one AV product which actually bugchecked (BSOD) the whole system any time a file was closed which had previously been opened by number. It seemed to create an internal record (the filename?) on open, then free that structure on close … no filename, no buffer, free junk, unhappytime.

  9. alegr1 says:

    Previewers have been the bane of file/directory deletion in XP. They behave better these days, though.

    And nobody has mentioned yet the infamous "disappearing source file" problem in the VB6 IDE (or VB5?). Yep, that's because of some antivirus.

  10. David Walker says:

    "[We have no way of contacting the customer. We don't even know who the customer is! Only the customer liaison knows. -Raymond]"

    My psychic powers tell me that we should now ask the question that Rick C was really trying to ask, which is the next obvious question:  "How often, if at all, does Microsoft ask the customer liaison to contact these people and determine if there was a resolution?"

    I don't think it's that hard to answer Rick C's question!  It's as if you, Raymond, answered the question "Do you know what time it is?" by saying "Yes, I do."  :-)

    [Usually the developer who answers the question simply answers the question. It's tedious to write "Try the suggestion above and let us know if it worked" at the end of every message. Sometimes the customer liaison will reply "That worked great, thanks" and sometimes they just go silent. -Raymond]
  11. Alois Kraus says:

    How do you rule your Anti Virus software out if company policy does not allow (for good reasons) you to uninstall or even disable it? I had similar experiences with multi threaded file copying where during random times the copy operation did fail although no file was copied two times (see stackoverflow.com/…/privcopyfileexw-bug-in-windows). MS support told me that it was (as always) the virus scanner. But not this time. It was a bug in Windows (even the first Beta of Windows 8 still had it) since Vista. I have not tried out the latest Windows 8 build if the issue is fixed there now.

  12. Michael Grier [MSFT] says:

    If you run into this kind of thing, a trace with call stacks from either Sysinternals Process Monitor or xperf/WPA (capture all the filesystem operations with call stacks) can be very illuminating.  Call stacks are necessary since so many things run as plug-ins/add-ins, and some antivirus software runs in-process by shimming all calls through to the OS that do file manipulation.

  13. > We have no way of contacting the customer. We don't even know who the customer is! Only the customer liaison knows

    From an external perspective, "We" is Microsoft, not the Shell team or the Windows team.  In particular "we" includes the customer liaison.

    [Then you'll have to ask a customer liaison blogger. -Raymond]
  14. cheong00 says:

    Do you have access to the knowledge-base software used by the customer liaison? If you do and are really interested, I suppose you can do a keyword search to see if any new entry has been added.

  15. Georg Rottensteiner says:

    Not really related (only very loosely), but why oh why was the Explorer/shell behaviour changed when pressing delete with a selection of files? In older versions of Windows (even pre-XP, I think), the very first thing that would always pop up was the security message "Do you really want to delete …". So my fingers learned, for things to delete, to press delete and enter in very quick succession (to directly confirm the message).

    Now there is a bigger delay between the time delete was pressed and the confirmation message comes up. And Explorer actually reacts on things in between. This change really bit me hard.

    Once I got a trojan (must've been XP pre-SP2), and several Britney Spears Game.exe's appeared in a few folders. So I used search to find all instances, selected them all, and pressed delete followed by enter. Yay! Delete was passed off to the shell, which did godknowswhat, and the enter press started all selected instances of the trojan. While my PC was going down, I saw the shell delete confirmation box come up.

  16. Neil says:

    Older versions of the Vembu backup open file agent did this. I know because, in an ironic twist, it prevented Sophos Anti-Virus from updating itself! I don't see the problem with a recent version of the software; I think they switched to using volume shadow copies or something.

  17. Ian Boyd says:

    @Georg Rottensteiner  i noticed that nearly immediately, and it has been aggravating me ever since.

    i presume it was done so that Explorer can continue to remain responsive, while preparing a list of thousands of files. i wish it would stop.

  18. Danny says:

    1 – Modern anti-viruses all have exceptions implemented. Exception to files, exceptions to folders, exceptions to protocols, exceptions to whatever you dream of. If you have a predefined work area tell the anti-virus to not touch it. It really helps.

    2 – "// assume "a" is a large file, say, 1MB." Heck, I create such "large" files, hundreds of them each day – actually I create hundreds of 10 MB files each day. 1 MB is hardly a large file with current hardware capabilities. (anti-nitpicker comment – <you use Delphi which is a poor IDE and creates large files just for the sake of it>)

  19. iPad test says:

    post from iPad doesn't seem to work

  20. ISS test says:

    post from international space station does work.

  21. Engywuck says:

    @neil: we still have one of these systems at work (till January or so). Really great, when it randomly decides "this server has seen a lot of file changes recently, better lock them all at the same time and back them up" with some extra-large (>1GB) files mixed in the bunch. While some user is moving directories around…

    We asked for VSS when they last announced a new version. Got the answer "can slow things down in some circumstances, won't do it". Yeah, sure….

  22. AUS test says:

    lnɟssǝɔɔns ʎllɐıʇɹɐd ɐılɐɹʇsn∀ ɯoɹɟ ʇsoԀ

  23. Skyborne says:

    @Georg Rottensteiner ~ I just disable the delete confirmation.  If I really screw up, these days I have a 40+ GB recycle bin.

  24. Brian_EE says:

    @Alois: "How do you rule your Anti Virus software out if company policy does not allow (for good reasons) you to uninstall or even disable it?"

    You install <insert favorite PC virtualization tool name here> and create one or more virtual machines (of various flavors) to test your software on.

  25. Joshua says:

    @Brian_EE: Even better, virtualize the originally provided image, then leave it running but don't use it.

  26. Random User 43896 says:

    The other fun one for my group was the anti-virus that bogged everything down, but only when it was disabled. Apparently the scanning engine wouldn't shut off the "drivers" when it stopped listening, and the drivers would sit and wait for… something?

    The solution was to either uninstall the AV, or never *ever* disable it. (For reasons unknown, the client had deployed AV to _almost_ all the servers, and then intentionally configured it to be disabled on specific machines. Instead of including them in the "don't install" list.)

  27. Dale says:

    Question:

    Does Microsoft work with AV vendors, to encourage them to use Oplocks?  I know of one current corporate AV product which does exactly what Raymond posts above.  Plus, it prevents you from ejecting USB media "as it's in use".

  28. That sounds like a problem we were having. Perhaps you "had to (write) that because of me".

  29. Neil says:

    Part of the configuration detection script used to build Firefox includes autoconf macros that quickly compile and delete executables. Every so often, this fails. In January they finally gave up and added in a 1 second delay to the script. Then in May they decided that this wasn't enough and bumped it up to 2 seconds…
