Does the CopyFile function verify that the data reached its final destination successfully?


A customer had a question about data integrity when copying files.

I am using File.Copy to copy files from one server to another. If the call succeeds, am I guaranteed that the data was copied successfully? Does the File.Copy method internally perform a file checksum or something like that to ensure that the data was written correctly?

The File.Copy method uses the Win32 CopyFile function internally, so let’s look at CopyFile.

CopyFile just issues ReadFile calls from the source file and WriteFile calls to the destination file. (Note: Simplification for purposes of discussion.) It’s not clear what you are hoping to checksum. If you want CopyFile to checksum the bytes when they return from ReadFile, and checksum the bytes as they are passed to WriteFile, and then compare them at the end of the operation, then that tells you nothing, since they are the same bytes in the same memory.

while (...) {
 ReadFile(sourceFile, buffer, bufferSize);
 readChecksum.checksum(buffer, bufferSize);  // checksum the bytes as they come back from ReadFile

 writeChecksum.checksum(buffer, bufferSize); // checksum the same bytes as they go to WriteFile
 WriteFile(destinationFile, buffer, bufferSize);
}

The readChecksum and writeChecksum are identical because they operate on the same bytes. (In fact, the compiler might even optimize the code by merging the calculations together.) The only way something could go awry is if you have flaky memory chips that change memory values spontaneously.

Maybe the question was whether CopyFile goes back and reads the file it just wrote out to calculate the checksum. But that’s not possible in general, because you might not have read access on the destination file. I guess you could have it do a checksum if the destination were readable, and skip it if not, but then that results in a bunch of weird behavior:

  • It generates spurious security audits when it tries to read from the destination and gets ERROR_ACCESS_DENIED.

  • It means that CopyFile sometimes does a checksum and sometimes doesn’t, which removes the value of any checksum work since you’re never sure if it actually happened.

  • It doubles the network traffic for a file copy operation, leading to weird workarounds from network administrators like “Deny read access on files in order to speed up file copies.”

Even if you get past those issues, you have an even bigger problem: How do you know that reading the file back will really tell you whether the file was physically copied successfully? If you just read the data back, it may end up being read out of the disk cache, in which case you’re not actually verifying physical media. You’re just comparing cached data to cached data.

But if you open the file with caching disabled, this has the side effect of purging the cache for that file, which means that the system has thrown away a bunch of data that could have been useful. (For example, if another process starts reading the file at the same time.) And, of course, you’re forcing access to the physical media, which is slowing down I/O for everybody else.
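
CopyFile doesn’t do any of this, but for illustration, here is roughly what a “read it back with caching disabled” verification pass might look like. This is a hypothetical sketch, not actual CopyFile behavior; note that FILE_FLAG_NO_BUFFERING requires sector-aligned buffers and read sizes, and that it bypasses only the system cache, not the drive’s own cache.

#include <windows.h>

// Hypothetical helper, not part of CopyFile: re-read a file with the system
// cache bypassed and compute a toy checksum over its contents.
BOOL NaiveReadBackChecksum(LPCWSTR path, DWORD *checksum)
{
  HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                         OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
  if (h == INVALID_HANDLE_VALUE) return FALSE; // e.g. no read access on the destination

  // FILE_FLAG_NO_BUFFERING requires sector-aligned reads; VirtualAlloc
  // returns page-aligned memory, which satisfies that requirement.
  DWORD bufferSize = 65536;
  BYTE *buffer = (BYTE *)VirtualAlloc(NULL, bufferSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
  if (!buffer) { CloseHandle(h); return FALSE; }

  DWORD sum = 0, bytesRead = 0;
  while (ReadFile(h, buffer, bufferSize, &bytesRead, NULL) && bytesRead != 0) {
    for (DWORD i = 0; i < bytesRead; i++) sum += buffer[i]; // toy checksum, not a real hash
  }

  VirtualFree(buffer, 0, MEM_RELEASE);
  CloseHandle(h);
  *checksum = sum;
  return TRUE;
}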

But wait, there’s also the problem of caching controllers. Even when you tell the hard drive, “Now read this data from the physical media,” it may decide to return the data from an onboard cache instead. You would have to issue a “No really, flush the data and read it back” command to the controller to ensure that it’s really reading from physical media.
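
The closest standard knobs here are FILE_FLAG_WRITE_THROUGH, which asks that writes not linger in the system cache, and FlushFileBuffers, which asks the operating system and the device to commit any cached writes. Here is a minimal sketch, again hypothetical and not what CopyFile does, with a made-up destination path; even a successful flush gives you no documented way to force the drive to bypass its onboard cache on a later read.

// Hypothetical sketch: write the destination with write-through semantics and
// then explicitly ask for the caches to be flushed.
void WriteDestinationWithFlush(void)
{
  HANDLE destinationFile = CreateFileW(L"\\\\server\\share\\copy.dat", GENERIC_WRITE, 0,
                                       NULL, CREATE_ALWAYS, FILE_FLAG_WRITE_THROUGH, NULL);
  if (destinationFile == INVALID_HANDLE_VALUE) return;

  // ... WriteFile loop goes here ...

  if (!FlushFileBuffers(destinationFile)) {
    // The flush itself can fail (device error, network hiccup): the copy
    // "succeeded", but the data may still not be on stable storage.
  }
  CloseHandle(destinationFile);
}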

And even if you verify that, there’s no guarantee that the moment you declare “The file was copied successfully!” the drive platter won’t spontaneously develop a bad sector and corrupt the data you just declared victory over.

This is one of those “How far do you really want to go?” type of questions. You can re-read and re-validate as much as you want at copy time, and you still won’t know that the file data is valid when you finally get around to using it.

Sometimes, you’re better off just trusting the system to have done what it says it did.

If you really want to do some sort of copy verification, you’d be better off saving the checksum somewhere and having the ultimate consumer of the data validate the checksum and raise an integrity error if it discovers corruption.
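
A rough sketch of that approach, using hypothetical helper names and a toy checksum standing in for a real hash such as SHA-256 (ChecksumFile could be something like the read-back helper sketched earlier, minus FILE_FLAG_NO_BUFFERING): the producer records a checksum of the source next to the copy, and the consumer validates it at the moment it actually uses the data.

// Producer side (hypothetical sketch): checksum the source and store it
// alongside the copied file, for example in a small sidecar file.
DWORD sourceChecksum;
if (ChecksumFile(L"\\\\source\\share\\data.bin", &sourceChecksum)) { // hypothetical helper
  SaveChecksum(L"\\\\dest\\share\\data.bin.sum", sourceChecksum);    // hypothetical helper
}

// Consumer side (hypothetical sketch): validate at the point of use.
DWORD expected, actual;
if (LoadChecksum(L"\\\\dest\\share\\data.bin.sum", &expected) &&     // hypothetical helper
    ChecksumFile(L"\\\\dest\\share\\data.bin", &actual) &&
    expected != actual) {
  // Raise an integrity error here, where the corruption actually matters.
}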

Comments (30)
  1. Damien says:

    And, of course, the file system is an external resource for your application. There's no guarantee that something else won't go in and re-write the contents of that file as soon as it can.

    Similar to the issue with people wanting to check for a file's existence before accessing it, or any other resource with some level of autonomy (e.g. Internet access). Yes, you can go and check whether the file exists before you attempt to open it, but that doesn't mean you don't still have to write the code to cope with the file not being present when you attempt to open it.

  2. acq says:

    You're overthinking it, Raymond. The customer just wanted to know if he can trust the result code of the File.Copy (oh wait, it's void! :) ) Well, he wants to know if he can assume that if there's no exception the file was actually copied to another server. And the answer to that is yes.

    And you can tell him that the network transfer protocols typically don't corrupt the files. However, bad hardware can corrupt the files anywhere, not only over the network. And yes, for such cases, he needs his own checksums.

  3. Adam Rosenfield says:

    If you're copying the file to a remote file system, the network could drop out immediately after you finished copying it, so even if the copy completed successfully, you couldn't verify it anyhow (I guess that's similar to the write-only case).

  4. Matt says:

    I think perhaps part of the question was "how does CopyFile verify that the data got to the destination NETWORK drive successfully".

    CopyFile over the network will use either SMB or WebDAV to write the file – both of these will internally use TCP, which has not one but two checksums (the TCP checksum and the IP checksum).

    So the answer is that yes, CopyFile uses an internal checksum when copying to a network drive. This internal checksum lives in the TCP/IP stack in Windows.

  5. Brian K says:

    What does the command prompt Copy command do when "/v" is specified?

  6. Roger says:

    My thought on reading the question was whether FlushFileBuffers is called at the end.  After that returns you can reasonably believe that the file will be there even if the power fails etc.  On Unix systems you often want to call fsync on the parent directory too since it is possible for the directory to be delayed write.
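
    A minimal sketch of that Unix pattern, assuming a POSIX system (the paths and error handling are illustrative only):

    #include <fcntl.h>
    #include <unistd.h>

    /* After writing the copy, fsync the file itself, then fsync the parent
       directory so the new directory entry is durable as well. */
    void flush_copy(const char *file_path, const char *dir_path)
    {
        int fd = open(file_path, O_WRONLY);
        if (fd != -1) {
            fsync(fd);      /* push the file's contents and metadata to stable storage */
            close(fd);
        }

        int dirfd = open(dir_path, O_RDONLY | O_DIRECTORY);
        if (dirfd != -1) {
            fsync(dirfd);   /* commit the directory entry too */
            close(dirfd);
        }
    }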

  7. Dominik says:

    So now the question is, how does WriteFile work?

  8. Adrian says:

    Raymond's points are all true, but that doesn't mean you don't ever want to try to verify a copy.

    After all, in a command prompt, the copy command has a /V (verify) option (which presumably reads back the copied file and compares it to the original).

    And a common thing to do in my VMS days was to back up files to tape and then delete them from the disk.  Obviously you didn't want to delete the files until you were sure the files made it to tape.  Thus the common command: BACKUP/VERIFY/DELETE <source_files> <target>.  Yeah, the tape might become unreadable later, but that's a separate issue (with separate mitigations) from making sure that the files actually got onto the tape.

  9. AsmGuru62 says:

    I think /V is a remnant of DOS days (1986-ish) where 5" floppies were prone to corruption from 'stray' magnetism, etc.

  10. Bill S says:

    @Brian K: support.microsoft.com/…/126457

    I suspect the /v option exists to maintain backward compatibility but that it doesn't do anything (useful) any more.

  11. Random832 says:

    To pre-empt @Mason Wheeler's next question, the reason you would deny read access to an uploads directory is to prevent people from using it as a hosting service for files not approved by the server's administrator.

  12. Maurits says:

    I assume that if one of the ReadFile or WriteFile calls fails, then CopyFile will report that failure uplevel (or perhaps retry)

  13. Scott says:

    Last time I watched procmon's disk activity when running copy with and without the /v option there was no difference.  This was on XP, YMMV.

  14. 640k says:

    You have to set "VERIFY ON" before copying.

  15. Mark says:

    This would also be a case of optimising incorrect behaviour. The OS shouldn't need to second guess the network filesystem's ability to detect failures.

  16. Mason Wheeler says:

    This really sounds like you're overthinking it.

    First, what possible reason would there be for giving someone write access but not read access to a certain location?  That's screwed up on so many different levels…

    Second, you're right that having the sender compute a checksum of the destination file is a bad idea for all the reasons mentioned.  But why did you even think of doing that in the first place?!?  If I was implementing a system like that, I'd have the *destination* system compute the checksum on the file it received and send it back to the sender for verification.

    [An "uploads" directory is a common scenario where people have write but not read access. -Raymond]
  17. Cheong says:

    @Matt: I second your answer. TCP/IP guarantees delivery of packets. If the payload is lost on the way, it'll tell you that. When the payload reaches the destination, trust the file server to handle it properly. If you can't trust the file server (either hardware or software) to do its job, perhaps it's better to change to another one, because this is one of a file server's primary tasks.

  18. Drak says:

    As an example of Raymond's last statement, I believe ZIP and other archive types include a checksum specifically so that the end consumer can check whether the ZIP is correct.

  19. Gabe says:

    Matt: The IP checksum is just for the header, so it is useless in detecting a corrupt payload. The TCP checksum is just 16 bits, so it's quite possible for one bit error to cancel out another one without being detected.

    Fortunately Ethernet has its own overlapping 32-bit CRC, which should make copying over a LAN pretty safe (WANs are another story altogether). However, back around 2000, Stone and Partridge ("When The CRC and TCP Checksum Disagree") looked at actual packet errors and estimated that the actual odds of undetected errors in Ethernet TCP packets are somewhere between 2^-24 and 2^-33.

    While it seems unlikely, imagine you're simply copying the contents of your 2TB disk across the network — that's 2^44 bits, or about 2^30 packets!

    In other words, even under the least likely circumstances, your network backup of an office full of 1TB hard drives is going to have several corrupted files if your backup software doesn't have its own integrity checks in place (like SMB message signing, which is off by default).

  20. AC says:

    I once tried copying several large files over a WLAN network (desktop and laptop in the same room!) simply by using explorer and shared folders. My reasoning was that as the wireless reception was near perfect and TCP/IP has the mentioned checksums, this should work without problems.

    But after several tries, the files always came out corrupted. I'm still not sure why the TCP/IP checksumming didn't prevent this from happening so predictably.

  21. Matt says:

    @Gabe:

    The checksums in network protocols are designed to detect common network bit-pattern errors; they are not designed to detect all accidental damage caused on the network, and they don't even try to prevent malicious damage caused on the network.

    If you don't like TCP's checksum, you can always upgrade it using the TCP Alternate Checksum Request option. Alternatively, you should probably be using SMB or IPSec for your traffic, both of which upgrade their checksums to cryptographic hashes and make your file unreadable and tamper-proof against evil people sitting on your network.

    But CopyFile won't do that for you. CopyFile just hands the bytes off to the filesystem driver, who will hand them off to the network driver. What the network driver chooses to do with those bytes is up to it.

  22. Gabe says:

    Ian: They say: "After eliminating those errors that the checksum always catches, the data suggests that, on average, between one packet in 10 billion and one packet in a few millions will have an error that goes undetected."

    They do NOT talk about there having to be an error in the first place. Even so, there are over 15 hops between me and this blog site (just to take an arbitrary example), at least one of which is a wireless link over unregulated frequencies (WiFi).

    If errors were uniformly distributed, maybe you could assume they will be rare. However, that's not the case: some links are just more error-prone than others, whether it's bad hardware, buggy software, or EMI. The more hops there are between two hosts, the more likely it is that one of them will be over one of those bad links, which drastically increases the odds of undetected errors.

  23. @Gabe

    Of course, if you look at the frequency of corrupt data you actually get, it makes you wonder how much of an impact those numbers have in real life.

    I can't remember getting a corrupt file from the internet in the last year that could be blamed on the network (the corrupt files that I have received could be attributed to something else, like a congested server truncating files, or the owner of the broadband hardware doing maintenance that caused all of the users to have problems for a few minutes).

    My LAN has had an even better track record: I have never had a corrupt file transferred over this network at all, and it has been active for years. So if this "one packet in 10 billion" ends up with you never getting an error on a LAN, or only around one file a year from the internet being corrupt, then I don't see why you are making such a fuss.

  24. Ian says:

    @Gabe It looks like the 2^-24 to 2^-33 probability is of an error being undetected *given* that there was an error in the first place. Since the probability of an error itself is very small I don't think you'll end up with nearly as many corrupted files as you think.

  25. I once had a network card whose default network drivers bundled on the Windows 95 CD would randomly fail to verify checksums when the card was under load, and pass along corrupt data as if it was correct.  That was fun to figure out.  Fix was to install newer drivers from the vendor since the ones on the Win95 CD turned out to be junk.  I even think there was an MSKB article about this but I can't remember.

  26. satan says:

    SATA errors are written to the event viewer instead of being propagated back through function calls.

  27. At a previous company, I was responsible for putting the demo version of our product on our web site.

    One day, we started getting complaints that the demo would not install.

    My copy of the installer worked, but the copy I downloaded from our site was corrupted, so I re-uploaded.

    After that, I downloaded again – the file was still corrupted, but differently.

    I tried downloading the file several times in succession, and each copy was slightly different.

    I called our hosting provider for support, and the kid who answered said he'd rebuilt the kernel on the server to make downloads more efficient…

  28. Danny says:

    "If the call succeeds, am I guaranteed that the data was copied successfully?"</quote>

    Answer : NO!! No one will ever guarantee that..ever. Why? Liability. Q.E.D.

  29. Pentium100 says:

    There was a bug in Server 2003 that corrupted files written to shared folders (the corruption happened on receive). After that, I started using a program that verifies copied files.

  30. Daniel says:

    TeraCopy does create some checksum for each file it copies/moves.

Comments are closed.