What's wrong with this code, part 9 (File Copy) - the answers

Wow, where to begin on this one. Let's start with the easy one, the simple coding error. Here it is:

    //
    // Account for data read.
    //
    dataBlock->dataLengthWritten->QuadPart += dataBlock->readLength;

The problem with this is that 64-bit math isn't atomic on 32-bit processors; the addition is performed as two separate 32-bit operations. So if you tried to copy files over 4GB in size, two worker threads could interleave their updates to the two halves of the dataLengthWritten field, corrupting the count and causing measurement issues.

To fix that, you need to use the InterlockedExchangeAdd64() function, which performs the operation atomically.
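A minimal sketch of that fix, assuming the same data block layout as the original code (the struct definition here is reconstructed for illustration):

    #include <windows.h>

    // Reconstructed for illustration - the real structure has more fields.
    typedef struct _COPY_DATA_BLOCK {
        LARGE_INTEGER *dataLengthWritten;   // shared between the worker threads
        DWORD          readLength;
    } COPY_DATA_BLOCK;

    void AccountForDataRead(COPY_DATA_BLOCK *dataBlock)
    {
        //
        // Account for data read.  InterlockedExchangeAdd64 makes the 64-bit
        // addition atomic even on a 32-bit processor.
        //
        InterlockedExchangeAdd64(&dataBlock->dataLengthWritten->QuadPart,
                                 dataBlock->readLength);
    }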

Now for the complicated coding error.  That one shows up here:

    if (!SetFilePointerEx(dataBlock->sourceHandle, dataBlock->fileDataOffset, NULL, FILE_BEGIN))
    {
        printf("Unable to set current file position in source to %I64d: %d", dataBlock->fileDataOffset.QuadPart, GetLastError());
        exit(1);
    }
    DWORD readLength;
    if (!ReadFile(dataBlock->sourceHandle, dataBuffer, dataBlock->readLength, &readLength, NULL))
    {
        printf("Unable to read data from file: %d\n", GetLastError());
        exit(1);
    }

The problem here is that the file wasn't opened for overlapped I/O.  If you don't specify FILE_FLAG_OVERLAPPED on the call to CreateFile, the OS synchronizes access to the file.  It does that by acquiring a lock across each file operation.  But the key thing is that this lock is only held across individual operations.  So if there were two worker threads running, their calls to SetFilePointerEx would reset each other's file position.  In other words, you'd have:

    Thread 1                    Thread 2
    Set file pointer to 0
                                Set file pointer to 20
    Read from file
                                Read from file

In general, if you don't specify FILE_FLAG_OVERLAPPED, then the OS makes no guarantees about what happens with file I/O if you attempt to do the I/O on more than one thread. 

So the first thread would think it's reading at offset 0, but would in fact be reading at 20.  To fix this, the file has to be opened with the FILE_FLAG_OVERLAPPED flag.  The copy routine then has to remove the calls to SetFilePointerEx and instead replace the ReadFile with:

    DWORD readLength;
    OVERLAPPED readOverlapped = {0};    // zero the reserved fields before use
    readOverlapped.hEvent = CreateEvent(NULL, FALSE, TRUE, NULL);
    readOverlapped.Offset = dataBlock->fileDataOffset.LowPart;
    readOverlapped.OffsetHigh = dataBlock->fileDataOffset.HighPart;
    if (!ReadFile(dataBlock->sourceHandle, dataBuffer, dataBlock->readLength, NULL, &readOverlapped))
    {
        DWORD error = GetLastError();
        if (error == ERROR_IO_PENDING)
        {
            if (!GetOverlappedResult(dataBlock->sourceHandle, &readOverlapped, &readLength, TRUE))
            {
                printf("Unable to read data from file: %d\n", GetLastError());
                exit(1);
            }
        }
        else
        {
            printf("Unable to read data from file: %d\n", GetLastError());
            exit(1);
        }
    }
    CloseHandle(readOverlapped.hEvent);

And this leads neatly to the design errors.  As I mentioned, there are at least three of them.

The first is fairly large - the reality is that this code doesn't actually overlap reads and writes; it just replaces them with synchronous operations running on worker threads.  That doesn't generate a lot of parallelism - the code still spends most of its time blocked waiting for I/O to complete.  Instead, the code should be reworked around a finite state machine (or recoded to use ReadFileEx with a callback function) to force the work to proceed in a more orderly fashion; a rough sketch of the callback approach follows.
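Something like this, with a placeholder file name; a real copy loop would issue the next read (or the matching write) from inside the completion routine, driving the state machine, instead of just setting a flag:

    #include <windows.h>
    #include <stdio.h>

    static BOOL g_readDone = FALSE;

    // Completion routine: runs on this thread whenever it enters an alertable wait.
    static VOID CALLBACK ReadComplete(DWORD errorCode, DWORD bytesTransferred,
                                      LPOVERLAPPED overlapped)
    {
        if (errorCode != 0) {
            printf("Read failed: %lu\n", errorCode);
        } else {
            printf("Read %lu bytes at offset %lu\n", bytesTransferred, overlapped->Offset);
            // Next state here: queue the matching WriteFileEx, or the next read.
        }
        g_readDone = TRUE;
    }

    int main(void)
    {
        HANDLE source = CreateFileW(L"source.dat", GENERIC_READ, FILE_SHARE_READ,
                                    NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (source == INVALID_HANDLE_VALUE) {
            printf("Unable to open source: %lu\n", GetLastError());
            return 1;
        }

        static char buffer[64 * 1024];
        OVERLAPPED overlapped = {0};
        overlapped.Offset = 0;              // the file offset lives in the OVERLAPPED
        overlapped.OffsetHigh = 0;

        if (!ReadFileEx(source, buffer, sizeof(buffer), &overlapped, ReadComplete)) {
            printf("ReadFileEx failed: %lu\n", GetLastError());
            return 1;
        }

        // The completion routine only runs while this thread is alertable.
        while (!g_readDone) {
            SleepEx(INFINITE, TRUE);
        }

        CloseHandle(source);
        return 0;
    }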

The second design error is that the code effectively reads (and writes) randomly through the file.  QueueUserWorkItem queues requests in order, but if more than one worker thread is running, the order in which the operations actually execute is effectively random.  This means that the cache manager will thrash mightily trying to pre-read data.  This can be mitigated by specifying FILE_FLAG_RANDOM_ACCESS on the CreateFile call.  But the code STILL has problems - it would be nice if the OS could read ahead of the writes, and doing random I/O doesn't help with that.  In reality, using unbuffered I/O (and limiting the I/O to the native cluster size on disk) is probably a better choice - in other words, accepting responsibility for caching the file data inside the application (a sketch of opening the file that way follows).  And it's the I/O ordering issue that leads to the third design failure.
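Here's what opening the source for unbuffered, overlapped I/O might look like; the path, drive, and transfer size are placeholders.  With FILE_FLAG_NO_BUFFERING the cache manager is bypassed entirely, so offsets, transfer sizes, and buffer addresses must all be multiples of the volume's sector size:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        DWORD sectorsPerCluster, bytesPerSector, freeClusters, totalClusters;
        if (!GetDiskFreeSpaceW(L"C:\\", &sectorsPerCluster, &bytesPerSector,
                               &freeClusters, &totalClusters)) {
            printf("GetDiskFreeSpace failed: %lu\n", GetLastError());
            return 1;
        }

        HANDLE source = CreateFileW(L"C:\\temp\\source.dat", GENERIC_READ,
                                    FILE_SHARE_READ, NULL, OPEN_EXISTING,
                                    FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING,
                                    NULL);
        if (source == INVALID_HANDLE_VALUE) {
            printf("Unable to open source: %lu\n", GetLastError());
            return 1;
        }

        // Transfers must be sector aligned; VirtualAlloc returns page-aligned
        // memory, which satisfies the alignment requirement.
        SIZE_T chunkSize = 64 * 1024;   // a multiple of common sector sizes
        void *buffer = VirtualAlloc(NULL, chunkSize, MEM_COMMIT | MEM_RESERVE,
                                    PAGE_READWRITE);
        printf("Sector size %lu, buffer at %p\n", bytesPerSector, buffer);

        VirtualFree(buffer, 0, MEM_RELEASE);
        CloseHandle(source);
        return 0;
    }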

The third error comes from extending the file with asynchronous writes: if you ever write beyond the current "valid data length" of the file, the filesystem starts a lazywriter thread that backfills the intervening file data with zeros.  If you later come along and write that data yourself (which will happen with this design), you end up writing twice as much data.  And that leads to a fourth design error.
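One possible mitigation (not something discussed above) is to size the destination up front and, if the process holds the SE_MANAGE_VOLUME privilege, push the valid data length forward with SetFileValidData so out-of-order writes don't trigger the zero backfill.  A sketch, with the privileged call left commented out because it also exposes whatever stale data is in the allocated clusters:

    #include <windows.h>
    #include <stdio.h>

    BOOL PresizeDestination(HANDLE destination, LONGLONG sourceSize)
    {
        LARGE_INTEGER size;
        size.QuadPart = sourceSize;

        // Extend the destination to its final size before any writes are queued.
        if (!SetFilePointerEx(destination, size, NULL, FILE_BEGIN) ||
            !SetEndOfFile(destination))
        {
            printf("Unable to presize destination: %d\n", GetLastError());
            return FALSE;
        }

        // Requires SE_MANAGE_VOLUME privilege:
        // if (!SetFileValidData(destination, sourceSize)) { ... }

        return TRUE;
    }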

The fourth design failure is that the code isn't recognizing the physical limitations of hard disks - a hard disk typically has only one spindle and only one actuator arm (with multiple heads).  It can only do one thing at a time, which means that if your source and destination are on the same drive, overlapping the reads and writes in this manner is going to thrash the drive's heads - which will hurt your performance.  You'd be better off reading as much of the data from the source at once and then writing it out to the destination; a sketch of that pattern follows.
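A minimal version of that pattern - read a large chunk, then write it, so the heads aren't see-sawing between source and destination (the chunk size is an arbitrary example):

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    BOOL CopyChunked(HANDLE source, HANDLE destination)
    {
        const DWORD chunkSize = 4 * 1024 * 1024;    // 4MB per pass
        BYTE *buffer = (BYTE *)malloc(chunkSize);
        if (buffer == NULL) {
            return FALSE;
        }

        for (;;) {
            DWORD bytesRead;
            if (!ReadFile(source, buffer, chunkSize, &bytesRead, NULL)) {
                free(buffer);
                return FALSE;
            }
            if (bytesRead == 0) {
                break;                              // end of file
            }

            DWORD bytesWritten;
            if (!WriteFile(destination, buffer, bytesRead, &bytesWritten, NULL) ||
                bytesWritten != bytesRead) {
                free(buffer);
                return FALSE;
            }
        }

        free(buffer);
        return TRUE;
    }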

Oh, and as to why my testing didn't show any problems?  That's because QueueUserWorkItem is quite conservative about spinning up new threads - in all my testing, I never saw more than one thread running - so my code didn't actually speed up anything because it was never run in parallel.  And that's a fifth design error (although if you'd made the entire process fully asynchronous, it wouldn't need ANY worker threads).
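For what it's worth, QueueUserWorkItem does take a hint for exactly this situation: the WT_EXECUTELONGFUNCTION flag tells the thread pool that the callback may block for a long time (as these I/O-bound work items do), which makes it more willing to spin up additional threads.  A tiny illustration, with a stand-in work routine:

    #include <windows.h>
    #include <stdio.h>

    static DWORD WINAPI CopyWorkItem(LPVOID context)
    {
        printf("Work item %p running on thread %lu\n", context, GetCurrentThreadId());
        Sleep(100);                                 // simulate a blocking I/O
        return 0;
    }

    int main(void)
    {
        for (int i = 0; i < 8; i++) {
            if (!QueueUserWorkItem(CopyWorkItem, (LPVOID)(ULONG_PTR)i,
                                   WT_EXECUTELONGFUNCTION)) {
                printf("QueueUserWorkItem failed: %lu\n", GetLastError());
                return 1;
            }
        }
        Sleep(1000);                                // crude wait, just for the demo
        return 0;
    }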

Kudos:

B.Y. nailed almost all the issues in the first comment.

The first issue (synchronization of the byte count) was also caught by Skywing.

Stuart Ballard caught the fact that the random writes had the potential of causing fragmentation issues.

Harald totally nailed the perf issue caused by doing random I/O to the disks (the 4th issue above).

Later on, B.Y. nailed the concept of using a read queue and a write queue for the I/O.

Not bugs:

Harald pointed out that alternate data streams aren't copied; that's by design (see the original article).

Capt. Jean-Luc Pikachu asked about the operator new not throwing.  By default (unless you #include <new>), the C runtime operator new doesn't throw; instead it returns NULL.
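To illustrate the distinction: with <new> included, a conforming operator new throws std::bad_alloc on failure, and you have to ask for the non-throwing behavior (which the copy code's NULL check relied on) explicitly:

    #include <new>
    #include <cstdio>

    int main()
    {
        // std::nothrow reproduces the "returns NULL on failure" behavior.
        char *dataBuffer = new (std::nothrow) char[64 * 1024];
        if (dataBuffer == nullptr) {
            std::printf("Allocation failed\n");
            return 1;
        }
        delete[] dataBuffer;
        return 0;
    }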

Mea Culpas:

B.Y. also caught:

  • A missed error check on the QueueUserWorkItem() call
  • A potential exception if an allocation error occurs inside the loop queuing the items

Mike Dimmick also pointed out issues with the process heap critical section (used under the covers by operator new).  This is a very real scalability issue, and since this particular problem is all about scalability, it is relevant.  He also picked up on some of the worker thread queue issues that made up the fifth error above.

The reason I started writing this "what's wrong with this code" post was to come up with an example that shows the problem with doing I/O to a synchronous file handle on more than one thread.  It was only after writing the code that I realized what a truly extraordinary bug farm could be found in the simple task of copying a file as quickly as possible.  When I started writing the answer, I thought I'd identified only two design issues; by the time I'd finished, I had five of them.

The simple answer is: Use CopyFile and let the OS do the heavy lifting for you :)
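Or, in code (the paths here are placeholders):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        // FALSE: don't fail if the destination already exists (overwrite it).
        if (!CopyFileW(L"C:\\temp\\source.dat", L"C:\\temp\\destination.dat", FALSE)) {
            printf("CopyFile failed: %lu\n", GetLastError());
            return 1;
        }
        return 0;
    }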