How can I append to a file and know where it got written, even if the file is being updated by multiple processes?

A customer had a collection of processes, all writing to a single file. Each process wants to append some data to the file and also know where the appended data got written, because the location of the appended record needs to be saved somewhere else. "We are currently using a named mutex whose name is derived from the path to the file. To add a new record, we take the mutex, set the file pointer to the end of the file, record the current position, write the data, then release the mutex. This works, but it feels clunky, and it breaks down if the same file is known by multiple names, or if multiple computers try to append to the same file. Is there a better way?"

Now, if the program needed to append data but didn't care where it got appended, then it could make the file system do the work: Open the file for FILE_APPEND_DATA | SYNCHRONIZE and nothing else. (In particular, do not open for FILE_WRITE_DATA.) This is documented as meaning that the caller can write only to the end of the file, and any offset information provided in the write operation is ignored. Unfortunately, the technique doesn't tell you where the data got written,¹ so it doesn't help in this case.
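
For concreteness, here is roughly what that open would look like. This is a minimal sketch, not the customer's code: the file name is hypothetical, and error handling is omitted.

#include <windows.h>

// Open for append-only access: every write goes to the current end of
// the file, and any offset supplied with the write is ignored.
HANDLE OpenForAppendOnly()
{
 return CreateFileW(L"shared.log",                      // hypothetical name
                    FILE_APPEND_DATA | SYNCHRONIZE,     // no FILE_WRITE_DATA
                    FILE_SHARE_READ | FILE_SHARE_WRITE, // others may write too
                    nullptr, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
}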

This is a job for LockFile. In fact, this is not only a job for LockFile, this is precisely the job that LockFile was created to solve. The LockFile function is so proud of its job that there's even a sample program right there in MSDN showing how to use file locking to append data. But that sample isn't quite the scenario we have here, because it assumes that only one process is writing (it opens the file in deny-write mode) and merely needs to lock out reads. In our case, we also want to permit others to write to the file, except while we are extending it.

I sketched out a few different algorithms for the customer. First, you could agree that byte zero is the "I am appending" signal. This is merely using the file as its own synchronization object.

// Requires that everybody agree that byte 0 is the lock
AppendData()
{
 LockFile(from 0 to 0);
 size = GetFileSize();
 WriteAt(size, data);
 UnlockFile(from 0 to 0);
}

But choosing byte zero makes that byte of the file inaccessible while the lock is held, even though the byte itself is unrelated to the append operation. Therefore, you are probably better off locking a nonexistent byte well beyond the anticipated maximum file size. Byte 0xFFFFFFFF`FFFFFFFF will probably do nicely.
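
With the real APIs, that might look something like the sketch below. It assumes, beyond what is specified above, a synchronous handle opened for read/write with read/write sharing, and it omits error handling. It uses LockFileEx rather than LockFile so that the call waits for the lock instead of failing immediately; the out-parameter name whereItWent is invented for illustration.

#include <windows.h>

bool AppendData(HANDLE h, const void* data, DWORD size, LONGLONG* whereItWent)
{
 // Take the "mutex": exclusively lock the agreed-upon byte
 // 0xFFFFFFFF`FFFFFFFF. Without LOCKFILE_FAIL_IMMEDIATELY, the call
 // waits until the lock is granted.
 OVERLAPPED lockPos = {};
 lockPos.Offset = 0xFFFFFFFF;     // low half of the lock byte's offset
 lockPos.OffsetHigh = 0xFFFFFFFF; // high half
 if (!LockFileEx(h, LOCKFILE_EXCLUSIVE_LOCK, 0, 1, 0, &lockPos)) return false;

 // Everybody agrees not to append except while holding the lock,
 // so the end of the file cannot move until we release it.
 LARGE_INTEGER end;
 GetFileSizeEx(h, &end);

 OVERLAPPED writePos = {};
 writePos.Offset = end.LowPart;
 writePos.OffsetHigh = (DWORD)end.HighPart;
 DWORD written;
 BOOL ok = WriteFile(h, data, size, &written, &writePos);
 if (ok) *whereItWent = end.QuadPart; // where the record landed

 UnlockFileEx(h, 0, 1, 0, &lockPos);  // release the "mutex"
 return ok != FALSE;
}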

Better would be to use file locking in the way it was intended: to assert access to a range of bytes in the file because you actually want to access them. (File locking comes from the database world, where you would lock a record, perform an update, then unlock the record.)

AppendData()
{
 originalSize = GetFileSize();
 LockFile(from originalSize to 0xFFFFFFFF`FFFFFFFF);
 actualSize = GetFileSize();
 WriteAt(actualSize, data);
 UnlockFile(from originalSize to 0xFFFFFFFF`FFFFFFFF);
}

The idea here is that you lock the entire remainder of the file, from its current size out to infinity. If the file grows between the time you read the size and the time you take the lock, that's okay: since the file only ever grows, the lock still covers everything from the true end of the file onward; you merely locked a bit more than necessary.
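
Here is the corresponding sketch with the real APIs, under the same assumptions as before (synchronous read/write handle, sharing enabled, error handling omitted). The 64-bit lock length is passed to LockFileEx as two 32-bit halves.

#include <windows.h>

bool AppendData(HANDLE h, const void* data, DWORD size, LONGLONG* whereItWent)
{
 LARGE_INTEGER original;
 GetFileSizeEx(h, &original);

 // Lock from the current end of the file out to byte 0xFFFFFFFF`FFFFFFFF.
 ULONGLONG length = ~0ULL - (ULONGLONG)original.QuadPart;
 OVERLAPPED lockPos = {};
 lockPos.Offset = original.LowPart;
 lockPos.OffsetHigh = (DWORD)original.HighPart;
 if (!LockFileEx(h, LOCKFILE_EXCLUSIVE_LOCK, 0,
                 (DWORD)length, (DWORD)(length >> 32), &lockPos)) return false;

 // The file may have grown since we measured it, but the true end of
 // the file is still inside our locked range, so we can write there.
 LARGE_INTEGER actual;
 GetFileSizeEx(h, &actual);

 OVERLAPPED writePos = {};
 writePos.Offset = actual.LowPart;
 writePos.OffsetHigh = (DWORD)actual.HighPart;
 DWORD written;
 BOOL ok = WriteFile(h, data, size, &written, &writePos);
 if (ok) *whereItWent = actual.QuadPart; // where the record landed

 UnlockFileEx(h, 0, (DWORD)length, (DWORD)(length >> 32), &lockPos);
 return ok != FALSE;
}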

If it's possible for the file to shrink in size, then you need to detect that case and expand the lock so that it covers the region you intend to write to.

AppendData()
{
 originalSize = GetFileSize();
 LockFile(from originalSize to 0xFFFFFFFF`FFFFFFFF);
 actualSize = GetFileSize();
 if (actualSize < originalSize) {
  UnlockFile(from originalSize to 0xFFFFFFFF`FFFFFFFF);
  originalSize = actualSize;
  LockFile(from originalSize to 0xFFFFFFFF`FFFFFFFF);
 }
 WriteAt(actualSize, data);
 UnlockFile(from originalSize to 0xFFFFFFFF`FFFFFFFF);
}

Or you can be sloppy and just lock the entire file. It's more expansive than you need, but it'll get the job done.

AppendData()
{
 LockFile(from 0 to 0xFFFFFFFF`FFFFFFFF);
 size = GetFileSize();
 WriteAt(size, data);
 UnlockFile(from 0 to 0xFFFFFFFF`FFFFFFFF);
}
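
In real code, the sloppy version is the same sketch as before with the lock range starting at zero. Because Windows byte-range locks are mandatory for writers, holding the whole-file lock also means the file size cannot change between the GetFileSizeEx and the WriteFile. (Same assumptions and caveats as the earlier sketches.)

#include <windows.h>

bool AppendData(HANDLE h, const void* data, DWORD size, LONGLONG* whereItWent)
{
 // Lock the entire file: offset 0, length 0xFFFFFFFF`FFFFFFFF.
 OVERLAPPED lockPos = {}; // Offset and OffsetHigh are both zero
 if (!LockFileEx(h, LOCKFILE_EXCLUSIVE_LOCK, 0,
                 0xFFFFFFFF, 0xFFFFFFFF, &lockPos)) return false;

 LARGE_INTEGER end;
 GetFileSizeEx(h, &end); // cannot change while we hold the lock

 OVERLAPPED writePos = {};
 writePos.Offset = end.LowPart;
 writePos.OffsetHigh = (DWORD)end.HighPart;
 DWORD written;
 BOOL ok = WriteFile(h, data, size, &written, &writePos);
 if (ok) *whereItWent = end.QuadPart;

 UnlockFileEx(h, 0, 0xFFFFFFFF, 0xFFFFFFFF, &lockPos);
 return ok != FALSE;
}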

The customer was okay with the sloppy version, and noted that using file locks also solves the problem of files with multiple names (due to hard links or network aliasing), as well as permitting multiple computers to operate on the file simultaneously.

¹ You might hope that the OVERLAPPED.Offset member would be updated with the actual file offset used, but sadly it isn't.

Comments (7)
  1. Damien says:

    In your "shrinkable" sample, shouldn't you loop? If the file can both grow and be shrunk by external actors, isn't it possible that it shrank again between your UnlockFile and LockFile inside the if?

    [You're right. Better would be to require that the file be locked when shrunk. Then you wouldn't need to re-lock. (Basically, fall back to using byte 0xFFFFFFFF`FFFFFFFF as the "I'm changing the file size" signal.) -Raymond]
  2. Joshua says:

    FILE_APPEND_DATA is tricky unless you like occasional split writes (therefore I only use it for log files). Good call on LockFile. LockFile doesn't work reliably across a network due to unavoidable physics. There ought to be a solution using the transactional file system (handle the rollbacks by starting again).

  3. waleri says:

    What about using

    FILE_APPEND_DATA + GetFileSize + WriteFile?

    Eventually WriteFile would fail if you're NOT at the end of the file, and in that case you retry the operation.

    [If you open in FILE_APPEND_DATA mode, then the WriteFile ignores your current position and always appends, as noted in the linked article. (I.e., WriteFile succeeds even if you are not at the end of the file.) -Raymond]
  4. Michael says:

    Apart from "being sloppy" (and preventing writing to earlier parts of the file, if that's something you need) are there any other downsides to locking the entire file?

    I.e., is it more expensive in addition to being more expansive?

  5. Douglas says:

    @Michael

    Disclaimer: I'm just guessing here.

    On its own, probably not at all. With no readers, again, probably not.

    Where it would matter is if somebody wanted to read the first however many bytes/chunks/logs of the file. If you only lock the 'end' of the file and the reader only locks the 'beginning', both can work concurrently, objectives permitting.

    Going off on a tangent:

    You could probably make the data consumer process the file a chunk at a time whenever the file size increases. At that point, you should probably just use a pipe (or named pipe, or FIFO, etc.). For saving the data as it goes through (if you want that), you use a "tee" program.

  6. Juan Garcia says:

    I am afraid to ask, yet the desire to know the answer gets the better of me. Why don't you use a critical section when writing the file? (Just asking.)

  7. Sean says:

    @Juan Garcia

    Because a critical section only works within a single process. The customer was asking for a solution which would work not only cross-process, but also cross-computer, as well as handling the possibility of the same file being accessed through different paths.

    The customer's existing solution used named mutexes, which work cross-process (a process can open a mutex created by another process by using its name), but they do not work across computers. The mutex solution also required every process writing the file to use the same mutex name for that file, and deriving the name from the file's path is not guaranteed to work, because a file can have multiple paths (e.g., due to a hard link).

