Why is fclose()'s contract so confusing?

Because it’s a long-established pattern and contract, let’s explore fclose() today.

Here’s my ideal fclose() implementation:

int fclose(FILE *fp) {
    close(fp->fd);
    free(fp);
    return 0;
}

But of course close() can fail.  Yuck.  I found a man page for close(2) on the web; maybe it doesn’t represent the modern or most common definition of close(), but even in my idealized world, the implementation of fclose() has to be something more like:

int fclose(FILE *fp) {
    while (close(fp->fd) == -1) {
        if (errno != EINTR) {
            return EOF;
        }
    }
    free(fp);
    return 0;
}

I picked on EINTR to loop on here because the man page for close() calls it out as a condition (without mentioning whether there are or are not un-rolled-back side effects).  My reading is that it means the close() was not able to complete.  If the I/O wasn’t able to complete and I don’t get a chance to retry it, that would seem to be a very fundamental break in the I/O API design, so I’ll assume that close() has the contract I believe it does (a case of Raymond’s “What would happen if the opposite were true?” analysis technique).
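For what it’s worth, here’s the retry idiom pulled out into a helper, as a sketch under the contract I’m assuming (that EINTR means “nothing happened; try again”).  One caveat worth flagging, and this is my addition rather than something that man page told me: POSIX actually leaves the descriptor’s state unspecified after an EINTR failure, and on Linux the descriptor is already gone by then, so retrying can close an unrelated descriptor that another thread just opened.  Which is exactly the kind of contract confusion I’m complaining about.

#include <errno.h>
#include <unistd.h>

// Sketch only: valid if EINTR really means "nothing happened; try
// again."  POSIX leaves the descriptor's state unspecified after an
// EINTR failure, so this loop is only as sound as that assumption.
static int close_retrying_eintr(int fd) {
    int rc;
    do {
        rc = close(fd);
    } while (rc == -1 && errno == EINTR);
    return rc;
}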

Notice also that something very important happened here.  There’s an implicit contract in close()-type functions: regardless of “success” or “failure,” the resource will nonetheless be freed.

But we didn’t deallocate the FILE!  Maybe the right implementation was:

int fclose(FILE *fp) {
    int fd = fp->fd;
    free(fp);
    while (close(fd) == -1) {
        if (errno != EINTR) {
            return EOF;
        }
    }
    return 0;
}

Is this right?  Probably not; if we’re going to return an error like ENOSPC, callers need to be able to retry the operation.

The simple answer is that the close() pattern is a very special contract.  Nobody’s actually going to sit in a loop calling it.  It must deallocate/free the resources.  If the state of the address space/process is trashed, you might as well kill the process (ideally tying in with a just-in-time debugging or post-mortem-dump debugging mechanism).
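In code, that reading might look something like this sketch (close_or_die() is my hypothetical helper, not anything standard): the descriptor is considered released no matter what, and a failure we can’t interpret is treated as evidence of a trashed process, so we die loudly where a just-in-time debugger or dump mechanism can catch it.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// Hypothetical helper: the descriptor is considered released no
// matter what close() returns.  A failure we can't interpret is
// treated as a broken invariant; abort() gives just-in-time or
// post-mortem debugging a chance to see the wreckage.
static void close_or_die(int fd) {
    while (close(fd) == -1) {
        if (errno != EINTR) {
            perror("close");
            abort(); // don't limp along with trashed state
        }
    }
}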

But even my totally idealized fclose() implementation is unrealistically simple.  These are the buffered file I/O functions in the C run-time library, and since people are too lazy to call fflush(), the real implementation has to look something like this:

int fclose(FILE *fp) {
    if (fp->bufpos != 0) {
        const void *buf = fp->buf;
        size_t bytestowrite = fp->bufpos;
        while (bytestowrite != 0) {
            ssize_t BytesWritten = write(fp->fd, buf, bytestowrite);
            // -1 is error; otherwise it's the number of bytes written??
            if (BytesWritten == -1) {
                // What the heck am I going to do now?  Holy crud, look
                // at all the error codes in the man page!  Am I really
                // going to try to figure out which of those are
                // retry-able?
                // I guess I'll just cut and run.
                return EOF;
            }
            // Docs for write(2) don't say that successful writes will
            // have written all the bytes.  Let's hope some did!
            // But wait, what if, say, -2 is returned?  What if I asked
            // for more bytes to be written than ssize_t can represent
            // as a positive number?  The docs don't say; chances are
            // the implementation is naïve, but it seems wrong to
            // second-guess the apparently intentional use of ssize_t
            // instead of size_t for the return type.
            assert(BytesWritten >= 0); // do you sleep better with this?
            bytestowrite -= BytesWritten;
            buf = (const char *) buf + BytesWritten;
        }
    }
    while (close(fp->fd) == -1) {
        if (errno != EINTR) {
            return EOF;
        }
    }
    free(fp); // at least free is void!! Phew!
    return 0;
}
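From the caller’s side, here’s about the best you can do with that return value (a sketch; the file name and data are made up): report the failure and move on, because the FILE is gone either way.

#include <stdio.h>

int main(void) {
    // "out.dat" is a made-up example path.
    FILE *fp = fopen("out.dat", "w");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    static const char data[] = "hello, rundown protocols\n";
    // (I'm ignoring fwrite's return value to keep the sketch short.)
    fwrite(data, 1, sizeof data - 1, fp);
    // Success or failure, fp is invalid after this call; there is
    // nothing to retry and nothing left to clean up.
    if (fclose(fp) == EOF) {
        perror("fclose");
        return 1;
    }
    return 0;
}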

My point here is not to lambaste the U*ix syscall APIs in particular; they’re just one in a long line of badly designed contracts.  Deep down, CloseHandle() has a similar loop inside it, because just one last IRP has to be issued to the file object, and if the IRP can’t be allocated, the code loops.  Arguably the IRP should have been allocated up front, but people have a hard time stomaching such costs.

What I’m really trying to show is how you can design yourself into a corner.  The C buffered stream functions did this on two axes simultaneously.  The fclose() function has to try to issue writes to flush the buffer.  (Yes, it probably would have called fflush(), but since fflush() doesn’t guarantee forward progress, presumably the calls to fflush() would also have to be in a loop; see the sketch below.)  The underlying I/O implementation is fallaciously documented; that is, it tries to look like due diligence was done in documenting all the things that can go wrong, but even with a mature man page on a mature API, there’s no real clue left behind about what to do about those conditions.
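For concreteness, here’s my guess at what a flush-until-it-sticks loop would have to look like; which errno values count as transient is entirely my invention, because nothing documents it, and that’s the whole problem:

#include <errno.h>
#include <stdio.h>

// Hypothetical helper: keep calling fflush() as long as the failure
// looks transient.  The set of "transient" errno values here is a
// guess, not a documented contract.
static int flush_with_retries(FILE *fp, int max_attempts) {
    while (max_attempts-- > 0) {
        if (fflush(fp) == 0) {
            return 0;
        }
        if (errno != EINTR && errno != EAGAIN) {
            break; // not obviously transient; give up
        }
    }
    return EOF;
}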

The real point here is that rundown protocols don’t get enough attention.  I’m sure that someone thought that the design for close(2) was great.  There’s tons and tons of information about what errors you can get back!  But, in essence, its contract isn’t very usable.  The C run-time function fclose() has it even worse, since nobody trained people to call fflush() if they really want their data written to disk.

The key that should be showing up is this:

Code that is executed in resource-deallocation or invariant-restoration paths is very special.  How special?  I’m still not sure; we’re taking this journey of discovery together.  But you can be sure that nobody’s going to write calls to close functions in a loop hoping that they’ll someday succeed, so I’m sure that returning errors is useless.  Performing operations which may fail due to a transient lack of resources is very, very problematic.

Tomorrow we’ll hop back over to invariant restoration and see a bunch of the same problems.   Later we’ll journey over to the Promised Land, where all side effects are reversible… until committed.