What's wrong with this code, part 6 - the answers

In yesterdays post I presented a trace log writer that had a subtle bug.

As I mentioned, the problem had nothing to do with the code, but instead, the problem had to do with the directory in which the trace log file was written.

My second hint to the problem was that only symptom that the person who wrote the code saw was that the trace logger would only ever write one line to the file, and then only after the file was deleted.

And it turns out that that’s key to the problem.  Consider a folder with an ACL that allows the user to create files, but not write to them.  This isn’t as strange as you might imagine, it’s a common pattern to grant users the ability to create files in a directory but not to modify them once they’ve been created.  As a simple example, consider your review – you should be allowed to post your review to a shared folder, but once the review’s been posted, it’s a bad idea for you to be able to modify the document.  On the other hand, you do want to be able to read that file after it’s been posted.

When you have a directory that is set up in this fashion, the use can create files (they have the FILE_CREATE_FILE access right to the directory), but they can’t modify the files once they’re created (the OBJECT_INHERIT_ACE | INHERIT_ONLY ACE on the directory only grants GENERIC_READ access to the file).

The thing that’s confusing about this is that the user was allowed to write to the file when it was created, but not when they opened the file subsequently. 

The reason for this is that in NT, a user is granted full control over an object that they create, but once the object has been created, the ACLs on that object apply.  So when the LogMessage API first created the file, it was granted full access to the file.  And the user was able to write their data to the file.  Once they closed the file, however, their full access to the file reverted back to the ACL on the file, so for the second write, the call to CreateFile() failed.

It turns out that this behavior has the potential to hit a large set of applications.  There is a very common design pattern intended to protect an application’s data files from data corruption when saving a file.  This design pattern works by first creating a temporary file, writing the contents of the document to the temporary file, and if the temporary file was written successfully, it deletes the old file and renames the new file over the old file.  It then sets the security descriptor on the renamed file to match the original security descriptor on the file.  The problem is that before the rename happened, the application closed the temporary file.   As a result, the ACL on the newly renamed file is the default ACL for the directory, which may not allow the user to modify the security descriptor on the file.

So what WAS wrong with the code?  Well, the problem is in the call to CreateFile:

    traceLogHandle = CreateFile(TRACELOG_FILE_NAME, FILE_APPEND_DATA, FILE_SHARE_READ, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

The 4th parameter to the CreateFile API is the SECURITY_ATTRIBUTES structure that is to be applied to the new file.  That structure contains inheritance information about the file, but it also the security descriptor to apply to the new file.  In this case, either the TRACELOG_FILE_NAME should be set to a directory that the developer knows grants the user access to the file, or the developer should specify an ACL for the newly created file that grants access to the file.

Oh, and before anyone asks why NT did such a silly thing as granting the user full control over a newly created resource, consider the save your review to the shared folder example above.  If the ACL on the folder applied, then you’d never be able to write the contents of the file in the first place.

This is a general principal of all objects in NT, in case it wasn’t obvious.  It applies to all kinds of handles, from file handles to event handles, to registry handles. I’ve debugged dozens of problems in the past related to this problem – someone creates an object, sets some things on that object, closes the object and later attempts to modify the object and the modification fails.  Just about every time I’ve ever seen it, the problem has been related to the ACL on the object.

 So Kudo’s and Mea Culpa’s:  Once again, Mike Dimmick was the first person to catch on to what I was trying to demonstrate.  He framed it in terms of a limited access service account, but the principal is the same – you can create an object but can’t re-open it for the same access after it’s inherited its ACL from the parent.

And as usual, there were unintentional bugs in the code.  The biggest one was caught by Skywing; he correctly caught the fact that the EnterCriticalSection should be protected with a try/except since it can throw (EDIT: THIS IS NOT TRUE ON WINDOWS XP AND BEYOND, SEE BELOW).  Several people (Keith Moore, Sayavanam, Dave M, caught issues with my use of StringCbVPrintfEx (not checking return codes, not explicitly using the A version of the API.  And Niclas Lindgren had a fascinating series of posts about exception safety in applications (which I’ll address in the near future in a separate post).

Normon Diamond also caught the fact that the cleanup code checked for the handle not being null, but INVALID_HANDLE_VALUE is -1. 

Edit: Pavel Lebedinsky points out in the comments below that on XP and beyond (W2K3, Longhorn) EnterCriticalSection will NOT raise exceptions.  So Skywing's point is actually incorrect.