When can GetSecurityInfo API return ERROR_INSUFFICIENT_BUFFER?

A customer reported that under stress testing, they found that when they call the Get­Security­Info function to get the DACL of a job object, the call randomly returns ERROR_INSUFFICIENT_BUFFER. They can't find a pattern to the failure, and since the caller doesn't pass a buffer, it's not clear what buffer was insufficient. This happens even when the system is not under memory pressure, so it's not that the program itself was running out of memory.

What's going on is a race condition called out in the documentation:

This function does not handle race conditions. If your thread calls this function at the approximate time that another thread changes the object's security descriptor, then this function could fail.

Basically, what happens is that internally, the function asks for the size of the security info, allocates the memory, and then asks for the buffer to be filled in. If the security information increases in size between the first and third steps, then the buffer passed in the third step is insufficient, and that's the error that is returned.

It's apparent from the fact that this race condition is called out that the function doesn't handle the TOCTTOU error and merely propagates the error back to you, making it up to you to retry (if that's what you want).

Personally, I think this is a flaw in the implementation.

Comments (18)
  1. Kevin Fee says:

    It’s good that it’s called out in the documentation, and I agree with you that this is a flaw in the implementation, but if the implementation isn’t going to be fixed, the documentation should be more explicit as to what that means and how to handle the error. It’s a bit opaque as it is. Maybe they should link to this blog post.

  2. in whose implementation? Win API or customer app?

    1. From context, I would assume the Win32 API (which makes sense; if your underlying implementation has a race condition, you should try to handle it internally instead of forcing consumers to handle it unless there’s a strong reason to do so).

      1. Kevin says:

        Of course, Unix does this kind of thing too. See EINTR: for just about every syscall other than sleep(2), select(2), pause(2), and their relatives, the Right Thing is almost always a silent retry. In a mad anomaly, you can tell the implementation to retry some of these syscalls using SA_RESTART, but this must be done on every signal handler, and not at the call site. So it’s broken in much the same way as the O_CLOEXEC flag (i.e. libraries may fail to set it and shoot your program in the foot).

        1. Joshua says:

          Turns out the right thing isn’t a silent retry. You need to check your sigXXX handler global variables before retry.

          1. Kevin says:

            Eww, you use global variables? Just open a pipe(2) and stick it in your select(2) call in the main event loop. Then you write to the pipe when you handle a signal and process it like any other file descriptor event. Look Ma, no reentrancy issues!

            (Or you use sigprocmask(2)/pselect(2) to implement substantially the same pattern, but then the problem with EINTR does not arise.)

          2. Joshua says:

            @Kevin: I would prefer to use sigselect() myself but library authors often have to deal with code that does it the old way. :(

  3. Ken Hagan says:

    There’s an inherent race condition in any API that returns a value. If you’ve got another thread modifying the thing behind your back then the results might be invalid by the time the API actually returns to user-space. You’d need some kind of transaction mechanism to get around that, which is a serious step up in complexity.

    If you cannot protect yourself against the inherent race (because you don’t actually control the other threads) then being able to protect yourself from the more obvious ERROR_INSUFFICIENT_BUFFER is just a false sense of security. Conversely, if you can protect yourself against the inherent race (by coming to some synchronisation agreement with the other threads) then you won’t get the buffer error either.

    On balance, then, I think I’m with whoever wrote the existing implementation.

    1. DWalker07 says:

      A race condition is one thing; getting an Error_Insufficient_Buffer is a totally different thing.

      Yes, when you check (for example) if a file exists, and then you do something like read that file, it certainly could have been deleted after you checked but before you tried to read it. Also, the computer could crash after you check one thing and before you do something else.

      This does not mean that nothing is worth doing. But a better implementation might avoid the buffer error being returned to the caller, who can’t do anything about it (except retry).

    2. morlamweb says:

      Getting stale information from an API is not what I consider to be the issue here. The issue is that the API returns an “insufficient buffer” unpredictably, and expects the calling code to handle it (as though userland code can do anything to fix an API’s internal buffers!). I’d rather see the API handle this error internally; that is, if the buffer is insufficient, then increase the buffer size and re-query the DACL. Repeat until a result can be sent back to the calling code.

      1. Alex Cohn says:

        This won’t be a good decision. The (implicit) contract is that the API returns immediately. Well, immediate does not mean 3 CPU cycles, but still, I would have a problem with an API that randomly takes x10 longer, without “a pattern to the failure”.

        The documentation is “almost there”. At least, it should have explicitly mentioned ERROR_INSUFFICIENT_BUFFER as what could happen under the race condition. In a better world, the API could return a more relevant error code.

        1. laonianren says:

          This was my first reaction. You don’t want your low-level APIs randomly taking a much longer time. But I changed my mind for two reasons.

          First, this isn’t the low-level API. It’s a wrapper around GetKernelObjectSecurity, GetUserObjectSecurity, GetFileSecurity, etc.

          Second, calling GetSecurityInfo in a loop when it returns ERROR_INSUFFICIENT_BUFFER may never terminate. If the security descriptor is rapidly changing then the call to GetSecurityInfo may always coincide with an increase in size in the security descriptor. Sure, this is unlikely, but security holes are made of unlikelihoods. In other words, retrying inside the API is the only way to make it reliable.

          1. Alex Cohn says:

            But the system API will by unlucky chance get locked forever, unless it makes something sophisticated as locking the security descriptor. Which may have its nasty side-effects, too. If your program cannot control a race condition between two threads, where one perpetually changes security info, and the other perpetually tries to read it – maybe you should return to the drawing board and redesign your app. But in conventional situations I would prefer to deal once in a while with the need to retry GetSecurityInfo() rather than take care of never calling this function on UI thread.

          2. laonianren says:

            “But the system API will by unlucky chance get locked forever”

            The problem isn’t locking, it is the security descriptor increasing in size. This wouldn’t be a problem for the proposed design because the size of the buffer it allocates would get larger after each failure and the size of a security descriptor is bounded.

            The current design has no memory of its previous failures so it can fail in the same way endlessly.

          3. Alex Cohn says:

            Here is an easy scenario where the API will never return: imagine thread A that switches security object between 128 bytes and 256 bytes. In thread B, GetSecurityInfo function reads the size, allocates 128 bytes, tries to read 256 bytes, fails, checks the size again, and finds 128 bytes again.

            If the API could lock the object, the vicious loop would be stopped, but we don’t want GetSecurityInfo to be intrusive.

          4. laonianren says:

            Yes, this is exactly the point. GetSecurityInfo has this problem. It is a fundamental flaw in its design.

            But it wouldn’t have this problem if it looped internally. It would call GetKernelObjectSecurity as it does now. GetKernelObjectSecurity would fail because the security descriptor had increased in size from 128 to 256 but, crucially, GetKernelObjectSecurity would also return the new required buffer size of 256. So the next call would succeed. (Or the security descriptor size might have increased again but, as I said before, this must terminate eventually because is it bounded.)

          5. Alex Cohn says:

            Thanks, now I got your point.

Comments are closed.

Skip to main content