SafeHandle: A Reliability Case Study [Brian Grunkemeyer]

SafeHandle is the best way to represent handles in managed code today.  For a high-level overview of what SafeHandle does, Ravi posted a writeup on the BCL blog titled SafeHandles: the best v2.0 feature of the .NET Framework.  However, if anyone wants to deeply understand the type, or why we were forced to provide such a useful construct, here’s a detailed writeup on SafeHandle.  Some portion of this text should appear in the .NET Framework Standard Library Annotated Reference, Volume 2 when it comes out; look at the IntPtr class.

 

IntPtr is of course the bare minimum type you need to represent handles in PInvoke calls because it is the correct size to represent a handle on all platforms, but it isn’t what you want for a number of subtle reasons. We came up with two hacky versions of handle wrappers in our first version (HandleRef and the not-publicly exposed HandleProtector class), but they were horribly incomplete and limited. I’ve long wanted a formal OS Handle type of some sort, and we finally designed one in our version 2 release called SafeHandle.  SafeHandle is an enormous win for library writers and end users, but understanding why requires understanding the flaws with simply using IntPtr.

 

Race condition with your own finalizer

Using an IntPtr to represent a handle opens the door to a subtle race condition that can occur when you have a type that uses a handle and provides a finalizer. If you have a method that uses the handle in a PInvoke call and never references this after the PInvoke call, then the this pointer may be considered dead by our GC. If a garbage collection occurs while you are blocked in that PInvoke call (such as a call to ReadFile on a socket or a file), the GC could detect the object was dead, then run the finalizer on the finalizer thread. You’ll get unexpected results if your handle is closed while you’re also trying to use it at the same time, and these races will only get worse if we add multiple finalizer threads.

 

The fact that the CLR may finalize an object while another thread is running code in an instance method is extremely surprising to most people, and understandably so.  The CLR does this because it is very aggressive about object lifetimes.  Keeping objects around in the GC heap longer than strictly necessary uses more memory, so the JIT reports to the CLR’s GC the minimal lifetime necessary for each object reference.  In the case of finalizable objects making P/Invoke calls, this is a little unfortunate, and prompted us to add GC.KeepAlive.  From a platform perspective our performance should be better because of this policy, even though it does lead to a rather obscure race condition.

 

Here’s the canonical example showing where someone needs a call to GC.KeepAlive:

 

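    class Foo {
        Stream stream = ...;

        // C# finalizer syntax; the compiler emits this as an override of Object.Finalize.
        ~Foo() { stream.Close(); }

        // After the stream field is loaded, 'this' is never used again in this method.
        void Problem() { stream.MethodThatSpansGCs(); }

        // Main holds no reference to the Foo instance once Problem has been called.
        static void Main() { new Foo().Problem(); }
    }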

 

Here, MethodThatSpansGCs is any managed method or P/Invoke call that can allow a GC to occur.  For all intents and purposes, consider this to be any managed method.  The method Main simply allocates a Foo, then calls a method on it, but does not use a reference to that Foo after calling the method Problem().  So once we have made the method call (a call or callvirt IL instruction), Main no longer has a live reference to an instance of Foo.  In the method Problem, the this pointer is used to load the stream field, but then the this pointer is no longer used, so our JIT reports that it is no longer live after loading stream.  These were the only two references to that instance of Foo in the program, so Foo can be finalized after the next GC.

 

The remaining piece of the puzzle to cause a problem is that we need to do something that can trigger a GC, and we need an observable side effect that can prevent this operation from succeeding.  In theory, most managed code can be suspended for garbage collection between any two machine instructions.  In practice our x86 JIT does not do this for all method bodies, but it can theoretically happen.  However, if stream’s MethodThatSpansGCs() is a P/Invoke call or allocates memory, there’s a good chance we’ll allocate, which can cause the GC to perform a collection with some low probability.

 

If the GC does run while in the MethodThatSpansGCs code, then Foo will be reported as not live, and will be finalized.  The finalizer thread (of which there is currently one, but we may use multiple in the future) will then run the finalizers for all finalizable, dead objects.  It could run Foo’s Finalize method, which closes the stream.  After this happens, MethodThatSpansGCs may not work right at all, since it is now running against a closed stream.  Notice that to observe this problem, you must have the GC collect objects at a certain point, and you must have the finalizer thread run your finalizer within this window as well.  This is rare in practice, but will show up with sufficient stress testing.  (Internally we’ve added something called GCStress to the CLR to help find some of these types of problems.  While primarily aimed at finding problems in our “manually managed” C++ code within the CLR itself, it can be useful to help trigger GCs in unexpected places.)

 

To work around this problem, you can add a call to GC.KeepAlive(this) in your code after your PInvoke call, or you could use HandleRef to wrap your handle and the this pointer.  In V1 and V1.1 of the .NET Framework, the most general solution was simply to use GC.KeepAlive in the Problem method, like this:

 

        void Problem() {
            stream.MethodThatSpansGCs();
            GC.KeepAlive(this);
        }

 

Here, GC.KeepAlive itself does nothing.  Instead, passing the this pointer to the method has the side effect of using that object reference in the code after completing stream’s MethodThatSpansGCs method.  This side effect means the lifetime of the object reference is extended to the call to GC.KeepAlive, so the GC will believe the Foo instance is live until after MethodThatSpansGCs has completed.  It is this side effect of extending the object lifetime that prevents the GC from running the finalizer, thus avoiding the race condition.  The method body for GC.KeepAlive is actually empty, but it needs to be flagged as not inlinable to ensure that JITs can’t detect that the object reference is dead.

 

This is a lot of subtlety for users to deal with.  To solve this problem, we added the HandleRef type in V1 as a convenient way of passing an object reference and a handle to a P/Invoke method.  The P/Invoke marshaling layer is responsible for keeping that object reference alive for the duration of the method call.  But using HandleRef requires that instead of passing in a handle, you create a HandleRef from your handle and your object reference (commonly your this pointer) everywhere you use the handle.  I personally consider it a hack, even if it is marginally easier to use than GC.KeepAlive.
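A minimal sketch of the HandleRef pattern might look like the following.  NativeFile and its field are hypothetical names invented for illustration.  FlushFileBuffers and CloseHandle are real Win32 functions in kernel32.dll, so the P/Invoke calls only work on Windows, and the handle is assumed to come from an earlier native call such as CreateFile.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper type, for illustration only.
class NativeFile
{
    // Raw OS handle, assumed to have been produced by an earlier native call.
    private IntPtr handle;

    // Declaring the parameter as HandleRef asks the marshaling layer to keep
    // the wrapper object alive until the native call returns.
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FlushFileBuffers(HandleRef hFile);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool CloseHandle(IntPtr hObject);

    public NativeFile(IntPtr existingHandle) { handle = existingHandle; }

    // The finalizer that makes the race possible in the first place.
    ~NativeFile()
    {
        if (handle != IntPtr.Zero) CloseHandle(handle);
    }

    public bool Flush()
    {
        // Pairing 'this' with the handle replaces a trailing GC.KeepAlive(this).
        return FlushFileBuffers(new HandleRef(this, handle));
    }
}
```

The cost is visible even in this small sketch: every call site has to remember to build the HandleRef, which is the part that feels like a hack.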

 

Handle Recycling

Also in version 1, we discovered yet another problem with using IntPtrs as handles.  The problem is a handle recycling attack, where if you have a handle wrapping type (like FileStream) that exposes a Dispose (or Close) method, you can close the handle on one thread while another thread is using the handle.  The key part of the attack is that managed code reads the IntPtr and places it on the stack just before making the native P/Invoke call.  In FileStream, we call Win32’s ReadFile & WriteFile methods from FileStream’s Read and Write methods.  There is a window after we place the handle value on the stack and before we call the OS where the handle can be closed by another thread, so we could call ReadFile or WriteFile with an invalid handle.  More pathologically, another thread could open another handle at the same time, and the OS could assign that new handle the exact value used by our other handle.  Windows does try to recycle handle values aggressively, so if your process closes handle 0xc, the next handle you open is very likely to be 0xc again.

 

This handle recycling attack causes data corruption at a minimum.  If this new handle was opened by a fully trusted caller, it could be referring to a file that isn’t accessible by another partially trusted caller running in the same process.  In this way, this could be an escalation of privilege attack.

 

We discovered this in December of 2001, which was about one month before we shipped our product.  Seeing a scary bug like this crop up so close to shipping is always worrying.  But we ended up coming up with an isolated solution for this problem, and applied it to FileStream.  Our solution was to use a class called HandleProtector to essentially keep a reference count on all threads that are currently using a handle.  If someone attempts to close the handle, we don’t allow it to be closed until the last thread using the handle is finished.  In code, this involved sticking in a lot of Increment & Decrement calls around every P/Invoke call.
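The reference-counting idea can be sketched roughly as follows.  This is not the HandleProtector source, which was never public.  CountedHandle, TryIncrement, Decrement and the closeHandle delegate are all hypothetical names, and a simple lock stands in for whatever interlocked operations the real class used.

```csharp
using System;

// Hypothetical illustration of deferring a close until the last user is done.
sealed class CountedHandle
{
    private readonly IntPtr handle;
    private readonly Action<IntPtr> closeHandle; // stand-in for the native close call
    private readonly object sync = new object();
    private int users;           // threads currently using the handle
    private bool closeRequested; // someone has asked for the handle to be closed
    private bool closed;         // the close has actually been performed

    public CountedHandle(IntPtr handle, Action<IntPtr> closeHandle)
    {
        this.handle = handle;
        this.closeHandle = closeHandle;
    }

    // Call before each native call; fails once a close has been requested.
    public bool TryIncrement(out IntPtr h)
    {
        lock (sync)
        {
            if (closeRequested) { h = IntPtr.Zero; return false; }
            users++;
            h = handle;
            return true;
        }
    }

    // Call after each native call; performs a pending close if this was the last user.
    public void Decrement()
    {
        bool doClose;
        lock (sync)
        {
            users--;
            doClose = closeRequested && users == 0 && !closed;
            if (doClose) closed = true;
        }
        if (doClose) closeHandle(handle);
    }

    // Requests a close; the handle is only closed once no thread is using it.
    public void Close()
    {
        bool doClose;
        lock (sync)
        {
            closeRequested = true;
            doClose = users == 0 && !closed;
            if (doClose) closed = true;
        }
        if (doClose) closeHandle(handle);
    }
}
```

Because the handle value cannot be closed while a caller is between TryIncrement and Decrement, the OS cannot hand the same value out again during that window, which is what defeats the recycling attack.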

 

SafeHandle

In version 2, I pushed us to invent the SafeHandle class, which helps solve these two problems and others related to our reliability concerns.  SafeHandle is integrated well into P/Invoke, so every time you use the handle, we have the ability to increment & decrement a counter to track the number of threads using the handle.  SafeHandle has a finalizer on it to ensure that the handle is closed.  By creating a separate finalizable object that represents just a handle itself, most classes that wrap handles no longer need to provide finalizers themselves.  By removing that finalizer, you no longer have the race with your own finalizer that requires calls to GC.KeepAlive.  Also as a side effect of removing the finalizer from a handle wrapping class, you can often keep fewer GC objects alive while waiting for your finalizer to run.  This reduced object graph promotion is a secondary effect, but may help the GC reclaim memory much faster, especially if your object holds on to 4 KB buffers or large graphs of managed objects.
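Here is a small sketch of a SafeHandle-derived type, written under a few assumptions.  SafeNativeBufferHandle and SafeHandleDemo are hypothetical names, and the handle wraps memory from Marshal.AllocHGlobal rather than a real OS handle so that the sketch does not depend on any platform-specific P/Invoke.  The demo calls DangerousAddRef and DangerousRelease by hand to imitate, roughly, what the marshaling layer does around a native call.

```csharp
using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Hypothetical handle type, for illustration only.
sealed class SafeNativeBufferHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    // Records whether ReleaseHandle has run, purely so the demo can observe it.
    public bool Released { get; private set; }

    public SafeNativeBufferHandle(int size) : base(true) // true: this instance owns the handle
    {
        SetHandle(Marshal.AllocHGlobal(size));
    }

    // Called once the handle is disposed or finalized and no references remain.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        Released = true;
        return true;
    }
}

static class SafeHandleDemo
{
    // Returns true if the release was deferred until the outstanding reference was dropped.
    public static bool CloseIsDeferredWhileInUse()
    {
        var h = new SafeNativeBufferHandle(16);

        bool addedRef = false;
        h.DangerousAddRef(ref addedRef);    // a "native call" starts using the handle

        h.Dispose();                        // meanwhile, someone closes the handle
        bool releasedEarly = h.Released;    // the buffer must still be valid here

        if (addedRef) h.DangerousRelease(); // the "native call" returns

        return !releasedEarly && h.Released;
    }
}
```

Note that the wrapper class that would hold a SafeNativeBufferHandle field needs no finalizer of its own, which is where the GC.KeepAlive race goes away.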

 

SafeHandle has other advantages related to our reliability work as well.  By using a derived type of SafeHandle in your P/Invoke prototypes, you get type safety among handles.  You can’t easily pass a handle to a semaphore to a method that takes a SafeFileHandle.  But the most important advantage is that SafeHandle’s finalizer is guaranteed to run during AppDomain unloads, even in cases where a managed host may be assuming that the managed code within the process has corrupted state.  This is called a critical finalizer, a behavior your class can inherit by deriving from CriticalFinalizerObject.  Critical finalizers also have a weak ordering guarantee: if a normal finalizable object and a critical finalizable object become unreachable at the same time, then the normal object’s finalizer is run first.
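To make the type-safety point concrete, here is a sketch of two P/Invoke prototypes.  The NativeMethods class name is just a common convention and is an assumption here.  ReadFile and ReleaseSemaphore are real kernel32.dll functions, so calling them only works on Windows, although the declarations themselves compile anywhere.

```csharp
using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class NativeMethods
{
    // Accepts only file handles.
    [DllImport("kernel32.dll", SetLastError = true)]
    internal static extern bool ReadFile(
        SafeFileHandle hFile, byte[] buffer, uint bytesToRead,
        out uint bytesRead, IntPtr overlapped);

    // Accepts only wait handles (semaphores, events, mutexes).
    [DllImport("kernel32.dll", SetLastError = true)]
    internal static extern bool ReleaseSemaphore(
        SafeWaitHandle hSemaphore, int releaseCount, out int previousCount);
}
```

With IntPtr parameters, passing a semaphore handle to ReadFile would compile and fail only at run time.  With these prototypes, passing a SafeWaitHandle where a SafeFileHandle is expected is rejected by the compiler.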

 

I insisted on this weak ordering for critical finalizers for backwards compatibility reasons with FileStream.  In V1, if someone wrote data to a FileStream but never called Close, FileStream’s finalizer would get around to writing out the buffered data in FileStream, then closing the handle.  (Not calling Close like this is very questionable behavior, but we added some code to protect poorly written applications from losing data in some of these scenarios.  At the time we designed SafeHandle, backwards compatibility with even potentially broken code seemed very important.)  If we moved the handle from being a field in FileStream to being another class with its own finalizer, we couldn’t guarantee that FileStream’s finalizer could write any buffered data to disk, because the handle’s finalizer might run first.  So we needed a special ordering guarantee with critical finalizers.  Our GC architect wasn’t about to even contemplate a strong ordering guarantee, where object A’s finalizer will always run before object B’s finalizer for a reason such as A holding a reference to B.  That idea would hurt finalizer performance, possibly limit us to only ever using one finalizer thread, and isn’t even the right guarantee for code, since sometimes A might contain B, and other times A may have a back pointer to its container, B.  So we ended up with the weak ordering invariant I stated above.

 

Lastly, SafeHandle allows you to specify whether the handle is “owned” by this SafeHandle instance, or by something else.  This allows you to write code that consumes a SafeHandle from someone else, and you can close the SafeHandle without having to worry about whether you really control the lifetime of the handle yourself.  That decision should have been made by your caller.
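The ownership flag can be illustrated with a small sketch.  SafeBorrowableBufferHandle and OwnershipDemo are hypothetical names, and as an assumption made for portability the handle is a block of memory from Marshal.AllocHGlobal rather than a real OS handle.

```csharp
using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Hypothetical handle type that can either own or merely borrow a native buffer.
sealed class SafeBorrowableBufferHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    // Counts how many times the underlying buffer was actually freed.
    public static int ReleaseCount;

    public SafeBorrowableBufferHandle(IntPtr existing, bool ownsHandle) : base(ownsHandle)
    {
        SetHandle(existing);
    }

    // Only invoked for instances constructed with ownsHandle == true.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        ReleaseCount++;
        return true;
    }
}

static class OwnershipDemo
{
    // Returns true if only the owning instance freed the buffer.
    public static bool OnlyOwnerReleases()
    {
        SafeBorrowableBufferHandle.ReleaseCount = 0;
        IntPtr raw = Marshal.AllocHGlobal(16);

        var owner = new SafeBorrowableBufferHandle(raw, true);
        var borrower = new SafeBorrowableBufferHandle(raw, false);

        borrower.Dispose(); // safe to call: a non-owning instance frees nothing
        int afterBorrower = SafeBorrowableBufferHandle.ReleaseCount;

        owner.Dispose();    // the owner performs the one real release
        int afterOwner = SafeBorrowableBufferHandle.ReleaseCount;

        return afterBorrower == 0 && afterOwner == 1;
    }
}
```

The consuming code disposes the handle the same way in both cases; whoever constructed the SafeHandle decided whether that dispose actually frees anything.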

 

Update: I wrote a blog entry on Constrained Execution Regions, another very useful reliability primitive.  Since a poster asked about it in a comment, some people may want to read this.

 

You can get a fuller appreciation for SafeHandle by understanding our reliability story for managed code, which also should appear in the .NET Framework Standard Library Annotated Reference, Volume 2, for the AppDomain class.