Constrained Execution Regions and other errata [Brian Grunkemeyer]

A customer recently asked a good question about some of our new reliability features in Whidbey:

There are calls to Thread.BeginCriticalRegion() inside the Hashtable class. Can you clarify why they are used and how this is different from RuntimeHelpers.PrepareConstrainedRegions() and try/finally block ?

There are two very different concepts, though they do have confusingly similar names.  The first is the extremely loose concept of a critical region, the second is a reliability primitive called a constrained execution region.  Both are interesting from a reliability perspective.

Escalation Policy and Hosting

If you've already read Chris Brumme's blog entry on hosting (and understood it!), you can skip this section.

First, I must summarize our escalation policy for a host like SQL Server.  SQL Server will host managed stored procedures in-process for high performance, but also wants to be sure there will be no reliability or scalability problems with the managed code they run.  To enable this, the SQL Server team has a somewhat restrictive programming model that prevents stored procedures from launching new threads or using static variables.  (Some of this depends on exactly which of the three permission buckets your code is executed in). 

Now consider the effect of an async exception, such as an OutOfMemoryException in the middle of some managed method.  The CLR may throw asynchronous exceptions from a surprising number of places (especially ThreadAbortException and StackOverflowException), which makes it very difficult to predict or work around.  ThreadAbortException (in one form or another) can appear between just about any two machine instructions in your code, and this truly unpredictable nature is almost impossible to deal with.  In effect, we believe writing reliable backout code to handle an exception anywhere is nearly impossible - instead we invented other tools for reliability purposes.  However, SQL Server has a few advantages that make these problems easier to deal with, leading to a mitigation strategy we call an escalation policy.

SQL Server is a transacted environment, and their transaction tracking is all done in native code, so they can always tell whether a transaction must be rolled back.  This helps because it means if we detect a failure, we can always roll back to a known good state and start again.  The restrictive programming model is great, because it somewhat limits the number of ways an application can fail.  But it isn't sufficient.  Consider code that is modifying a collection.  This code could fail due to an async exception in the middle of its operation, leaving the collection in a partially modified condition.  This corruption (or lack of consistency) may mean any further use of this collection will lead to crashes, data corruption, or other combinations of bad behavior.  If this collection was only used by one thread, this isn't a major problem - we'll abort that thread and the collection will go away during the next GC, without any other threads being able to access it. 

However, if this collection was shared between threads, this corruption is now visible to other threads within the process.  Corrupting shared state is very bad - not only do we have the possibility that a single thread is corrupt, but all threads within that appdomain may now get corrupt data.  We need a way to contain this corruption, and to mitigate its effects.  This is where we use application domains (AppDomains for short).

The core idea behind our escalation policy is if we detect that we might be corrupting shared state in an appdomain when throwing an async exception, we will escalate that async exception to an appdomain unload.  (We have some other notions of rude thread aborts and rude appdomain unloads, but they're not interesting for this part of the discussion.)  This way, because we're running in a transacted environment, we can simply kill all our threads to prevent forward progress on potentially corrupt state, and let SQL Server roll back any transactions that were in progress.  This way, we mitigate the corruption of shared state.

The only question then is how to detect corruption of shared state.  We don't have a perfect rule for this, but we came up with a good heuristic that may include some false positives.  The natural thing to do when editing shared state in a non-atomic fashion is to take a lock.  So if we ever throw an async exception while a thread holds a lock, we pessimistically assume that we have corrupted shared state, so we escalate that async exception to an appdomain unload.

On a side note, I use this as an interview question, in large part because there's almost no way to study for something like this, short of reading many specs, blogs, or patents describing how technologies work, so most people start out on an equally poor grounding to tackle this problem.  I've been told I can be cruel.

On a separate note, SQL Server also wants to control memory allocation for their process - they want to own all the available physical memory on the machine, but not a page more.  They don't want to page to disk for performance reasons.  So all allocations pass through the host (or at least in ~16 MB segments). 

Critical regions

Critical regions don't really do much.  They have two effects, and they require a host for the CLR (such as SQL Server) to observe, as they are important for a host's policy for handling async exceptions (which may lead to corruption of shared state). 

The first effect for critical regions is a way of informing the CLR that a lock is being held by a block of managed code.  This is required for all managed locks that the CLR doesn't explicitly know about (ie, any time you write your own spinlock or other synchronization primitive instead of using one built into the Base Class Library).  The CLR needs to know when a thread is holding a lock to determine whether the thread is holding shared state, so a host can decide whether to escalate async exceptions to appdomain unloads.  So if you write your own lock, please use Thread.BeginCriticalRegion when you acquire the lock, and EndCriticalRegion when you release the lock.

The second effect is that memory allocations within that block of code are marked as appdomain-critical.  When the GC needs more memory and the CLR is hosted, the host can be responsible for handing memory to the GC.  When asking for memory, the host receives a hint about how important that allocation is, and allocations from within a critical region are marked as appdomain critical, meaning the appdomain will most likely have to be torn down if this allocation fails.  The host can then work harder to satisfy this memory allocation (since tearing down an appdomain represents potentially a lot of lost work, or at least significantly more than when a single thread is aborted), but of course, this is no guarantee that the host is able to or will choose to provide memory to the GC.

Note that critical regions may be poorly named, because they don't make any guarantees that your code will execute.  In fact, we only store this information as a ref count on the thread, to let us know whether the thread holds any locks.

Constrained Execution Regions

After reading the section on how a host can abort threads, you may start to think that you as a library writer need a way to combat the host.  SQL Server depends on the ability to recycle appdomains, unloading them then restarting them.  This appdomain recycling must be done cleanly, without process-wide state corruption or leaking memory.  This is very difficult to achieve in the face of async exceptions, especially if the host is actively trying to abort your thread!  So we need a reliability primitive to help us.  Without it, you can't reliably allocate a native handle and store it in a field, since conceptually a rude thread abort can occur between virtually any two machine instructions.

Constrained execution regions (CER's) exist to help a developer write her code to maintain consistency.  The CLR doesn't guarantee that the developer's code is correct, but the CLR does hoist all of the runtime-induced failure points (ie, async exceptions) to either before the code runs, or after it has completed.  Combined with constraints on what the developer can put in a CER, these are a useful way of making strong guarantees about whether your code will execute.  CER's are eagerly prepared, meaning that when we see one, we will eagerly JIT any code found in its statically-discoverable call graph.  If the CLR's host cares about stack overflow, we'll probe for some amount of stack space as well (though perhaps not enough stack space for any arbitrary method*).  We also delay thread aborts until the CER has finished running. 

Developers using CER's must obey strict rules as well - they cannot call arbitrary virtual methods (unless they have been eagerly prepared), they cannot allocate memory (along correct code paths), and CER code should only call methods with suitably strong reliability contracts.  The restrictions on not allocating memory are rather pervasive, when you realize that the CLR allocates memory whenever it takes a lock, during multidimensional (not jagged) array accesses, or whenever your compiler has added a box instruction to your code.  Also, certain P/Invoke marshaling styles require allocations, such as marshaling a String as an ANSI string.

CER's are exposed in three interesting forms in Whidbey:

  • ExecuteCodeWithGuaranteedCleanup, a stack-overflow safe form of a try/finally.
  • A try/finally block preceded immediately by a call to RuntimeHelpers.PrepareConstrainedRegions.  The try block is not constrained, but all catch, finally, and fault blocks for that try are.
  • As a critical finalizer - any subclass of CriticalFinalizerObject has a finalizer that is eagerly prepared before an instance of the object is allocated.
    • A special case is SafeHandle's ReleaseHandle method, a virtual method that is eagerly prepared before the subclass is allocated, and called from SafeHandle's critical finalizer.

An example of where one could use CER's is when making edits to more than one field of a data structure in an atomic fashion. I worked on a ReliableArrayList prototype to help examine some of our reliability primitives (and an earlier, successful experiment demonstrating we lacked a suitable reliability primitive). Here's ReliableArrayList's Insert method:

 

        [ReliabilityContract(Consistency.WillNotCorruptState, CER.MayFail)]
public virtual void Insert(int index, Object value) {
// Note that insertions at the end are legal.
if (index < 0 || index > _size) throw new ArgumentOutOfRangeException("index", Environment.GetResourceString("ArgumentOutOfRange_ArrayListInsert"));
if (_size == _items.Length) EnsureCapacity(_size + 1);

            RuntimeHelpers.PrepareConstrainedRegions();

            try {

            }

            finally {

                // Forward progress

                if (index < _size) {

                    Array.ConstrainedCopy(_items, index, _items, index + 1, _size - index);

                }

                _items[index] = value;

                _size++;

                _version++;

            }

        }

Note that forward progress is made in a finally block here, to guarantee that async exceptions aren't thrown during that block of code.  Also, Array.Copy has a reliability contract saying it may corrupt the array instance, because Array.Copy supports conversions from (perhaps too) many different array types, some of which require an allocation.  I needed a Copy method with a stronger reliability contract, so I made ConstrainedCopy that isn't as flexible.

Here's an example of some reliable backout code that explicitly relies on a virtual method call (IList's RemoveAt), so it eagerly prepares it beforehand.

                RuntimeHelpers.PrepareConstrainedRegions();
                try {
                    // Prepare my backout code
                    MethodInfo m = _list.GetType().GetMethod("RemoveAt", new Type[] { typeof(int) });
                    RuntimeHelpers.PrepareMethod(m.MethodHandle);

                    IEnumerator en = c.GetEnumerator();
                    while(en.MoveNext()) {
                        _list.Insert(index++, en.Current);
                        // Assuming that these lines aren't reordered.
                        numAdded++;
                    }
                    _version++;
                }
                catch(Exception) {
                    // Reliable backout code
                    while(numAdded > 0) {
                        _list.RemoveAt(index--);
                        numAdded--;
                    }
                    throw;
                }

The above example should probably be using ExecuteCodeWithGuaranteedCleanup instead, in case a stack overflow occurs in the try block.

Ravi has some more examples of CER's in his blog entry discussing SafeHandles:  https://blogs.msdn.com/bclteam/archive/2005/03/15/396335.aspx

Reliability Contracts

We added in reliability contracts to describe to a method's caller whether the method is expected to succeed when called from a CER (for some method-appropriate definition of success), and what level of consistency is guaranteed when this method is called from within a CER.  One of the problems we dealt with when designing CER's is that many constrained execution regions may want to call methods that validate their input and throw exceptions, like the above example that calls Array's ConstrainedCopy method.

Who should use a Constrained Execution Region?

The above examples are about the closest thing to rocket science that an end user of the CLR may run into at this point (short of some COM interop problems).  Constrained execution regions do a great job at hoisting CLR-induced failures, and SafeHandle is a great reliability primitive for cleaning up OS resources.  However, there is no silver bullet for reliability problems - all we can do is provide tools to help move failure points to more predictable places.  We expect that typically only library authors will need to know about them in this release.  In a future release, ideally we could have better language integration, but it isn't clear that it is worth the effort.

Any place where you edit process-wide state, you need a CER.  You also need a CER if you're allocating an OS resource but aren't using SafeHandle to guarantee that the resource will be freed.  There are a few relatively obscure P/Invoke cases where the marshaling layer can't deal with SafeHandles, such as a struct containing an "out" handle that must be marshaled back into managed code.  Other than that, hopefully most users won't have a need for this level of reliability.  The restrictions on what you can call and sometimes the performance impact of using a CER are significant barriers that should cause most users to file this under "nice things to know, but I hope I never need". 

After this lengthy diversion - What about Hashtable?

Now that I've explained what constrained execution regions are and why we have Thread.BeginCriticalRegion, I can answer the original question.  Why do we use them in Hashtable?  This is to satisfy a perhaps poorly thought through threading guarantee for Hashtable.  In version one, we thought we could guarantee that Hashtable was safe for multiple readers and one concurrent writer, but not truly thread-safe (ie, multiple writers were NOT safe without taking a lock).  However, we didn't do a good job with this in V1.  We eventually added some code to Hashtable to set a bit saying whether a writer was active, relying on the CLR's virtual keyword (which is different than C++'s virtual keyword) to ensure that we'd eventually read & write the correct results to memory in the right order.  This code spins on the value of a flag indicating whether a writer is changing the Hashtable, and loops if the Hashtable has changed underneath it.  This isn't a truly great idea - a lock would have fit much better.  But since we are simulating a lock without actually taking a lock, we're using Thread's BeginCriticalRegion and EndCriticalRegion to inform the CLR that we are holding a "lock" here.  This was important, since a thread that was writing to the Hashtable could take the fake lock then get aborted.  We didn't release the "lock" then, causing other threads waiting on the Hashtable to deadlock.  With the BeginCriticalRegion code there, we consider that fake lock a real lock for the intents of escalation policy, so we escalate to an appdomain unload.  Assuming the Hashtable is only used within that appdomain and not cross-process (an excellent assumption since Hashtable is not marshal-by-reference), then we can avoid deadlocks.

You might ask why would anyone jump through hoops like this in the first place.  We developed Hashtable in V1 and realized that lookup performance was one of our key benchmarks for our performance scenarios.  At the time, Monitor.Enter was relatively expensive, even without contention on the lock.  This was fixed in V1.1 with a thin lock implementation, where we upgrade to a fat lock if we have contention, but the simple case is pretty fast.  (Incidentally, this upgrading to a fat lock is why you can't take a lock in a CER - it allocates an internal object called a sync block, and once the contention has passed, we downgrade back to a thin lock.  So you can't "eagerly prepare" a lock in the current version of the CLR.)

There's been some talk about changing this admittedly strange code in Hashtable (also note that Dictionary<K, V> does NOT make the same guarantee).  One person on our team suggested removing it (and with it, the thread safety guarantee that we made but didn't fully deliver on), but that was shot down because it would likely break a lot of code.  I don't know if someone considered turning this into a real lock, using Monitor.Enter (via C#'s lock keyword) on the Hashtable's SyncRoot property.  I'll see if we can firm this up before Whidbey ships.

* If you need to survive stack overflows and call arbitrary user code, look at RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup, which is similar a CER try/finally split out to use delegates.  The cleanup code will run even if there was a stack overflow arbitrarily deep in the try code.  However, the method you use for the cleanup code needs to have the PreprepareMethodAttribute on it, which has the unfortunate side effect that you can't use C#'s anonymous delegate syntax.

You can get a fuller understanding our reliability story for managed code by reading my comments for the AppDomain class in the upcoming book,  .NET Framework Standard Library Annotated Reference, Volume 2.  You may also want to read through Chris Brumme's blog entries on hosting and reliability.