CLR Behavior on OutOfMemoryExceptions [Brian Grunkemeyer]

For out of memory exceptions, keep in mind that we can run out of memory in the native heaps in the process, as well as within the managed heap.  There are at least four interesting causes:

  1. Lack of available pages of memory, due to limited resources on the machine and/or competition between processes.
  2. Internal fragmentation within a process’s address space.
  3. Memory leaks (in managed code, typically objects that remain reachable; see the sketch after this list).
  4. A host aggressively monitoring the amount of available memory, and failing allocations in order to hit some specific memory usage target.
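
To make the third cause concrete, here’s a minimal sketch of what a leak typically looks like in managed code (the class and method names are invented for illustration): not unfreed memory, but objects that stay reachable from a live root, so the garbage collector can never reclaim them.

    using System.Collections.Generic;

    static class RequestLog
    {
        // Rooted by a static field and never trimmed, so every recorded
        // payload stays reachable for the life of the process and the
        // working set grows without bound.
        private static readonly List<byte[]> s_history = new List<byte[]>();

        public static void Record(byte[] payload)
        {
            s_history.Add(payload);   // reachable forever; the GC can't help
        }
    }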

In the first case, you’re likely to have filled up all your physical memory and your swap file.  Your performance probably suffered long before you got here, though, as your machine probably ground to a halt swapping pages to disk.

The second case would be caused by memory fragmentation, and would result from allocation patterns like allocating many objects, freeing every other one, and then allocating something so large that it can’t fit in any of the remaining free blocks.  This can happen in long-running servers.  Fortunately, the managed heap isn’t directly affected by this, because our garbage collector explicitly moves memory around, compacting objects so we get a large block of consecutive free space.  However, the address space of the process can still suffer from fragmentation.  The GC asks the OS for pages in large chunks called segments, and for large allocations I think the segment size is 64 MB.  If the machine has available memory but the process’s address space doesn’t have 64 MB worth of consecutive pages available, the GC won’t be able to allocate memory.
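
To make that allocation pattern concrete, here’s a contrived sketch (the sizes are made up, and since a compacting collector would defragment the heap, treat this as an illustration of the pattern rather than a guaranteed repro):

    class FragmentationSketch
    {
        static void Main()
        {
            // Allocate many 1 MB blocks...
            byte[][] blocks = new byte[256][];
            for (int i = 0; i < blocks.Length; i++)
                blocks[i] = new byte[1024 * 1024];

            // ...then free every other one.  ~128 MB is now "free", but in a
            // non-compacting heap it is scattered across 1 MB gaps.
            for (int i = 0; i < blocks.Length; i += 2)
                blocks[i] = null;

            // A single large request can now fail even though plenty of
            // total memory is free, because no single gap is big enough.
            byte[] big = new byte[128 * 1024 * 1024];
        }
    }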

Now let’s look at how a host (SQL Server, Exchange, ASP.NET, etc.) can contend with other processes, or can aggressively manage its own memory.  The right way to think about this memory contention is in terms of what policies the affected applications have with respect to memory management.  SQL Server in particular operates under the assumption that it owns the entire box.  Beyond that, for performance reasons it tries to use all the physical memory on the machine, and not a page more.  The more data it caches in memory, the better it can perform on later lookups.  This means that if the database starts using too much memory, the server throttles back its own performance, failing allocations earlier than they would normally fail in other applications, all in the name of preventing swapping.  This is a good approach for a server, but it does mean large-scale databases should run on a dedicated server.  Another process with competing memory requirements will likely either hurt your database performance by causing swapping, or make OutOfMemoryExceptions more common, at least for managed code running in the database.

Implicit in all of this is what policy we use when an OutOfMemoryException is thrown.  We looked at what it would take to write reliable code for the CLR, and it turns out that running out of memory is surprisingly more common than in native applications, because allocations aren’t obvious in your source language.  Language constructs like boxing hide allocations from users.  Loading assemblies, jitting code, and operations like multidimensional array accesses and acquiring locks all allocate memory, at least under certain conditions.  For this reason, we concluded that it was nearly impossible to write a very large body of code that can maintain its own consistency in the presence of an OutOfMemoryException.  This is especially true of other asynchronous exceptions, like ThreadAbortException, which SQL Server uses to throttle itself.  As a result, we don’t think it is reasonable, or even possible, to ask users to write an entire application where they handle OutOfMemoryExceptions for every isolated operation and attempt to continue gracefully.
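
To see why allocations aren’t obvious in source code, consider a small sketch like the following (the class and method are invented for illustration); each commented line can allocate, per the list above:

    using System.Collections;

    class HiddenAllocationsSketch
    {
        static void Demo()
        {
            object boxed = 42;        // boxing an int allocates on the GC heap

            ArrayList list = new ArrayList();
            list.Add(123);            // boxes the int: another hidden allocation

            lock (list)               // acquiring a lock can allocate, e.g. to
            {                         // inflate the object's sync block
            }
        }
    }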

Instead, our V2 hosting work for SQL Server was designed to allow SQL to mitigate failures for managed code running inside the database’s process, using appdomain unloading as the mitigation strategy.  Because SQL is a transacted environment, we have a higher-level notion (the database transaction) which will ensure consistent behavior for any database writes.  Consistency of allocated objects and other non-persisted state is managed by figuring out whether a failure could have occurred while editing shared state.  We wanted to distinguish between simply failing to allocate an object used only within the body of a method vs. failing partway through a state change to a global dictionary.  Clearly the first is annoying, but the second could be catastrophic.  As a heuristic for detecting when we’re editing shared state, we look at the thread that hit the OutOfMemoryException and see whether it was holding any locks.  If so, we assume that it was editing shared state, and we escalate the OutOfMemoryException to an appdomain unload.  The database handles all the consistency details for persisted state.  This strategy works as long as appdomain unloading doesn’t leak resources, and we’ve invested in a lot of infrastructure to make that happen (SafeHandle, Constrained Execution Regions, critical finalization, etc.).
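
As one example of that infrastructure, here’s a hedged sketch of the SafeHandle pattern (the P/Invoke target is real, but the wrapper class is invented for illustration).  The release logic runs as critical finalization, which is designed to run even during an appdomain unload, so the native handle isn’t leaked:

    using System;
    using System.Runtime.InteropServices;
    using Microsoft.Win32.SafeHandles;

    sealed class NativeResourceHandle : SafeHandleZeroOrMinusOneIsInvalid
    {
        private NativeResourceHandle() : base(true) { }   // we own the handle

        // Critical finalization: runs even during appdomain unload, and is
        // constrained, so it must not allocate, block, or throw.
        protected override bool ReleaseHandle()
        {
            return CloseHandle(handle);
        }

        [DllImport("kernel32.dll")]
        private static extern bool CloseHandle(IntPtr handle);
    }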

With all this being said, we do have a mitigation technique other than appdomain unloading for out of memory.  We’ve added a MemoryFailPoint class, which attempts to predict whether a memory allocation will fail.  You allocate a MemoryFailPoint for X megabytes of memory, where X is an upper bound on the expected additional working set for processing one request.  You then process the request and call Dispose on the MemoryFailPoint.  If not enough memory is available, the constructor throws an InsufficientMemoryException, a different exception type we created to express the concept of a “soft OOM”.  Apps can use this to throttle their own performance based on available memory.  The exception is thrown before any memory is allocated, at a point where no shared state can have been corrupted, so no escalation policy needs to kick in.  MemoryFailPoint does not reserve memory in the sense of reserving or committing physical pages, so this technique is not iron-clad (you can get into races with other heap allocations in the process).  But it does maintain an internal process-wide reservation count to keep track of all threads using a MemoryFailPoint in the process, and we believe it can reduce the frequency of “hard OOMs” that would require some escalation policy.
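
Here’s a sketch of the intended usage pattern (the 64 MB budget, the class, and the request-processing method are placeholders; only MemoryFailPoint and InsufficientMemoryException are the real types described above):

    using System;
    using System.Runtime;

    class RequestProcessor
    {
        public void HandleRequest(object request)
        {
            try
            {
                // Throws InsufficientMemoryException up front if ~64 MB of
                // additional working set is unlikely to be available.
                using (new MemoryFailPoint(64))
                {
                    ProcessRequest(request);
                }   // Dispose releases the process-wide reservation count
            }
            catch (InsufficientMemoryException)
            {
                // Soft OOM: nothing was corrupted, so throttle or queue the
                // request rather than triggering any escalation policy.
            }
        }

        private void ProcessRequest(object request) { /* real work here */ }
    }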