Know Thine Implicit Allocations

Stephen Toub - MSFT

For .NET 4.5, we’ve invested quite a bit of effort into performance, and in particular into the Task Parallel Library (Joe Hoag wrote a good paper covering some of these improvements).  We focused this effort on TPL because it is a core component used in async programming and at a foundational level by many libraries and applications.  Small tweaks to the performance characteristics of TPL can thus have a noticeable impact on the performance of these higher-level systems.

In particular, one of the areas we focused a lot of attention on is memory utilization: how many objects we allocate, and how big those objects are.  This is primarily due to the subsequent cost of the garbage collection necessary to clean up those objects… the less we allocate, the less often the GC will need to run and the less time it will take to run.  Sometimes these allocations are clearly visible, in that you see a “new” expression in the code.  But often these allocations are hidden.

The Task Parallel Library is written in C#, and C# has some incredibly useful features that help make programming very productive.  The implementation of those features, however, can sometimes result in allocations of which it’s easy to be blissfully unaware.  The developers who wrote the C# compiler have done great work to make the code generated by the compiler efficient, and they’ve implemented a variety of optimizations to help minimize allocations.  Still, if you care about eking out the best possible performance and minimizing allocations in the process, it can be very useful to understand the intricacies of the code the compiler generates so that you can better bend its behavior to your will.

In this post, I’ll highlight some of these behaviors we came across during our performance push for TPL, and I’ll demonstrate how you can work around them to help get the best performance possible.

WARNING: Everything I’m discussing in this post is an implementation detail of one version of the C# compiler (the one in the .NET 4.5 Developer Preview).  It’s certainly possible that these characteristics will change in a future release, as the C# compiler gets more and more optimizations added to it.

Closure and Delegate Allocations

Lambdas and anonymous methods are very powerful features that save the developer from having to write a lot of boilerplate code.  Consider the Framework’s ThreadPool.QueueUserWorkItem method:

public static bool QueueUserWorkItem(WaitCallback callBack);

With lambdas/anonymous methods, I can write code like the following (I’m not recommending this as an approach towards writing to a stream asynchronously, rather just using it as an example of passing state around):

public static void WriteOnPool(Stream stream, byte [] data)
{
    ThreadPool.QueueUserWorkItem(delegate
    {
        stream.Write(data, 0, data.Length);
    });
}

Before lambdas/anonymous methods were introduced, C# developers wanting to call a method that accepted a delegate would typically need to define a class to store their state, add a method to that class to do their processing, and then create a delegate that pointed to that method on that state instance, e.g.

public static void WriteOnPool(Stream stream, byte [] data)
{
    var state = new WriteOnPoolState();
    state.stream = stream;
    state.data = data;
    ThreadPool.QueueUserWorkItem(new WaitCallback(state.Invoke));
}

private sealed class WriteOnPoolState
{
    public Stream stream;
    public byte [] data;

    public void Invoke(object ignored)
    {
        stream.Write(data, 0, data.Length);
    }   
}

Now in C#, when you use an anonymous method or lambda, the compiler actually ends up generating code almost identical to this on your behalf so that you no longer have to do so manually.  Here’s a decompiled version of what gets generated for my previous example that used an anonymous method:

public static void WriteOnPool(Stream stream, byte[] data)
{
    var locals2 = new DisplayClass1();
    locals2.stream = stream;
    locals2.data = data;
    ThreadPool.QueueUserWorkItem(
        new WaitCallback(locals2.<WriteOnPool>b__0));
}

[CompilerGenerated]
private sealed class DisplayClass1
{
    public Stream stream;
    public byte[] data;

    public void <WriteOnPool>b__0(object param0)
    {
        this.stream.Write(this.data, 0, this.data.Length);
    }
}

The important thing to note here is that there are two allocations involved: one for the closure object (in my hand-written version, WriteOnPoolState, and in the compiler-generated version, DisplayClass1), and one for the delegate to the method on the closure.  For a method that simply accepts a delegate, there’s really no way around two allocations: the delegate object needs to be allocated, and it needs to target an object containing the state (assuming that state needs to change for each invocation, this state object also needs to be allocated for each call).

Luckily, many library methods that accept delegates recognize the need to pass state into the method, and thus accept not only a delegate object but also a state object that will be passed to the delegate when it’s invoked.  In fact, ThreadPool has an additional overload of QueueUserWorkItem that accepts both, and the single-parameter overload we’ve been using above just passes in null.  That’s why the WriteOnPoolState.Invoke and DisplayClass1.<WriteOnPool>b__0 methods both accept an object parameter: they need to match the signature of WaitCallback.
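
For reference, the two relevant overloads look roughly like this:

public static bool QueueUserWorkItem(WaitCallback callBack);
public static bool QueueUserWorkItem(WaitCallback callBack, object state);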

This pattern of accepting an object state is useful because it allows us to avoid allocations.  The reason we need to allocate a delegate each time is that its Target needs to point to the right “this” object.  But if there is no “this”, i.e. we’re dealing with a static method, then we can reuse the same delegate object over and over, simply passing in different state as a parameter each time.  To achieve that, we can rewrite our WriteOnPool example as follows:

public static void WriteOnPool(Stream stream, byte [] data)
{
    var state = new WriteOnPoolState();
    state.stream = stream;
    state.data = data;
    ThreadPool.QueueUserWorkItem(s_writeOnPoolBody, state);
}

private static readonly WaitCallback s_writeOnPoolBody = 
    new WaitCallback(WriteOnPoolBody);

private static void WriteOnPoolBody(object state)
{
    var poolState = (WriteOnPoolState)state;
    poolState.stream.Write(poolState.data, 0, poolState.data.Length);
}   

private sealed class WriteOnPoolState
{
    public Stream stream;
    public byte [] data;
}

We’re now still allocating a delegate and a state object, but we allocate the delegate only once and cache it into a static field.  On each call to WriteOnPool, then, we only allocate once, for the WriteOnPoolState object, saving half of the allocations.

As it turns out, the C# compiler has smarts to provide similar savings when using lambdas and anonymous methods.  Let’s say I instead rewrite my implementation as follows, still using a lambda, but now passing in my state via the object state parameter and ensuring that my lambda doesn’t close over anything (I’m also using a Tuple<,> for conciseness to avoid needing to define my own custom state type):

public static void WriteOnPool(Stream stream, byte[] data)
{
    ThreadPool.QueueUserWorkItem(state =>
    {
        var poolState = (Tuple<Stream,byte[]>)state;
        poolState.Item1.Write(
            poolState.Item2, 0, poolState.Item2.Length);
    }, Tuple.Create(stream, data));
}

This now results in the following code from the compiler:

public static void WriteOnPool(Stream stream, byte[] data)
{
    if (CachedAnonymousMethodDelegate1 == null)
        CachedAnonymousMethodDelegate1 = 
            new WaitCallback(Program.<WriteOnPool>b__0);

    ThreadPool.QueueUserWorkItem(
        CachedAnonymousMethodDelegate1, 
        Tuple.Create<Stream, byte[]>(stream, data));
}

[CompilerGenerated]
private static WaitCallback CachedAnonymousMethodDelegate1;

[CompilerGenerated]
private static void <WriteOnPool>b__0(object state)
{
    Tuple<Stream, byte[]> poolState = (Tuple<Stream, byte[]>) state;
    poolState.Item1.Write(poolState.Item2, 0, poolState.Item2.Length);
}

Note that the compiler has done something very similar to what we did manually, creating a static field in which to cache the delegate.  The primary difference is that the compiler lazily instantiates the delegate on first use, rather than in the static constructor as we did via a field initializer.  So, we get to have our cake and eat most of it, too… we still get most of the convenience of lambdas while getting the efficiency benefits of delegate caching.  And, of course, if the state I’m trying to pass in is already a single object, that object can be passed in directly (rather than wrapping it in a new state object), bringing us down to zero additional allocations per call, as in the sketch below.
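
For example, here’s a minimal sketch (using a hypothetical FlushOnPool method) where the only state the callback needs is the stream itself, so it can be passed straight through as the object state and the non-capturing lambda incurs no per-call allocations:

public static void FlushOnPool(Stream stream)
{
    // The lambda captures nothing, so its delegate can be cached by the
    // compiler, and the stream itself is the state object: no closure,
    // no wrapper object, no per-call delegate allocation.
    ThreadPool.QueueUserWorkItem(state => ((Stream)state).Flush(), stream);
}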

When Automatic Caching Fails

This is all well and good, but it can lull us into a false sense of security, as there are times when the compiler will not cache the delegate.

One prime example is that the compiler won’t cache the delegate when a method group is used rather than a lambda / anonymous method, e.g.

public void WriteOnPool(Stream stream, byte[] data)
{
    ThreadPool.QueueUserWorkItem(
        WriteOnPoolCallback, 
        Tuple.Create(stream, data));
}

private static void WriteOnPoolCallback(object state)
{
    var poolState = (Tuple<Stream, byte[]>)state;
    poolState.Item1.Write(
        poolState.Item2, 0, poolState.Item2.Length);
}

Some developers prefer the explicitness of having a named method, but the compiler currently generates code like the following for this:

public void WriteOnPool(Stream stream, byte[] data)
{
    ThreadPool.QueueUserWorkItem(
        new WaitCallback(Program.WriteOnPoolCallback), 
        Tuple.Create<Stream, byte[]>(stream, data));
}

private static void WriteOnPoolCallback(object state)
{
    Tuple<Stream, byte[]> poolState = (Tuple<Stream, byte[]>) state;
    poolState.Item1.Write(
        poolState.Item2, 0, poolState.Item2.Length);
}

Notice the delegate allocation on each call.  In such a situation, you might prefer to do the caching manually, or to revert to using a lambda or anonymous method.
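
One workaround is to hoist the delegate into a static field yourself, along the lines of the earlier s_writeOnPoolBody example; here’s a sketch (the s_writeOnPoolCallback field name is just illustrative):

private static readonly WaitCallback s_writeOnPoolCallback =
    new WaitCallback(WriteOnPoolCallback);

public void WriteOnPool(Stream stream, byte[] data)
{
    // The delegate is allocated once when the field is initialized, so each
    // call now only allocates the Tuple<Stream, byte[]> state object.
    ThreadPool.QueueUserWorkItem(
        s_writeOnPoolCallback,
        Tuple.Create(stream, data));
}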

Another example where the compiler won’t cache is when dealing with certain patterns of generics usage.  Consider the following variation.  Here rather than writing a byte[] to the stream, I’m using a BinaryFormatter to serialize an arbitrary generic T to the stream:

public void WriteOnPool<T>(Stream stream, T data)
{
    ThreadPool.QueueUserWorkItem(state =>
    {
        var poolState = (Tuple<Stream, T>)state;
        new BinaryFormatter().Serialize(
            poolState.Item1, poolState.Item2);
    }, Tuple.Create(stream, data));
}

The compiler will currently not cache a delegate for this case, instead generating code like the following:

public void WriteOnPool<T>(Stream stream, T data)
{
    ThreadPool.QueueUserWorkItem(
        new WaitCallback(Program.<WriteOnPool>b__0<T>), 
        Tuple.Create<Stream, T>(stream, data));
}

[CompilerGenerated]
private static void <WriteOnPool>b__0<T>(object state)
{
    Tuple<Stream, T> poolState = (Tuple<Stream, T>) state;
    new BinaryFormatter().Serialize(
        poolState.Item1, poolState.Item2);
}

A little thought reveals why the compiler doesn’t use the same caching pattern here that it does elsewhere.  Notice that the method generated for the lambda is itself a generic method; to cache the delegate, the compiler would need a separate static field for each T that might be used, and that isn’t possible on a class that isn’t parameterized by the same generic type parameter.  We can work around this ourselves by creating a generic type to contain the cache field:

public void WriteOnPool<T>(Stream stream, T data)
{
    ThreadPool.QueueUserWorkItem(
        DelegateCache<T>.s_writeOnPoolCallback, 
        Tuple.Create(stream, data));
}

private static class DelegateCache<T>
{
    internal static readonly WaitCallback s_writeOnPoolCallback =
        state =>
    {
        var poolState = (Tuple<Stream, T>)state;
        new BinaryFormatter().Serialize(
            poolState.Item1, poolState.Item2);
    };
}

Theoretically, the compiler could perform a similar transformation, for example creating one such generic cache per assembly and per generic arity, but it currently does not do so.

Another common case where such lack-of-caching goes unnoticed is when dealing with instance methods and data.  Consider the following similar code:

internal class MyCoolType
{
    private Stream m_stream;
    private byte[] m_data;

    public void WriteOnPool()
    {
        ThreadPool.QueueUserWorkItem(delegate
        {
            m_stream.Write(m_data, 0, m_data.Length);
        });
    }
}

This example will generate code like the following:

internal class MyCoolType
{
    private Stream m_stream;
    private byte[] m_data;

    public void WriteOnPool()
    {
        ThreadPool.QueueUserWorkItem(
            new WaitCallback(this.<WriteOnPool>b__0));
    }

    [CompilerGenerated]
    private void <WriteOnPool>b__0(object param0)
    {
        this.m_stream.Write(this.m_data, 0, this.m_data.Length);
    }
}

The anonymous method refers implicitly to ‘this’, closing over member fields of the current instance.  As such, the compiler generates an instance method for the anonymous method, and because the delegate’s Target must be set to this particular instance, each call to WriteOnPool ends up allocating a new delegate.  (The compiler could have added an instance field in order to cache that delegate, but doing so could have negative consequences, such as making every instance of the class bigger, and the compiler can’t easily reason about the global impact of such caching.)  Of course, we have knowledge that the compiler doesn’t: we know that we can use the object state parameter to pass data into the method.  So, we could instead write our code as:

internal class MyCoolType
{
    private Stream m_stream;
    private byte[] m_data;

    public void WriteOnPool()
    {
        ThreadPool.QueueUserWorkItem(state =>
        {
            MyCoolType mct = (MyCoolType)state;
            mct.m_stream.Write(mct.m_data, 0, mct.m_data.Length);
        }, this);
    }
}

or as:

internal class MyCoolType
{
    private Stream m_stream;
    private byte[] m_data;

    public void WriteOnPool()
    {
        ThreadPool.QueueUserWorkItem(
            state => ((MyCoolType)state).Write(), this);
    }

    private void Write()
    {
        m_stream.Write(m_data, 0, m_data.Length);
    }
}

and in both cases, we’ll get the caching we expect.  This is an example of the case I referred to earlier, where we already have a single state object to pass in, and thus we don’t need to allocate a wrapper (e.g. a tuple), which means there are no additional allocations here.
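
For the second version, for instance, the compiler-generated code should look roughly like the earlier cached-delegate example.  This is just a sketch; the actual generated names will differ:

internal class MyCoolType
{
    private Stream m_stream;
    private byte[] m_data;

    [CompilerGenerated]
    private static WaitCallback CachedAnonymousMethodDelegate2;

    public void WriteOnPool()
    {
        // The lambda captures neither locals nor ‘this’, so one delegate
        // instance is cached in a static field and reused on every call.
        if (CachedAnonymousMethodDelegate2 == null)
            CachedAnonymousMethodDelegate2 =
                new WaitCallback(MyCoolType.<WriteOnPool>b__0);

        ThreadPool.QueueUserWorkItem(CachedAnonymousMethodDelegate2, this);
    }

    [CompilerGenerated]
    private static void <WriteOnPool>b__0(object state)
    {
        ((MyCoolType)state).Write();
    }

    private void Write()
    {
        m_stream.Write(m_data, 0, m_data.Length);
    }
}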

Closures and Fast Paths

There are other times when the compiler isn’t yet quite as clever as we might expect.  One case in which this comes up is when using “fast paths” to optimize frequently executed code paths.  Consider a change to our WriteOnPool method: now the goal isn’t to queue a work item to do the write, but rather to ensure the write happens on a ThreadPool thread; as such, if we’re already on a ThreadPool thread, we can avoid the queueing operation:

public static void WriteOnPool(Stream stream, byte[] data)
{
    if (Thread.CurrentThread.IsThreadPoolThread)
    {
        stream.Write(data, 0, data.Length);
    }
    else
    {
        ThreadPool.QueueUserWorkItem(delegate
        {
            stream.Write(data, 0, data.Length);
        });
    }
}

I might then expect that my “fast path,” taken when I’m already on a ThreadPool thread, won’t incur any allocations for the closure/delegate used only on the “slow path.”  But that expectation assumes the compiler allocates the closure inside the “else” block.  In fact, the compiler currently generates code like the following:

public static void WriteOnPool(Stream stream, byte[] data)
{
    var locals3 = new DisplayClass2();
    locals3.stream = stream;
    locals3.data = data;
    if (Thread.CurrentThread.IsThreadPoolThread)
    {
        locals3.stream.Write(locals3.data, 0, locals3.data.Length);
    }
    else
    {
        ThreadPool.QueueUserWorkItem(
            new WaitCallback(locals3.<WriteOnPool>b__0));
    }
}

[CompilerGenerated]
private sealed class DisplayClass2
{
    public byte[] data;
    public Stream stream;

    public void <WriteOnPool>b__0(object param0)
    {
        this.stream.Write(this.data, 0, this.data.Length);
    }
}

Note that the “locals” captured into the closure actually live on the heap-allocated closure object and are accessed there, even from code paths that don’t use the closure.  As a result, the closure object gets allocated at the beginning of the method, which means we still pay for it even when we take only the “fast path” that doesn’t require it.  If we instead move the code that needs the closure into a separate method, we can eliminate that allocation from the fast path, e.g.

public static void WriteOnPool(Stream stream, byte[] data)
{
    if (Thread.CurrentThread.IsThreadPoolThread)
    {
        stream.Write(data, 0, data.Length);
    }
    else WriteOnPoolPrivate(stream, data);
}

private static void WriteOnPoolPrivate(Stream stream, byte[] data)
{
    ThreadPool.QueueUserWorkItem(delegate
    {
        stream.Write(data, 0, data.Length);
    });
}

Moral of the Story

C# and Visual Basic provide some very powerful language features, and the compilers typically do a very good job with what they generate under the hood.  But if your particular scenario demands pedal-to-the-metal performance, you need to at least understand what’s happening on your behalf so that you can better influence it as necessary.  Get friendly with tools like .NET Reflector that let you see what’s actually going on in the compiled code.

If you do use a decompiler for this kind of analysis, be certain that it’s not sweeping such details under the rug for you.  A primary use of these decompilers is to produce high-quality source from low-level IL, and thus they will often attempt to decompile back to source that employs anonymous methods, lambdas, and the like.  If instead your goal is to get a closer view of what’s actually happening, try tweaking the decompiler’s settings to minimize which language features it decompiles back to; e.g. in .NET Reflector, you can change the “optimization level” so that it shows you code like what I’ve shown in this post, rather than hiding these allocations behind a lambda.

One last thought. Throughout this post, I kept harping on the number of allocations, but I didn’t spend much time on the size of the allocations. The number of allocations is only one aspect to consider. We also want to consider the amount of memory being allocated, as that will factor into how often the garbage collector runs. There are lots of tools to help determine the amount of memory .NET applications are allocating, but when prototyping I often find myself using a function like the following to give me a rough sense for what something is costing:

public static class MemoryAnalysis
{
    public static long AveragedApproximateSize<T>(Func<T> rootGenerator) where T : class
    {
        const int iters = 1000000;

        var items = new object[iters];
        long start = GC.GetTotalMemory(true);

        for (int i = 0; i < items.Length; i++)
            items[i] = rootGenerator();

        long end = GC.GetTotalMemory(true);
        GC.KeepAlive(items);

        return (long)Math.Round((end - start) / (double)iters);
    }
}

When I then run:

static void Main()
{
    Console.WriteLine(MemoryAnalysis.AveragedApproximateSize(
        () => new WriteOnPoolState()));
    Console.WriteLine(MemoryAnalysis.AveragedApproximateSize(
        () => new WaitCallback(WriteOnPoolBody)));
}

I see that WriteOnPoolState is 16 bytes and WaitCallback is 32 bytes (in a 32-bit process), which makes sense if you add up the sizes of the objects’ fields plus the object headers.  That means that for the examples where we were able to eliminate the delegate allocation by caching the delegate in a static field, we were not only saving half the number of allocations, but also two-thirds of the memory being allocated.
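
As a rough accounting (assuming a 32-bit CLR, where each object carries an 8-byte header and each reference is 4 bytes): WriteOnPoolState is 8 (header) + 4 (Stream reference) + 4 (byte[] reference) = 16 bytes, while the WaitCallback’s 8-byte header plus its internal delegate fields (target, method pointer, and so on) account for the 32 bytes.  Before caching, each call allocated 16 + 32 = 48 bytes; after caching, only the 16-byte state object remains, hence the two-thirds savings.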

Happy optimizing.
