A common execution path optimization


Today I want to talk about an interesting optimization pattern that you may come across in framework code or in high-performance libraries.

The idea is simple: suppose you have a commonly used method with two execution paths – one is simple and very common, while the other has more steps, takes longer to execute, and happens only rarely.

As an example, let’s consider the List<T>.Add method. On the happy path, Add is very simple and efficient: if there is enough space, it adds an item to an internal array. But if the internal array is full, Add resizes the underlying array (by allocating a new array of double the size) and then adds the element to it.

Here is a simple implementation of this method:

public void Add(T item)
{
    if (_size < _items.Length)
    {
        // Common path: the array has enough space
        _items[_size++] = item;
        return;
    }

    // Corner case: need to resize the array
    int newCapacity = _items.Length == 0 ? _defaultCapacity : _items.Length * 2;

    T[] newItems = new T[newCapacity];
    if (_size > 0)
    {
        Array.Copy(_items, 0, newItems, 0, _size);
    }

    _items = newItems;
    _items[_size++] = item;
}

Unfortunately, this implementation is not eligible for inlining by the JIT compiler, because the method is too large. Even if we try to force inlining using MethodImplOptions.AggressiveInlining, nothing good will happen: as we’ll see at the end of the post, inlining big methods doesn’t improve performance.

There is no official documentation on the JIT compiler’s inlining heuristics, but there are enough unofficial sources (such as blog posts) that cover the current behavior. The JIT compiler won’t inline a method in the following cases:

  • The method is marked with MethodImplOptions.NoInlining
  • The method body is larger than 32 bytes of IL code
  • The call is virtual, including interface method invocations
  • The method has complex flow control, such as switch or while
  • The method contains exception handling logic

The Add method shown above falls under the second rule and won’t be inlined due to its size.
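The first rule (and the attribute mentioned above) is controlled through MethodImplAttribute from System.Runtime.CompilerServices. Here is a minimal sketch showing the two extremes; the class and method names are made up for illustration:

```csharp
using System.Runtime.CompilerServices;

public static class InliningHints
{
    // Never inlined, regardless of how small the body is.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int NeverInlined(int x) => x + 1;

    // The JIT may inline this even if the IL body exceeds the usual
    // size threshold (but still won't if the method is virtual or
    // contains exception handling).
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int MayBeInlined(int x) => x + 1;
}
```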

We know that the method has two cases: a lightweight common path when the list has enough space, and a rare case when the list needs to be resized. Based on this knowledge, we can extract the logic for the rare case into a separate method and leave the happy-path code as is:

public void Add(T item)
{
    if (_size < _items.Length)
    {
        // Common path: the array has enough space
        _items[_size++] = item;
        return;
    }

    // Rare case: need to resize the array
    AddItemSlow(item);
}


private void AddItemSlow(T item)
{
    int newCapacity = _items.Length == 0 ? _defaultCapacity : _items.Length * 2;
    T[] newItems = new T[newCapacity];
    if (_size > 0)
    {
        Array.Copy(_items, 0, newItems, 0, _size);
    }

    _items = newItems;
    _items[_size++] = item;
}

The change is very simple, but it makes the method small enough to be inlined by the JIT compiler. We can measure the difference using the excellent benchmarking tool BenchmarkDotNet:

const int DefaultCapacity = 100;
const int NumberOfElements = 500000;

[Benchmark]
public void InlinableList()
{
    var list = new OptimizedList<int>(DefaultCapacity);

    for (int i = 0; i < NumberOfElements; i++)
    {
        list.Add(i);
    }
}

[Benchmark]
public void NonInlinableList()
{
    var list = new NonOptimizedList<int>(DefaultCapacity);

    for (int i = 0; i < NumberOfElements; i++)
    {
        list.Add(i);
    }
}

           Method |      Mean |    StdErr |
----------------- |---------- |---------- |
    InlinableList | 3.3074 ms | 0.0373 ms |
 NonInlinableList | 3.9557 ms | 0.0097 ms |

As you can see, the difference is pretty significant (about 20% for a method that just adds 500,000 elements). You may wonder why the difference is so big. A single call to Add is O(N) in the worst case (when a resize happens), but the amortized complexity is O(1): the underlying array grows exponentially, so in the vast majority of cases adding an element to the list is so cheap that the method-call overhead starts playing a noticeable role.
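To make the “vast majority of cases” concrete, here is a small back-of-the-envelope helper (not part of the post’s benchmark code; the names are mine) that counts how many Add calls hit the slow path under the doubling policy:

```csharp
public static class ResizeMath
{
    // Counts how many resizes a doubling list performs while
    // adding `count` elements, starting from `capacity`.
    public static int CountResizes(int capacity, int count)
    {
        int resizes = 0;
        while (capacity < count)
        {
            capacity *= 2; // the list doubles on every resize
            resizes++;
        }
        return resizes;
    }
}
```

For the benchmark above, CountResizes(100, 500000) returns 13: only 13 of the 500,000 Add calls take the slow path, so the per-call overhead of the common path dominates the total running time.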

In Visual Studio, you can check that the JIT compiler actually inlines the method. To do that, set a breakpoint in both benchmark methods, run the app in Release mode, and then switch to the Disassembly window (Debug => Windows => Disassembly). But be aware that you need to disable two options in the Tools => Options => Debugging => General menu: ‘Suppress JIT optimization on module load’ and ‘Enable Just My Code’.

When to use the pattern?

This is not a common pattern that would be helpful in every application. It may be useful for extremely hot paths in end-user production code, but it is more likely to be helpful in high-performance libraries and frameworks, where even slight overhead is critical because their members may be called on a hot path.

For instance, this pattern is used in the BCL and in several places in TPL Dataflow (StartTaskSafe, OfferAsyncIfNecessary, ProcessAsyncIfNecessary, etc.):

internal static Exception StartTaskSafe(Task task, TaskScheduler scheduler)
{
    if (scheduler == TaskScheduler.Default)
    {
        task.Start(scheduler);
        // We don't need to worry about scheduler exceptions
        // from the default scheduler.
        return null;
    }
    // Slow path with try/catch separated out so that
    // StartTaskSafe may be inlined in the common case.
    else return StartTaskSafeCore(task, scheduler);
}

You may consider using this pattern if you have a heavily used member with two distinct scenarios: fast and common versus heavyweight and rare. If the heavyweight part is not inline-friendly due to its size or the language constructs it uses (like try/finally), then splitting the method in two will help the CLR JIT inline the common part.
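As a sketch of the general shape this takes (all names here are hypothetical), the slow path is sometimes additionally marked with NoInlining, so that its body can never be merged back into callers and bloat the inlined fast path:

```csharp
using System.Runtime.CompilerServices;

public sealed class Cache
{
    private string _value; // hypothetical lazily computed value

    public string GetValue()
    {
        // Fast, common path: small enough to be inlined at call sites.
        return _value ?? GetValueSlow();
    }

    // Rare path; NoInlining keeps it out of the caller's inlined body.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private string GetValueSlow()
    {
        _value = ComputeExpensiveValue();
        return _value;
    }

    private static string ComputeExpensiveValue() => "computed";
}
```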

Beware of too much inlining

You may think that method inlining is such a good thing that frameworks and libraries should force the runtime to inline as much as possible using MethodImplOptions.AggressiveInlining, which was added in .NET 4.5. First of all, the attribute removes only the size-based restriction on inlining; it won’t force the runtime to inline virtual methods (even ‘sealed’ ones) or methods with complicated control flow or exception handling.

But the main reason not to use the attribute without careful measurement is speed itself. Method inlining is a double-edged sword: it reduces the number of instructions executed, but it can make the resulting code bigger.

For instance, we can force inlining of our original implementation of the Add method, but it won’t yield any performance benefits. Actually, the overall performance becomes slightly worse (for more information, see “To Inline or not to Inline: That is the question” by Vance Morrison):

                             Method |      Mean |    StdErr |    StdDev |    Median |
----------------------------------- |---------- |---------- |---------- |---------- |
                   NonInlinableList | 3.7030 ms | 0.0116 ms | 0.0402 ms | 3.6858 ms |
 OriginalListWithAggressiveInlining | 3.8335 ms | 0.0381 ms | 0.2347 ms | 3.7086 ms |
                      InlinableList | 3.3077 ms | 0.0457 ms | 0.2238 ms | 3.1836 ms |

Comments (12)

  1. ssylvan says:

    Another trick that may help in this case is to reduce the dependencies on that branch. In this case, the way that would look is that you always make sure there's at least one free slot in the array, which makes the write unconditional. That way, the "fast path" does no branching, it just writes to the next free slot (unconditionally, since there always is one), then after that it checks if it needs to resize the array. This means the resize condition gets triggered one element sooner (i.e. you resize when length == capacity after you've already added the item), but in return the common case is unconditional.

    1. Thanks, ssylvan, this could be a good trick as well.

    2. Maxim Kornilov says:

      A smart move at first glance, but it may actually hurt performance. Imagine that we know how many elements we need:
      var list = new List<int>(100);
      for (int i = 0; i < 100; i++)
      {
          list.Add(i);
      }
      As a result we have a list with capacity 200 and half of it is unused.
      Some people will say that we should use an array in such cases, or that every good developer should know at least one level down the abstraction. But still, not good.

      1. @Maxim: I agree that this could be weird, but in your case, I would expect that the list allocates 101 elements immediately and won't grow when the 100th element is added.

        And I agree with your concern in general and I would avoid using suggested optimization in the real world because it is leaky.

  2. Kirill says:

    Very interesting! Thanks, especially for benchmark tool!

    1. Thanks, Kirill, glad that you liked the post.

  3. Arash Partow says:

    If possible/available, could you please provide the 99.9th, 97th, 95th, 75th and 50th percentiles of the results? They would be more informative than the mean in this particular context.

    1. I'm using BenchmarkDotNet for testing and I'm not sure that the tool provides this flexibility. But I don't think it matters much here, because the StdDev is fairly small.

      1. Matt Warren says:

        @Arash, @Sergey

        Actually BenchmarkDotNet does give you percentiles, you just need to add them to your config, they're not enabled by default. See http://benchmarkdotnet.org/Advanced/Percentiles.htm for more info

        1. Thanks a lot, Matt, this is very helpful.

  4. MuiBienCarlota says:

    In your example, AddItemSlow could be a good candidate for a local function, i.e., declared inside Add.
    Do you know whether a C# 7 local function has an impact on inlining?

    1. I haven't checked, but I'm almost sure that using local functions will lead to the same result.
