In a previous posting, I mentioned that .NET V3.5 Service Pack 1 had significant improvements in the Just in time (JIT) compiler for the X86 platform, and in particular its ability to inline methods was improved (especially for methods with value type arguments). Well now that this release is publically available my claim can be put to the test. In fact a industrious blogger named Steven did just that and blogged about it here. What he did was to create a series of methods each a bit bigger than the previous one, and determined whether they got inlined or not. Steven did this by throwing an exception and then programmatically inspecting the stack trace associated with the exception. This makes sense when you are trying to automate the analysis, but for simple one-off cases, it is simpler and more powerful to simply look at the native instructions. See this blog for details on how to using Visual Studio.
What Steven found was that when tried to inline the following method
public void X18(int a)
if (a < 0 || a == 100)
Throw(a * 2);
It did not inline. This was not what Steven expected, because this method was only 18 bytes of IL. The previous version of the runtime would inline methods up to 32 bytes. It seems like the JIT’s ability to inline is getting worse, not better. What is going on?
Well, at the heart of this anomaly is a very simple fact: It is not always better to inline. Inlining always reduces the number of instructions executed (since at a minimum the call and return instructions are not executed), but it can (and often does), make the resulting code bigger. Most of us would intuitively know that it does not make sense to inline large methods (say 1Kbytes), and that inlining very small methods that make the call site smaller (because a call instruction is 5 bytes), are always a win, but what about the methods in between?
Interestingly, as you make code bigger, you make it slower, because inherently, memory is slow, and the bigger your code, the more likely it is not in the fastest CPU cache (called L1), in which case the processor stalls 3-10 cycles until it can be fetched from another cache (called L2), and if not there, in main memory (taking 10+ cycles). For code that executes in tight loops, this effect is not problematic because all the code will ‘fit’ in the fastest cache (typically 64K), however for ‘typical’ code, which executes a lot of code from a lot of methods, the ‘bigger is slower’ effect is very pronounced. Bigger code also means bigger disk I/O to get the code off the disk at startup time, which means that your application starts slower.
In fact, the first phase of the JIT inlining improvement was simply to remove the restrictions on JIT inlining. After that phase was complete we could inline A LOT, and in fact the performance of many of our ‘real world’ benchmarks DECREASED. Thus we had irrefutable evidence that inlining could be BAD for performance. We had to be careful; too much inlining was a bad thing.
Ideally you could calculate the effect of code size on caching and make a principled decision on when inlining was good and bad. Unfortunately, the JIT compiler has does not have enough information to take such a principled approach. However some things where clear
1. If inlining makes code smaller then the call it replaces, it is ALWAYS good. Note that we are talking about the NATIVE code size, not the IL code size (which can be quite different).
2. The more a particular call site is executed, the more it will benefit from inlning. Thus code in loops deserves to be inlined more than code that is not in loops.
3. If inlining exposes important optimizations, then inlining is more desirable. In particular methods with value types arguments benefit more than normal because of optimizations like this and thus having a bias to inline these methods is good.
Thus the heuristic the X86 JIT compiler uses is, given an inline candidate.
1. Estimate the size of the call site if the method were not inlined.
2. Estimate the size of the call site if it were inlined (this is an estimate based on the IL, we employ a simple state machine (Markov Model), created using lots of real data to form this estimator logic)
3. Compute a multiplier. By default it is 1
4. Increase the multiplier if the code is in a loop (the current heuristic bumps it to 5 in a loop)
5. Increase the multiplier if it looks like struct optimizations will kick in.
6. If InlineSize <= NonInlineSize * Multiplier do the inlining.
What this means is that by default, only methods that do not grow the call site will be inlined, however if the code is in a loop, it can grow as much as 5x
What does this mean for Steven’s test?
It means that simple tests based solely on IL size are not accurate. First what is important is the Native size, not the IL size, and more importantly, it is much more likely to be inlined if the method is in a loop. In particular, if you modify Steven’s test so that the methods are in a loop when they are called, in fact all of his test methods get inlined.
To be sure, the heuristics are not perfect. The worse case is a method that is too big to be inlined, but is still called A LOT (it is in a loop) and calls other small methods that COULD be inlined but are not because they are not in a loop. The problem is that the JIT does not know if the method is called a lot or not, and by default does not inline in that case. We are considering adding an attribute to a method which gives a strong hint that the method is called a lot and thus would bump the multiplier much like if there was a loop, but this does not exist now.
We definitely are interested in feedback on our inlining heuristic. If Steven or anyone else finds real examples where we are missing important inlining opportunities we want to know about them, so we can figure out whether we can adjust our heuristics (but please keep in mind that they are heurisitics. They will never be perfect).