To Inline or not to Inline: That is the question


In a previous posting, I mentioned that .NET V3.5 Service Pack 1 had significant improvements in the just-in-time (JIT) compiler for the X86 platform, and in particular that its ability to inline methods was improved (especially for methods with value type arguments).   Well, now that this release is publicly available, my claim can be put to the test.   In fact, an industrious blogger named Steven did just that and blogged about it here.   What he did was to create a series of methods, each a bit bigger than the previous one, and determine whether they got inlined or not.   Steven did this by throwing an exception and then programmatically inspecting the stack trace associated with the exception.    This makes sense when you are trying to automate the analysis, but for simple one-off cases, it is simpler and more powerful to look at the native instructions.  See this blog for details on how to do this using Visual Studio.
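For readers who have not seen the stack-trace technique, here is a minimal sketch of the idea. It is not Steven's actual code: the names InlineProbe, Candidate, Throw and CandidateWasInlined are hypothetical, and the only assumption is that a method which has been inlined no longer shows up as its own frame in the exception's stack trace.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class InlineProbe
{
    // Hypothetical helper: always throws. It is marked NoInlining so that
    // the throw site itself always keeps its own stack frame.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Throw(int value)
    {
        throw new InvalidOperationException("probe " + value);
    }

    // Hypothetical method under test: small enough to be an inline candidate.
    public static void Candidate(int a)
    {
        if (a < 0 || a == 100)
        {
            Throw(a * 2);
        }
    }

    // Returns true if no frame named "Candidate" appears in the trace,
    // which is taken here as evidence that the JIT inlined it.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static bool CandidateWasInlined()
    {
        try
        {
            Candidate(100);
        }
        catch (InvalidOperationException ex)
        {
            StackFrame[] frames = new StackTrace(ex).GetFrames();
            foreach (StackFrame frame in frames ?? new StackFrame[0])
            {
                var method = frame.GetMethod();
                if (method != null && method.Name == "Candidate")
                {
                    return false; // Candidate still has its own frame
                }
            }
            return true;
        }
        throw new InvalidOperationException("Candidate did not throw");
    }

    public static void Main()
    {
        Console.WriteLine("Candidate inlined: " + CandidateWasInlined());
    }
}
```

The answer depends on the runtime version and on how the program is run. In a debug build, or when started under a debugger with JIT optimizations suppressed, nothing is inlined and the probe will report false.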


What Steven found was that when the JIT considered inlining the following method


        public void X18(int a)
        {
            if (a < 0 || a == 100)
            {
                Throw(a * 2);
            }
        }


It did not inline.  This was not what Steven expected, because this method was only 18 bytes of IL.  The previous version of the runtime would inline methods up to 32 bytes of IL.   It seems like the JIT’s ability to inline is getting worse, not better.   What is going on?


Well, at the heart of this anomaly is a very simple fact: It is not always better to inline.  Inlining always reduces the number of instructions executed (since at a minimum the call and return instructions are not executed), but it can (and often does) make the resulting code bigger.  Most of us would intuitively know that it does not make sense to inline large methods (say 1 KB), and that inlining very small methods that make the call site smaller (because a call instruction is 5 bytes) is always a win, but what about the methods in between?


Interestingly, as you make code bigger, you make it slower, because memory is inherently slow, and the bigger your code, the more likely it is not in the fastest CPU cache (called L1), in which case the processor stalls 3-10 cycles until the code can be fetched from another cache (called L2), and if it is not there, from main memory (taking 10+ cycles).  For code that executes in tight loops, this effect is not problematic because all the code will ‘fit’ in the fastest cache (typically 64K); however, for ‘typical’ code, which executes a lot of code from a lot of methods, the ‘bigger is slower’ effect is very pronounced.  Bigger code also means more disk I/O to get the code off the disk at startup time, which means that your application starts slower.


In fact, the first phase of the JIT inlining improvement was simply to remove the restrictions on JIT inlining.   After that phase was complete we could inline A LOT, and in fact the performance of many of our ‘real world’ benchmarks DECREASED.   Thus we had irrefutable evidence that inlining could be BAD for performance.  We had to be careful; too much inlining was a bad thing. 


Ideally you could calculate the effect of code size on caching and make a principled decision on when inlining was good and bad.   Unfortunately, the JIT compiler does not have enough information to take such a principled approach.   However, some things were clear:


1.     If inlining makes code smaller than the call it replaces, it is ALWAYS good.  Note that we are talking about the NATIVE code size, not the IL code size (which can be quite different).


2.     The more a particular call site is executed, the more it will benefit from inlining.  Thus code in loops deserves to be inlined more than code that is not in loops.


3.     If inlining exposes important optimizations, then inlining is more desirable.  In particular, methods with value type arguments benefit more than normal because of optimizations like this, and thus having a bias to inline these methods is good.


Thus the heuristic the X86 JIT compiler uses, given an inline candidate, is:


1.     Estimate the size of the call site if the method were not inlined.


2.     Estimate the size of the call site if it were inlined.  This is an estimate based on the IL: we employ a simple state machine (Markov model), created using lots of real data, to form this estimator logic.


3.     Compute a multiplier.   By default it is 1.


4.     Increase the multiplier if the code is in a loop (the current heuristic bumps it to 5 in a loop).


5.     Increase the multiplier if it looks like struct optimizations will kick in.


6.     If InlineSize  <= NonInlineSize * Multiplier do the inlining. 


What this means is that by default, only methods that do not grow the call site will be inlined; however, if the code is in a loop, it can grow as much as 5x.
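The decision rule in the steps above can be sketched in a few lines of C#. This is only an illustration of the arithmetic, not the JIT's actual code: the class and method names are hypothetical, the two size estimates are passed in by the caller rather than computed from IL, and the struct bonus value (and the choice to apply it by multiplying) is a made-up placeholder because the text gives no number for it. Only the default multiplier of 1 and the loop multiplier of 5 come from the description.

```csharp
using System;

public static class InlineHeuristicSketch
{
    public const int DefaultMultiplier = 1;       // from the text
    public const int LoopMultiplier = 5;          // from the text
    public const int HypotheticalStructBonus = 2; // placeholder, not a real value

    // Steps 3-5: start at 1, bump to 5 in a loop, increase again if
    // struct optimizations look likely (how much is an assumption here).
    public static int ComputeMultiplier(bool inLoop, bool structOptimizationsLikely)
    {
        int multiplier = DefaultMultiplier;
        if (inLoop)
        {
            multiplier = LoopMultiplier;
        }
        if (structOptimizationsLikely)
        {
            multiplier *= HypotheticalStructBonus;
        }
        return multiplier;
    }

    // Step 6: inline only if InlineSize <= NonInlineSize * Multiplier.
    // Both sizes are estimated NATIVE code sizes in bytes.
    public static bool ShouldInline(int inlineSize, int nonInlineSize,
                                    bool inLoop, bool structOptimizationsLikely)
    {
        int multiplier = ComputeMultiplier(inLoop, structOptimizationsLikely);
        return inlineSize <= nonInlineSize * multiplier;
    }

    public static void Main()
    {
        // A call site estimated at 10 bytes as a call and 30 bytes inlined.
        Console.WriteLine(ShouldInline(30, 10, false, false)); // False: 30 > 10 * 1
        Console.WriteLine(ShouldInline(30, 10, true, false));  // True: 30 <= 10 * 5
    }
}
```

With these example numbers, the same method is rejected at a straight-line call site and accepted at a call site inside a loop, which is the behavior the heuristic is designed to produce.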


What does this mean for Steven’s test?


It means that simple tests based solely on IL size are not accurate.   First, what is important is the native size, not the IL size, and more importantly, a method is much more likely to be inlined if its call site is in a loop.  In particular, if you modify Steven’s test so that the methods are in a loop when they are called, all of his test methods get inlined.
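To make the difference between the two kinds of call site concrete, here is a small sketch. It is an assumption-laden stand-in for Steven's test, not his code: the class name is hypothetical, and Throw is replaced by a helper that records its argument instead of throwing so that the example runs to completion.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class LoopCallSiteSketch
{
    private static int recorded;

    public static int Recorded
    {
        get { return recorded; }
    }

    // Hypothetical stand-in for Steven's Throw: it only records the value.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Throw(int value)
    {
        recorded += value;
    }

    // Same shape as the X18 method shown earlier.
    private static void X18(int a)
    {
        if (a < 0 || a == 100)
        {
            Throw(a * 2);
        }
    }

    // Call site outside a loop: per the heuristic, the multiplier stays at 1.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void CallOnce(int a)
    {
        X18(a);
    }

    // Call site inside a loop: per the heuristic, the multiplier is bumped to 5.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void CallInLoop(int count)
    {
        for (int i = 0; i < count; i++)
        {
            X18(i);
        }
    }

    public static void Main()
    {
        CallOnce(100);
        CallInLoop(200);
        Console.WriteLine(Recorded);
    }
}
```

Both callers behave identically from the program's point of view; whether X18 is actually inlined in either one can only be confirmed by looking at the native code for CallOnce and CallInLoop in an optimized build.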


To be sure, the heuristics are not perfect.  The worst case is a method that is too big to be inlined, but is still called A LOT (it is called in a loop) and calls other small methods that COULD be inlined but are not, because their call sites are not in a loop.   The problem is that the JIT does not know whether the method is called a lot or not, and by default does not inline in that case.   We are considering adding an attribute to a method which gives a strong hint that the method is called a lot and thus would bump the multiplier, much as if there were a loop, but this does not exist now.


We definitely are interested in feedback on our inlining heuristic.  If Steven or anyone else finds real examples where we are missing important inlining opportunities, we want to know about them, so we can figure out whether we can adjust our heuristics (but please keep in mind that they are heuristics; they will never be perfect).


 

Comments (21)

  1. Pon says:

    I think more JIT hints DEFINITELY need to be added. Some of the main ways to influence JIT compilation would be:

    * A number of inlining attributes, such as the MSVC __forceinline keywords, would allow us to have better control and optimise for when we know better than the JIT.

    * An attribute saying how much we want the JIT to optimise. If I have code that’ll run in a tight loop (such as the AI logic in a game), and I need to squeeze every cycle out of it, I can afford an extra second or two during a load screen for the JIT to spend optimising it. The attribute could be an enum of optimisation levels, or a time threshold in milliseconds for which to spend optimising, or something similar. This could also apply to NGEN.

    * Following on from the point above, I’m not sure if something like this exists yet, but provide a method that would ensure a method is JITted, similar to the RuntimeHelpers method that calls class constructors. This, combined with an optimisation level as an overloaded argument, would be very powerful.

  2. jim_arnold says:

    -1 for adding attributes/hints.  I don’t want my code littered with compiler directives!  How about using adaptive optimisation which makes improvements as the code runs?  I would rather the compiler figure out what’s *really* happening at runtime than force a developer to describe what they *think* will happen.

    Jim

  3. rbirkby says:

    So I assume from what you said that over a real-world application, .Net 3.5SP1 inlines fewer methods than .Net 3.5RTM?

    I was under the impression that the .Net 1.0 JIT stopped at 32 bytes of IL due to the time required to analyze the method to see whether it was a candidate for inlining or was too complex? So JITted code size is now the metric to be concerned with instead of IL complexity.

    > We are considering adding an attribute to a method which gives a strong hint that the method is called a lot

    > and thus would bump the multiplier

    That would be a gross hack. Other VMs handle this through tracing JITs.

  4. curth says:

    How does this heuristic differ from the one before the service pack?

  5. CuttingEdge says:

    Thanks Vance for your fast response and letting us know what’s happening under the hood.

    >> Increase the multiplier if it looks like struct optimizations will kick in.

    Can you tell more about the heuristics of these struct optimizations? I’m very interested in them!

  6. vancem says:

    The inlining heuristic used before was relatively simple.   Most things under 32 bytes of IL got inlined IF they did not hit limitations in the inliner.    One significant limitation was value types, and another was that only one conditional branch was allowed and no loops.   There were other limitations that occur more rarely…

    Those limitations are gone now (and in fact we are relatively aggressive about inlining value types now).  I think it is fair to say that we tend to inline less than we did before outside of loops, but significantly more than we used to if we are in a loop.

    Vance

  7. stephan_ says:

    I’m a bit curious what the "typical code" you test CLR performance with looks like. I’d guess that it’s mostly the kind of application where performance isn’t that important in the first place. I’ve yet to see any computationally intensive code (e.g. simulations, numerical linear algebra, text/media processing) where (even forced) inlining actually had a negative effect. Improved code locality and new optimization opportunities almost always trump any negative effect of the increased code size.

    I don’t understand why some people argue against attributes that could hint to the JIT that a developer really, really wants a particular method inlined or that the JIT should spend extra time optimizing a particular method. If you find attributes ugly, don’t use them. If you have concerns about their security or about performance when executed in the web browser of your mobile phone, disable them (with an appropriate global option).

    The currently employed heuristics are far from perfect. On the other hand, manually measuring whether inlining improves performance is a trivial task. So why not give the users the choice?

    Now, if you took performance really seriously you would give users the option to compile their CLR code with a full-blown native-code compiler, ideally one supporting profile-guided optimizations (think Ngen with the optimizer back-end from Visual C++).

  8. Pon says:

    stephan++.

    I am under the impression that Phoenix will soon be able to (or already can) do PGO with .Net assemblies. Combine this with its awesome MSIL to native compilation, and that’s pretty much what stephan describes.

  9. indi says:

    I’m all for the compiler deciding minor inlining elements, but there are some damn good reasons to allow the programmer to force an inline. I understand that some people are basically stupid and think inlining is always a win and will tag everything, but I don’t think you should make a language for the lowest common denominator. I know my program far better than the compiler or the JIT, and there are cases where I really do want it to inline no matter what the size! Here’s a couple of examples…..

    First, if I have 10 or 20 largish functions that basically have the same code internally, then splitting that code out into a separate method makes sense from a maintenance point of view. It means I only ever have to maintain one function and not 10 to 20. Cut and paste errors are the root of many bugs and are easily avoided.

    Second, if I have a common interface that I want users to use but know it’s going to be called hundreds of thousands of times, then being able to force the inline way past your current multiplier is very desirable. If I am making geometry in a reasonably large loop, but each position, normal, colour, texture coordinate etc. is being set via an interface, then the function itself might be just too large to inline with your current ideal values. Yet I KNOW that it’s going to be run 100,000 times or more, so I really DO want it inlined.

    This is true of many other methods. If I have a vector class with intersection tests, then I can either cut and paste the code to inline (causing errors later when I forget to update them), or take the hit of a million calls inside my ray tracing or physics function (or something).

    The other thing I would say is that inlining should happen at the IL level. It currently looks like you compile the code, then inline it. This means register usage is not maintained, so if I have 2 functions that need EAX to be a certain pointer, it is reloaded each time, causing the code to be double the size it should be. (load EAX, load value, store to EAX +0, then load EAX, load value, store to EAX +4 etc.)

  10. static T whatever = null;
    public T Whatever {
     get {
       if (whatever == null) {
         // loads of code
         whatever = …
       }
       return whatever;
     }
    }

    It might give a nice boost if the JIT were able to inline everything but the if statement’s body. Because this is what usually gets executed, except the first time. I have no numbers that could indicate how large an impact this would make. But I do know that we use "globals" via ThreadStatic scopes a lot. There are few of them, but those get executed all the time. I bet a lot of LoB apps are like that (application/transaction context etc).

    I leave it to more gifted people to figure out how this could be detected. (structurally? runtime profiling?) Just an idea.

    Would it even make sense to take a similar approach to TraceMonkey in the CLR? Or is the JIT so fast already that tracing would introduce a disproportionate overhead?

  11. Thinking about it, it should be quite simple to manually structure those methods so that the default inlining strategy does just that. Maybe this is just an awareness thing.

    if (whatever == null)
     InitializeWhatever();
    return whatever;

  12. rklaehn says:

    First of all, the inlining issue is not yet fixed for x64: see my comment in this feedback item: https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=93858

    With respect to the heuristics: I think the heuristics make sense, but there should be some way to override them with an attribute. That might not be the most elegant solution, but some of us need that performance NOW and not in 5 years when a profile-guided optimizer for the CLR JIT is available.

    Another problem is that the code generation of the JIT is just horrible. A major point of inlining is that it enables subsequent optimizations. But when you look at the generated assembly code, you often see many completely redundant instructions.

    For example, in floating point code you often see completely unnecessary pairs of stores/loads in tight loops like this:

    0000001b  fstp        qword ptr [ebp-10h]
    0000001e  fld         qword ptr [ebp-10h]

    Not only does this destroy the performance, but it also makes the native code larger and thus prevents the inlining heuristics from kicking in.

  13. rklaehn says:

    One more thing:

    Loops are not the only way for a piece of code to be called multiple times. For example, if you do some kind of tree traversal, usually the best way to do it is recursion.

    I think a heuristic should be added that if a method calls itself recursively, the same boost as with a loop applies. You need this information anyway for doing tail call optimization, right?

  14. Meazza-mFrog says:

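    On the .NET platform, most compiler optimizations are not performed by the VB and C# compilers. Instead, they defer optimization until the CLR's just-in-time (JIT) compiler reads the IL and converts it into native machine code. Because of this…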

  15. RawMan says:

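    On the .NET platform, most compiler optimizations are not performed by the VB and C# compilers. Instead, they defer optimization until the CLR's just-in-time (JIT) compiler reads the IL and converts it into native machine code. Because of this…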

  16. Eric Cosky says:

    I would very much like some control over inlining. As indi mentioned above, aiming for the lowest common denominator is not a good thing; it will only prevent C# from becoming more powerful and more widely adopted. Having default behavior be sane & "generally good" is a good thing. Having the ability to override default behavior when it really matters to the application is very important. For instance, right now I’m realizing I’ll need to quite literally cut & paste some matrix math code into a core loop just to save some significant cycles in a particular function that is high on the profile results. I don’t care if this takes a little more memory and risks cache misses elsewhere; I would much prefer to have the option to decide that myself than to be protected from myself. When programmers are at the point of analyzing IL and spending hours doing profiling, it’s time to allow them to take off a bit of the safety net and see for themselves what delivers the best performance.

    PS While we’re at it please talk to someone about adding a standardized vector register friendly math library :) Games need it, badly, not just for XNA but any C# game running in standard Windows applications or XBAPs.

  17. joebarthib says:

    Thank you for this article, Vance. Could you explain the following surprising behavior to me:

    this method is NOT inlined

    public static float ConvertCoordinateFromDegreeToMm ( float coordinateValue, float radius )
    {
       return coordinateValue * radius * DEGREE_TO_RADIAN_COEF;
    }

    whereas this one is!

    public static float ConvertCoordinateFromDegreeToMm ( float coordinateValue, float radius )
    {
       float result = coordinateValue * radius * DEGREE_TO_RADIAN_COEF;
       return result;
    }

    I’m using VS 2008 pro, .Net 3.5 SP1 on 32 bit Win XP pro, and my cpu is a core2 duo P8600. I’ve tried calling the method from a loop or not, and with different sizes of the calling site, I get the same results. I’m checking inlining within VS disassembly window, having unchecked "Suppress JIT optimization on module load".

    Thank you for your answer!

  18. Eugene says:

    I'm surprised the following is not being inlined unless using the 4.5 attribute AggressiveInlining:

       static int Main(string[] args)
       {
           return DoSomething(10000000);
       }

       static int DoSomething(int repetitions)
       {
           int trues = 0;
           // Count with an int and cast at the call: a char counter wraps
           // around at 65535 and would never reach repetitions.
           for (int i = 0; i < repetitions; i++)
           {
               if (IsControl((char) i))
                   trues += 1;
           }
           return trues;
       }

       public static bool IsControl(char c)
       {
           return ((c >= 0 && c <= 31) || (c >= 127 && c <= 159));
       }

    N.B. This is derived from stackoverflow.com/…/709537 , adding the loop (to boost the likelihood of inlining) and actually making use of the return value (to prevent complete elision).