Does the JIT take advantage of my CPU?

The short answer is yes. One of the advantages of generating native code at runtime is that we know exactly which processor you are running on, and we can tune the code accordingly. Why would we do that for x86? Every generation of x86 processors has its own personality, which usually shows up in two ways:

          New instructions: for example, the SSE and SSE2 instruction sets.


          New ‘moods’: for example, the Pentium wanted the programmer to schedule instructions by hand (in order to fill its 2 execution pipes), branch prediction logic changed between generations, the P4 introduced a trace cache, and so on. Sometimes this even comes in the form of ‘regressions’, such as the P4 preferring ADD REG, 1 over the INC/DEC instructions (which were very frequent in tight loop code).

AMD processors have their own personality too, although in my experience AMD’s are much more predictable and ‘well behaved’, and thus need less work.

Note that we don’t just jump in and implement JIT functionality to take advantage of every processor difference. The process usually starts with identifying something that is hurting in one of the benchmarks or user scenarios we track. We then evaluate the cost of the fix and its risk (every processor-specific optimization makes our test team’s life a bit harder, and makes it easier to introduce bugs that only repro on one processor, which are harder to track down). Once we have all that data, we make a decision.

Here are some examples of processor-specific optimizations in the x86 JIT (none of these will be a big surprise to developers who do machine-level programming; all of them are called out in big fonts in the processor optimization manuals):

          Use of the CMOV instruction when available. This enables conditional moves, which are very useful for branches that are taken in a random (i.e., not predictable by the processor) fashion.


          Use of the FCOMIx family of instructions, which makes floating point comparisons much cheaper.


          Use of SSE2 for memory copies. Memory copies are fun: you would expect something that simple to always be done the same way, but I’ve witnessed four ‘recommended’ ways of doing it in the time I’ve worked with x86: use the string instructions (REP MOVS/STOS), use the floating point registers for the move, use scalar instructions (to get better pairing/parallelization), and now use SSE2. Note that we don’t use SSE2 for floating point code. There are several reasons: we don’t vectorize code (which is where the real win with SSE2 is); SSE2 for scalar floating point is not always a win compared to x87 (the two instruction sets have different latencies for adds and multiplications, so each is better than the other depending on the scenario); and some operations, like converting doubles to floats, were really slow on SSE2. So we decided to invest in making our x87 code better, which we were going to have to support anyway.


          Use of SSE2 for floating point to integer conversion.


          Other minor instruction selection differences, such as avoiding the INC and DEC instructions in hot code on the P4, or avoiding store forwarding problems on P4 and Centrino processors.

We don’t take advantage of other things, such as knowing code cache sizes, etc. One of the reasons is that we don’t want different code on every single machine out there. As usual, there is a trade-off: if we did this, we might get some extra speed in some situations, but realistically it’s more likely that we would produce bugs that only repro on machines meeting n conditions, so introducing more processor-specific optimizations has to be done carefully.

What about NGEN?

NGEN is an interesting case. In previous versions of the CLR (1.0 and 1.1), we treated NGEN compilation the same way we treated JIT compilation, i.e., given that we are compiling on the target machine, we should take advantage of it. In Whidbey, however, this is not the case. We assume a PPro instruction set and generate code as if we were targeting a P4. Why is this? There were a number of reasons:

          Increased predictability of .NET redist assemblies (which makes our support life easier).


          OEMs and other big customers wanted a single image per platform (for servicing, building, and management reasons).

We could have had a command line option to generate platform-specific code in NGEN, but given that NGEN exists mainly to provide better working set behavior, and that this extra option would complicate some other scenarios, we decided not to go for it.


Comments (14)

  1. Shaun Bedingfield says:

    As far as I know, the 64-bit NGEN is done by the Visual C++ team and does take advantage of the additional compilation time available.

  2. davidnotario says:

    Yup, 64-bit NGEN does more than its JIT compiler, but it comes at a price (longer NGEN times). I also don’t know what their story is regarding different CPUs (do they generate code for the different versions of Itanium? etc.).

    I didn’t comment on 64-bit because I work on the x86 JIT 😉

  3. CN says:

    Have you performed any NGEN tests on old P3s (and P4s), comparing 1.x to Whidbey? Not having SSE2 memory copies, and rearranging some integer operations on the assumption of a P4, sound like things that could hurt performance-critical code badly. Is there any odd case you’re aware of that is worse off because of this? Enough to merit avoiding NGEN for code that performs a lot of memory copies, for example?

  4. Thanks for shedding light on this topic.

  5. Chris Nahr says:

    Very interesting post! Another question — are you making use of the fact that NGen can spend more time than the JITter in order to provide more thorough optimizations?

  6. David Notario illustrates in a recent post some of the strategies used to adapt…

  7. davidnotario says:

    CN: With something as complex as a JIT, you will always find an odd case. No, we haven’t seen anything outrageous. As always, you will have to measure. In general, our guideline for x86 is: if you want throughput, use the JIT; for startup / code sharing, use NGEN.

    Chris Nahr: x86 JIT doesn’t take advantage of that. 64 bit ngen does.

  8. akraus1 says:

    I see the NGENed executables in the WINNT\assembly\NativeImages_v2.0.50215_32 directory, but no big difference in size. What does NGEN really do?

    I thought the images would be compiled to native code to a certain degree, but obviously not. At the moment I monitor the managed/unmanaged target sizes in our department. It is very hard to tell from managed vs. unmanaged DLL size how much effort in terms of lines of code went into each. Do you know of any widely accepted factor that normalizes managed DLL size against unmanaged DLL size so you can compare the efforts?

  9. Interesting blog post by someone on the C# compiler team.

    Two points I found most interesting: SSE2…


  11. David Notario illustrates in a recent post some of the strategies used to adapt the code generated by the
