NGen Overview





NGen Primer


I thought it would be useful to provide a
primer on the NGen tool and pre-jitting your code for performance reasons. 
In particular, there are some gotchas you must be aware of when authoring your
product.  In this entry, I’m going to cover some background material on
paging (which you can skip if you are an expert
already).  Then we’ll cover the workings of the NGen tool, some servicing
implications, and finally some future directions.

Before we get started, let me keep up a Microsoft tradition
and include the key takeaways right here.  If
you get nothing else out of this topic or can’t read the whole thing, make sure
you absorb the following:

  • NGen is important to getting faster startup
    through better page sharing and working set reduction
  • NGen is primarily interesting for client
    scenarios which need the faster startup to be responsive
  • NGen is not recommended for
    Asp.Net because the assemblies NGen produces cannot be shared between
    App Domains
  • NGen for V1.0 and V1.1 was primarily designed
    for the CLR (where we’ve seen dramatic wins), and while it can be used for shared libraries and client apps:
      • Always measure to make sure it is a win for
        your application
      • Make sure your application is well behaved
        in the face of brittleness and servicing

Bottom line recommendation: 
keep your eye on the technology, experiment with it, but plan to
wait for a future version before really pulling it into your
application.

Paging Primer

Windows uses a virtual address space on your machine, so for a 32-bit system you get
from 0 to 4GB of addressable memory for each process.  Windows code is typically compiled into a Portable
Executable file (PE file), which contains sections of code and data marked with
page attributes like read, write, and execute.  When the OS loads such a
file into a process, it maps the memory from your file into physical pages that
can be addressed by the process.  So far so good?

On the x86, calls to methods are typically in the form of "call address",
where address is an absolute value from 0 to 4GB, and tells the CPU the precise location it should transfer
to.  This poses a problem for the compiler, because it means that when the
user’s file is loaded, it needs to know precisely where all of the methods it
will call inside that file live (not just relative to the start of the file, but
the absolute address in the entire process).  There are two things that
kick in here to aid you:

  Base Address

This is the address you specify as a developer (either through your compiler,
e.g. /baseaddress in VB.Net or C#, or using the rebase tool) where you want
your executable to be loaded.  The compiler will now assume the file will get
loaded there, and can now predict the absolute address of every method in the file.

  Relocs

Just in case your file can’t be
loaded at that base address, say because another module is already loaded there, the
compiler will emit a set of relocs in the file that tell the OS where
absolute addresses are located in the image.  If the file gets relocated to
a new place in the process, the OS will fix up the addresses,
essentially adjusting them to the new home of the code or data.  This allows
flexibility, but is also expensive; keep reading to find out why.
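To make the reloc idea concrete, here is a minimal sketch in Python of what a fix-up pass does.  It is not the actual Windows loader: the 12-byte "image", the offsets, and the function name apply_relocations are all made up for illustration, and real PE relocation records carry more detail than a plain list of offsets.

```python
import struct

def apply_relocations(image, reloc_offsets, preferred_base, actual_base):
    """Return a copy of image with every listed 32-bit absolute address rebased.

    reloc_offsets plays the role of the reloc table: it says where in the
    image the absolute addresses live (hypothetical, simplified format).
    """
    delta = actual_base - preferred_base
    patched = bytearray(image)
    for offset in reloc_offsets:
        (address,) = struct.unpack_from("<I", patched, offset)
        struct.pack_into("<I", patched, offset, (address + delta) & 0xFFFFFFFF)
    return bytes(patched)

# Toy image built for base 0x10000000: two absolute addresses and one plain value.
PREFERRED_BASE = 0x10000000
image = struct.pack("<III", PREFERRED_BASE + 0x1000, 0xDEADBEEF, PREFERRED_BASE + 0x2040)
relocs = [0, 8]  # only the first and third words are addresses

moved = apply_relocations(image, relocs, PREFERRED_BASE, 0x20000000)
print([hex(v) for v in struct.unpack("<III", moved)])
# prints ['0x20001000', '0xdeadbeef', '0x20002040']
```

Notice that the fix-up has to write into the image.  When the file loads at its preferred base, delta is zero and nothing needs to be touched.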

Besides allowing the compiler to stitch
together your program, a base address gives you a predictable location for your
file to get loaded every time it is executed.  This is important, because
if you have sections of the file (say all of your executable code) that are read
only, then we’d like to be as efficient as possible on the machine and share those
pages between processes.  The OS accomplishes this if your pages are marked
read only and sharable.  So if you have the code for strcpy from msvcrt.dll at
location 0x70124800, then the one page of physical memory where that code lives can be
viewed in all of the processes on the machine that also need it, provided those processes have
loaded msvcrt.dll at the same address.

[Figure: User Process 1, Kernel Mapped Pages, and User Process 2.  The same shared page is mapped into both user processes; writeable pages are unique per process.]

See the advantage?  Overall system memory
pressure goes down with shared pages because only one physical page is used no matter how many
times you load it.  Also, speed of loading code goes up, because chances
are the system already has that file loaded in some other process on the
machine.  This is typically referred to as a "warm startup", because the OS
has already loaded many of the pages you need, and doesn’t have to
go out to disk to get them.  So bottom line, sharing of pages between
processes is a GOOD THING.

I mentioned that having to relocate a file away
from its base address is a BAD THING for your shareable data.  This loss of sharing is the reason. 
If you cannot load at your preferred base address, then the absolute addresses in those
otherwise sharable pages are now wrong.  So the OS has to make a copy of
the page for your process, mark it writable, and then fix up all of the invalid
values.  This is bad because it takes both more time (slower load
times) and more space (for the extra unshared pages).
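Here is a rough way to estimate that cost, as a sketch only.  It assumes 4 KB pages (the normal x86 page size), 4-byte addresses, and a hypothetical list of reloc offsets; the function names are mine, not part of any tool.

```python
PAGE_SIZE = 4096  # normal x86 page size

def pages_needing_copy(reloc_offsets, addr_size=4, page_size=PAGE_SIZE):
    """Return the sorted page indices that contain any byte of a relocated address."""
    pages = set()
    for offset in reloc_offsets:
        pages.add(offset // page_size)
        # An address that straddles a page boundary dirties both pages.
        pages.add((offset + addr_size - 1) // page_size)
    return sorted(pages)

def private_bytes_per_process(reloc_offsets, page_size=PAGE_SIZE):
    """Memory each relocated process holds privately instead of sharing."""
    return len(pages_needing_copy(reloc_offsets, page_size=page_size)) * page_size

# Hypothetical offsets: one in page 0, one straddling pages 0 and 1, one in page 3.
offsets = [0x0010, 0x0FFE, 0x3000]
print(pages_needing_copy(offsets))         # prints [0, 1, 3]
print(private_bytes_per_process(offsets))  # prints 12288
```

A single address in a page is enough to cost you the whole page, so even a sparse reloc table can make most of an image private once it is relocated.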

I should point out that some pages are, of course, intended
to be per-process.  Your global data for example wouldn’t make much sense
if you were sharing it with another running instance of your application! 
But in general we try very hard to reduce the number of pages in the system
because of the high cost of the extra memory pressure.

Back to Managed

Ok, all of this background is interesting, but
what does this have to do with Managed code and the CLR?  First, we also
use the PE file format for managed code, so your VB.Net application will be
stored in the same file format as kernel32.dll.  This allows managed
executables to appear anywhere you would normally expect.  For example if you want to do a CoCreateInstance on your managed code, or do a LoadLibrary directly,
you can do so. 
This file format choice means we have to follow the same rules for assigning base addresses.  And
guess what?  We made the metadata and IL your compiler generates read only +
sharable so we could use the same memory management benefits you get with
unmanaged code.

Now think about what the JIT compiler does for
a minute.  It just-in-time compiles your program one method at a time.  That
means we allocate, on the fly, some memory and write the necessary native code
for your program out to
that location.  When we need to call a method, we know where we put it in
the absolute address range, so we can do the same "call address" you saw
for unmanaged code.  The advantage of the JIT is that it can literally
stitch your program together as you go, and it only compiles the code that
you actually execute.  But since this is happening on the fly, all of those pages where this code is allocated
are for that process only.  We get none of the sharing advantages
you got with unmanaged code in read only + sharable pages, and it also takes
time to run that compiler.  We did some experiments early on in the Runtime
as proof of concept for our managed C++ compiler which included recompiling Word as an
IL image.  It worked great!  But it was slow.  Office is a big
application, and using the JIT for this case didn’t put our best foot forward.
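The compile-on-first-call behavior can be modeled in a few lines of Python.  This is only an analogy: Python source text stands in for IL, Python’s built-in compile stands in for the JIT, and the class and method names are invented for the example.

```python
class LazyJit:
    """Toy model of a JIT: a method is compiled the first time it is called."""

    def __init__(self, sources):
        self._sources = sources  # method name -> source text (stand-in for IL)
        self._compiled = {}      # per-process cache, never shared with other processes

    def call(self, name, *args):
        fn = self._compiled.get(name)
        if fn is None:
            # First call: pay the compile cost and remember the result.
            namespace = {}
            exec(compile(self._sources[name], "<jit:%s>" % name, "exec"), namespace)
            fn = self._compiled[name] = namespace[name]
        return fn(*args)

    def compiled_methods(self):
        return sorted(self._compiled)

sources = {
    "add": "def add(a, b):\n    return a + b\n",
    "unused": "def unused():\n    return 0\n",
}
jit = LazyJit(sources)
print(jit.call("add", 2, 3))   # prints 5
print(jit.compiled_methods())  # prints ['add']
```

Only the method that actually ran got compiled, which is the advantage.  The compiled cache lives inside one process, which is the cost: a second process would have to build its own.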

[Figure: User Process 1, Kernel Mapped Pages, and User Process 2.  Shared pages are mapped into both user processes; writeable pages are unique per process; JIT’d code lives in writable pages.]

Wouldn’t it be great if you could get the same
page sharing advantage as unmanaged code, and not have to run the JIT every time
for a big application like Office?  That’s the NGen tool, and we’ll drill
into that in the next section.

NGen Tool Overview

NGen stands for "Native Image Generator". 
The tool allows us to run the JIT compiler on all of your IL in an assembly (a
PE file) at one sitting, and cache the results out to disk.  Now when you
want to load and run that assembly, we can find it in the cache and load it just
like an unmanaged image.  Because the code is read only + sharable, you get
the same benefits of page sharing. 

So what precisely is in that image that gets created? 
Let’s look at the contents:

Header

All PE files contain the standard set of headers,
and an NGen image is no different.

Native Code

Obviously this is the key thing we are trying to get into the image, and
it makes up the bulk of the image size.  The code persisted is
100% native at this point, so that the JIT does not need to get involved
to execute it.

No Metadata or IL

The current NGen produced image does not have a copy of the metadata or
the IL in it.  This is significant, because it means that you will
need to have both the original IL Assembly and the NGen
image loaded at once.  In general we try to avoid touching the metadata and IL
at runtime, but you can’t always avoid it.  Two examples are late
bound programming (e.g. Reflection, which needs name information from the
metadata) and JIT’ing of non-NGen’d code (the IL is read to see if it
can be inlined).

Fix-up Tables

The CLR requires more than just code to execute.  It must have
access to key data structures which describe things like Class and
Method layouts.  These are only known at runtime.  We want to
reduce overall writable pages in a process.  To accomplish this,
NGen stores a table of pointers to this data which will be allocated at
run time.  This allows NGen
to generate one version of the code that will work unmodified for all
processes, because there is a predictable location in the image where
you can find the pointer to the dynamically allocated data (essentially
a slot in this table).  However, this technique has the down side
of (1) slowing down startup to fill out the table, and (2) generating
sub-optimal code which must use a pointer indirection to get the data it
needs.  Finally, it also
means that we cannot simply persist the output of the JIT compiler
itself while it runs, because the actual code generated is different in
the two scenarios.
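The fix-up table idea can be sketched like this.  It is a model, not the real image layout: the slot names, the addresses, and the allocator are hypothetical, and a Python list stands in for the table of pointers.

```python
# Shared, read-only part of the "image": code only ever names a slot index.
SLOTS = {"String.MethodTable": 0, "Console.MethodTable": 1}  # hypothetical slot names

def shared_code(fixup_table):
    """Same code for every process: one extra indirection through the table."""
    return fixup_table[SLOTS["String.MethodTable"]]

def make_allocator(start):
    """Stand-in for the runtime allocating data structures at process-specific addresses."""
    next_address = start
    def allocate(_name):
        nonlocal next_address
        address = next_address
        next_address += 0x100
        return address
    return allocate

def fill_fixup_table(allocate):
    """Per-process startup work: allocate each structure and record its pointer."""
    table = [None] * len(SLOTS)
    for name, slot in SLOTS.items():
        table[slot] = allocate(name)
    return table

table_a = fill_fixup_table(make_allocator(0x00A00000))  # "process A"
table_b = fill_fixup_table(make_allocator(0x00B00000))  # "process B"
print(hex(shared_code(table_a)))  # prints 0xa00000
print(hex(shared_code(table_b)))  # prints 0xb00000
```

Both down sides show up even in the toy: fill_fixup_table has to run in every process before any code can execute, and shared_code pays for a table lookup that JIT’d code could skip by embedding the pointer directly.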

Even with some of the trade-offs
mentioned here, we’ve seen some remarkable performance
wins from this technique (and it only gets better each new release).  There are, however, some
things you need to consider before you jump on board the NGen bandwagon. 
We’ll cover those now.

Performance Win?

Measure, measure, measure.  You
should always verify that this is a win for you.  First, you should
be writing either a shared library (like the BCL itself) or a client
application that would really benefit from this kind of win.  You must
go try your app with and without NGen to make sure it is worth the
effort.  It may not always be.  For example in a Server scenario,
where the application runs a long time, you can amortize the
cost of jitting over the run of your server.  Combine that with
lack of sharing across AppDomains and NGen isn’t a win.
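If you don’t already have a harness for this, something like the following sketch would do.  The helper names are mine, and the command line in the comment is hypothetical: it assumes your app has some way to exit on its own once startup is complete.

```python
import statistics
import subprocess
import time

def time_runs(launch, runs=5):
    """Call launch() several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        launch()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def launcher(command):
    """Return a callable that runs command to completion."""
    return lambda: subprocess.run(command, check=True)

def improvement(baseline_seconds, candidate_seconds):
    """Fraction of the baseline time saved; positive means the candidate is faster."""
    return (baseline_seconds - candidate_seconds) / baseline_seconds

# Hypothetical usage: measure once before ngen and once after, then compare.
#   jit_time = time_runs(launcher(["MyApp.exe", "/exit-after-startup"]))
#   ngen_time = time_runs(launcher(["MyApp.exe", "/exit-after-startup"]))
#   print(improvement(jit_time, ngen_time))
print(improvement(2.0, 1.5))  # prints 0.25
```

The median is used so that one slow run does not skew the result.  Remember the warm startup effect from the paging primer: the first run after a reboot and the tenth run in a row are measuring different things, so compare like with like.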

To Cache or Not to Cache

Generating an NGen image takes time. 
You will be compiling all of your code at once into the final binary.  The larger the file, the longer this takes.  We
do this for the .NET Framework during installation, and you can
see the pause.  In our case it makes sense: all of your
applications will run that much faster because of this.  You should
decide if your application can handle this kind of wait.  You may
not want to do this for dynamic web content in a browser for example. 
Who wants to wait for the compile to finish for it to come up once? 
And will you ever run the same program as is again?

Brittleness

The MSDN documentation gives you the command line arguments and usage of
the tool (which comes with the distribution).  You should read very
carefully through the section on brittleness.  As an example, the
ngen’d image is tightly coupled to the version of the Framework you
compiled against.  If that version is serviced (we ship a Service
Pack for example), then your image
will not be loaded, and your application will automatically fall back to jitting. 
To be clear:  your code will still run, but it will not take
advantage of the speed improvements you measured.  This is
something we are spending a lot of time addressing in the
next version of the product.

The NGen To Do List

So you’ve decided to ngen your image.  Now
what?  This section contains some steps you should be taking:

Start with MSDN

Make sure you read all of the documentation on MSDN.  I will only pull out highlights here.

Picking a Base Address

Pick a good set of base addresses for
your PE files.  The NGen’d image will get placed right
behind your IL image in the process.  You need to allocate enough
space between your IL PE files for this image to be loaded.  The
general guideline is to allocate at least 3x the original size of the IL
image (so for example if your IL assembly is 1 MB, you should allocate a 3MB
total range for that assembly plus its ngen’d image). You
should take a look at the size of your NGen’d images and verify you have
enough space, not only for what you ship, but for some reasonable amount
of growth if you ship a bug fix release of the file.
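As a sketch of that guideline, the following lays out a set of assemblies back to back, reserving 3x the IL size for each.  The assembly names, sizes, and starting address are hypothetical.  The 64 KB rounding reflects the requirement that a PE image base be a multiple of 64 KB.

```python
ALIGNMENT = 0x10000  # PE image bases must be a multiple of 64 KB

def round_up(value, alignment=ALIGNMENT):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def assign_base_addresses(il_sizes, start, factor=3):
    """Give each assembly a base address with factor x its IL size reserved after it."""
    layout = {}
    base = round_up(start)
    for name, size in il_sizes.items():
        layout[name] = base
        base = round_up(base + factor * size)
    return layout

# Hypothetical assemblies: a 1 MB library and a 300 KB one.
il_sizes = {"Core.dll": 0x100000, "UI.dll": 0x4B000}
layout = assign_base_addresses(il_sizes, start=0x60000000)
print({name: hex(base) for name, base in layout.items()})
# prints {'Core.dll': '0x60000000', 'UI.dll': '0x60300000'}
```

If your measured NGen’d images turn out larger than the guideline suggests, or you expect the files to grow, raise factor rather than trusting the default.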

When to NGen?

You need to pick when you want to
invoke the tool.  For the distribution, we invoke ngen as a final step
during setup.  This is the best approach in most cases,
because your application will start fast from the first time it is run. 
However, this will consume space on the user’s machine, so if you think a
particular application, or component, that you ship may not be run often (or at all),
then you might consider deferring ngen to when the application starts
the first time.  For example, you could create a Windows scheduled
task to compile it at night after the first time the code is run.

Servicing

When you release bug fixes to
customers in your managed code, you will need to regenerate the ngen’d
images as well.  This is pretty simple to do, just run the ngen
command again.  But you need to make sure it is covered with the
setup/patching feature you are shipping.

Uninstall

Remember to use ngen /delete to
remove your unneeded assemblies from the cache when you uninstall your
application.  Currently the CLR will remove all assemblies tied to
a version of the framework on uninstall of the .NET FX, but it doesn’t
try to figure out when you’ve uninstalled just your application.

Servicing Hints

As mentioned above, there are brittleness
issues with ngen in V1.0 and V1.1 (aka Everett).  So you need to plan out
what you will do in the face of those things changing.  As an example, we
will release a service pack of the CLR at some point, and your cached ngen
images will no longer load.  Your code will still work, but it will run
under the jitter which will be slower (you did measure to verify you needed ngen,
right?).

Right now fixing this is tricky.  Expect
us to improve this situation in the future, but for
now, here are some ideas on how you can address this:

Setup/Patching

Make sure your setup and patching
programs are doing the right thing.  If you ship a fixed version of
your IL code, you need to re-run ngen on those files for it to be up to
date.

Poor Man’s Service

You can periodically
run a scheduled task to check your images and re-ngen them as required. 
If you already have some kind of nightly enterprise script running on
client machines, as an example, this would be a fine time to do
maintenance.  Note:  if your images are already up to date,
the NGen tool will simply report that and exit instead of doing a lot of
unnecessary work.

Rocket Science

If you are really motivated, you could go find the
list of natively loaded PE files in your process (use the Win32 PSAPI
API or walk the PEB) to see if your NGen’d image was actually loaded in
the process.  If it wasn’t, most likely it means you need to fix it
up, and your app could do so for the next run.  I might prototype
this at some point, but suffice to say it isn’t a trivial thing to do.
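The decision half of the Rocket Science idea might look like the sketch below.  It assumes you have already collected the module paths (for example with the PSAPI functions EnumProcessModules and GetModuleFileNameEx).  It also assumes, and this is only an assumption, that the native image keeps the assembly’s file name and lives under a cache directory whose name contains "NativeImages".  Check where the images actually land on your machines before relying on either.

```python
import ntpath

def native_image_loaded(module_paths, assembly_file, cache_marker="NativeImages"):
    """Return True if a module named assembly_file was loaded from the native image cache.

    cache_marker is an assumed substring of the cache directory name.
    """
    target = assembly_file.lower()
    marker = cache_marker.lower()
    for path in module_paths:
        directory, filename = ntpath.split(path)
        if filename.lower() == target and marker in directory.lower():
            return True
    return False

# Hypothetical module lists for two runs of the same app.
with_image = [
    r"C:\Program Files\MyApp\MyLib.dll",
    r"C:\WINDOWS\assembly\NativeImages1_v1.1.4322\MyLib\1.0.0.0__abc\MyLib.dll",
]
jit_only = [r"C:\Program Files\MyApp\MyLib.dll"]

print(native_image_loaded(with_image, "MyLib.dll"))  # prints True
print(native_image_loaded(jit_only, "MyLib.dll"))    # prints False
```

When the check comes back False, the app could queue a re-ngen for the next run, as described above.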

Future Directions

At this point you’ve probably looked through the list of
Servicing Hints and thought to yourself: 
"Wow that’s kinda ugly!"  And you’re right.  NGen for Version 1.0 and
1.1 was primarily designed and engineered for internal use by the CLR itself. 
When we install SP’s of our stuff, we force a re-ngen of all of the core
components, which keeps that part of your app running fast.

Going forward, NGen is still a
key foundation for our performance story.  It gives you the working set wins (better page
sharing, quicker loading) that are required for starting your application
faster.  It also allows for more aggressive optimizations in the compiler. 
If we tried doing really aggressive optimizations every time you ran the JIT,
you’d actually run slower just waiting for the compiler to finish. 

Expect in the future that we will be addressing the
clumsiness and the servicing issues so your life is easier.  Here are just a
few things we’re thinking about:

ngen /repair

We’ll be
talking about a feature called "ngen /repair" at the October 2003 PDC
in LA next month which dramatically simplifies fixing up the cached
images.

New APIs

There are some cleaner ways we could expose the fact that your
application is out of date.  It would make it simpler to write your
app if it could query this state, or force it to correct automatically. 
We are considering these designs now.

Double Loads

As mentioned above, the current CLR loads both the IL image and your NGen
image in the process.  This double loading is inefficient
because it makes the OS loader do more work (slower startup time). 
Look for us to try to avoid this in the future.

Indirections

As mentioned above, NGen images still contain a lot of fix-up tables for
dynamic data structures.  This causes the startup to be slower
(while those tables are fixed up) and generates sub-optimal code which
must use the indirection of the table to get at the data.  Look for
us to get more aggressive and avoid a lot of this.

And finally in closing, make sure to
re-read those key takeaways.

More Information

There are some important links you may be interested in
reading:

  • MSDN Documentation
  • Gregor’s Perf Talk
  • Jan’s Perf Talk
  • Rico’s Perf Talk

Comments (18)

  1. Andy says:

    Very informative article. I never realized that there was so much to NGen. I just thought that it was doing the same thing as the CLR when it JITs, only not at the app’s runtime.

    I look forward to more articles like this from you. This was a very good post. 🙂

  2. Thanks Jason that was a great post 🙂 Would love to see some posts on the Rotor JIT, how it compares to the standard JIT and what issues are involved with the JIT when porting the PAL.

  3. Jason says:

    Good suggestion, we should definately be able to provide that kind of content. thanks!

  4. Good stuff ! Would love to see articles and posts like this, maybe these could be hosted on http://www.sscli.net (I do think it should have a Rotor RSS aggregator). Along with a few others I am currently doing some research on a Rotor port (hench my interest 🙂

  5. Congratulations for this great article. There are loads of good information here 🙂
    You mentioned at one point that earlier you compiled the Word in IL but it was slow. Did you also ngened it? If yes how big was the performance gain?

    Thanks!

  6. Jason Zander says:

    Good question Krisztian. At the time we did this experiment in V1.0, NGen didn’t actually exist (we were designing and writing the tool then). So we did not do that experiment, and have not since tried recompiling MS Word. We have, of course, tried numerous other applications to gain our data. If we do try it again, I’ll post results here for folks to look at.

  7. Randy says:

    I wish that MSDN pointed to this article when they cite that "pre-compiling a Windows Forms application" can lead to faster startup times. They just mention that it can help without any of the considerations that I found here… (I’ve since taken it out of my installation procedures, even though it did offer faster start time for my applications – it didn’t seem worth the risk to an upgrade path to future versions of the Framework)
