Cold Startup Performance


I received several requests to write a little something on using managed code in a cold startup case – by which I mean immediately, or at least soon, after a reboot.  I guess before I get into that I should give my usual disclaimer that I’m not going to try to be perfectly correct in my exposition in the name of being remotely brief.

So here goes.

It’s often said that in the performance world “space is king” – meaning that if you make your code small it will naturally tend to be fast because of locality improvements and whatnot.  However certain or doubtful you may be of this “fact” in normal “warm” cases, it is surely true in the cold cases.

In cold startup it is not likely that any of the CLR components are yet in the operating system’s disk cache – you’re going to be doing real I/O to bring those in.  Some of that will be batched up nicely by the operating system but once some core set of pages is loaded the rest will start to fault in.  Those faults will be “hard faults”, meaning you’re going to go to the disk.

This is a peculiar situation because in such a world, processing costs tend to almost vanish in comparison to all of the disk i/o you will be doing.  Even seemingly daunting tasks like jitting up a goodly bit of code may be overwhelmed by the disk i/o.

It’s a whole new ball-game.

The best way to get a handle on your costs then is to switch from measuring processor time, or even wall-clock time, to measuring the i/o you are doing.  Which pages do you need to bring into memory and from which dlls will they come?  What other files will you read?  All of the I/O for those files will also cost you.
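
If you want a rough feel for this from inside a test harness, one crude approach is to watch the system’s hard-fault activity while your scenario runs.  The snippet below is only a sketch, assuming the standard “Memory / Pages Input/sec” performance counter and the System.Diagnostics.PerformanceCounter class; for real work you’d want a proper trace of exactly which pages and files are touched.

    using System;
    using System.Diagnostics;
    using System.Threading;

    class HardFaultWatcher
    {
        static void Main()
        {
            // "Pages Input/sec" counts pages read from disk to satisfy hard faults,
            // system wide.  It's a rough proxy for how much cold-start i/o is happening.
            PerformanceCounter pagesIn = new PerformanceCounter("Memory", "Pages Input/sec");

            pagesIn.NextValue();           // the first sample just primes the counter

            // ... run (or launch) the startup scenario you care about here ...

            Thread.Sleep(1000);            // give the counter an interval to average over
            Console.WriteLine("Pages Input/sec: {0}", pagesIn.NextValue());
        }
    }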

Seemingly innocuous reads of configuration files can move the disk head around, slowing down other reads and potentially dragging in a lot of parsing code (more reads).  Remember that in cold start i/o is at a premium so anything you can defer until after startup, when the disk has otherwise settled, is a great idea – give the disk scheduler every chance to get the right pages in the right order.  If you defer the initialization of some subsystems you save directly by not touching the code and indirectly by not looking at any registry entries it might need (that’s I/O too).
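
To make the “defer what you can” advice concrete, here is a minimal sketch of pushing a configuration read out of the startup path.  The class name, file name, and XML layout are all made up for illustration; the point is simply that the parse (and the parsing code it drags in) doesn’t happen until somebody actually asks for a setting.

    using System.Xml;

    class Settings
    {
        private static XmlDocument config;    // nothing loaded at startup

        public static string Get(string key)
        {
            // The first caller pays for the file read and for faulting in the
            // XML parsing code; cold startup never touches either one.
            // (Not thread-safe; this is just a sketch.)
            if (config == null)
            {
                config = new XmlDocument();
                config.Load("app.settings.xml");    // hypothetical file name
            }
            XmlNode node = config.SelectSingleNode("/settings/" + key);
            return node == null ? null : node.InnerText;
        }
    }

The first call to Settings.Get pays the cost after the disk has settled down; everything before that stays off the cold path.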

Now let me get back to the JIT phenomenon.  In warm startup cases, if you are loading from ngen’d images those would be coming from the disk cache, and the cost of loading those pages into your process is fairly low compared to jitting.  Furthermore many of those pages can be shared across processes so we like to encourage putting sharable code into ngen’d images – jitted code can’t be shared.  In cold startup things are different.  The IL is smaller than the native code so it may in fact be cheaper to load the IL and JIT it than it would be to do disk i/o for the prejitted code.  Of course you probably don’t want to do this for code that is likely to be needed in other processes (like mscorlib) because the cost of the loading is amortized.  But if your process has a significant amount of application specific code and cold startup is paramount to you then jitting may be more attractive than it first seems.
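
If you want to try both configurations for your own assemblies, flipping between the jitted and prejitted worlds is just an ngen decision at install time.  The commands below assume the Whidbey-style ngen syntax and a made-up assembly name; measure your cold startup in each configuration before committing to either.

    rem prejit an application-specific assembly, creating a native image for it
    ngen install MyApp.exe

    rem remove the native image again so the IL gets jitted at run time
    ngen uninstall MyApp.exe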

Of course… you have to measure to be sure of anything.

Recap:

  • Cold startup time will be dominated by disk i/o
  • Consider the size of code you are loading – defer what you can
  • Consider initialization files and the code to parse them – defer what you can
  • Consider collateral operating system resources like the registry – avoid what you can

Managed code startup tuning isn’t really all that different from unmanaged tuning – the real issue is that in managed code everything is easier, even dragging in some huge DLL with a couple lines of C#.  So be careful out there.
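
As a tiny illustration of just how easy that dragging-in is, consider the sketch below: one innocent-looking line is enough to force an extra framework DLL into the process, which in the cold case means real disk i/o whether or not the object is ever put to serious use.  (The DataSet here is just a stand-in example.)

    using System;
    using System.Data;

    class Demo
    {
        static void Main()
        {
            // This single line pulls System.Data.dll into the process; in a cold
            // start that's genuine disk i/o, even if the DataSet is never filled.
            DataSet ds = new DataSet();
            Console.WriteLine(ds.DataSetName);
        }
    }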

See also: http://blogs.msdn.com/ricom/archive/2004/04/22/118422.aspx for some general Performance Planning tips.

Update: As it happens a colleague of mine just did some experiments with a test application using a recent build of Whidbey and some of the Avalon dlls showing some of the kinds of effects you could see if you choose to ngen more or less.  And yes I thanked him profusely 🙂

Environment: These tests were run on a 1GHz PIII with 512MB of RAM on XP/SP2. The machine was off the network, anti-virus software was turned off, and some other services that seemed unnecessary were also turned off.  This is a fairly clean situation.

Scenario: The test application loads three FX assemblies (mscorlib, system, and system.xml), five Avalon assemblies and two application specific assemblies.

Scenario                   Average Time   Methods Jitted   Samples
1. Nothing ngen’d          43.302s        6279             44.183, 43.262, 42.461
2. Only mscorlib ngen’d    48.356s        4882             48.559, 48.199, 48.309
3. All three FX ngen’d     43.733s        4646             43.292, 44.063, 43.843
4. Everything ngen’d       29.174s        0                31.124, 28.350, 28.050

This data indicates ngening can hurt or help cold startup depending on the particular scenario so it’s rather hard to give general guidance — you’ll have to measure your own specific scenario. Notice how ngening mscorlib alone actually hurts cold startup and ngening all three fx assemblies seems to be the break-even point for this scenario, after which things start to improve.  Even armed with this data though, the problem is more subtle because of course there is generally not just one thing happening at startup and you might want to look ahead to what comes after — preloading some of the framework will have collateral benefits for things that may run shortly afterwards (or not so shortly) and those effects should not be underestimated.

And don’t forget… the warm startup case will have totally different dynamics…

Putting me further in debt, here’s another update with the equivalent warm numbers (same setup): 

Scenario (WARM)            Average Time   Methods Jitted   Samples
1. Nothing ngen’d          13.071s        6279             13.098, 13.048, 13.068
2. Only mscorlib ngen’d    11.546s        4882             11.546, 11.536, 11.556
3. All three FX ngen’d     10.735s        4646             10.725, 10.745, 10.735
4. Everything ngen’d       2.543s         0                2.553, 2.553, 2.543

Big difference…

 

Comments (23)

  1. Matt says:

    Wouldn’t registry access be fairly low cost (at least in comparison to config files) given it is widely used by the OS and other processes? Surely a substantial part of the registry would be cached by the time any managed code is started?

  2. Rico Mariani says:

    Maybe, maybe not. We are talking about cold startup after all so presumably not too much has run. And not all of the registry is created equal — the portions you need may or may not be the same as what the operating system has already accessed.

  3. Saurabh Jain says:

    Does the sharing on ngen’d images happen across concurrent login sessions? Consider the case where a school has a server on which various students logon during a lab. Does it make sense to ngen everything to get maximum sharing and reduce memory use?

  4. Rico Mariani says:

    The native image cache is machine-wide so yes multiple users can share pages from the same images.

  5. Rico Mariani says:

    Updated the main body with some data from a colleague.

  6. Saurabh Jain says:

    Thanks Rico. The follow up question is, does the sharing of ngen images happen if the same dll (strongly signed) is installed at two different locations.

  7. The data you give is not encouraging. How big were these dlls being loaded? Is it proportional to size, generally speaking? What is the rough delta for a 350K assembly?

    Thanks.

  8. Rico Mariani says:

    Hmmm… I was going to say "no" off hand but then it hit me that it’s more subtle than that. If the strong name is the same then you’d expect the native image to land in the same place in the native image cache — it wouldn’t matter which flavor of IL you tried to load, both would use the same native bits so there would be sharing. So I double checked that last assumption and sure enough that is in fact the case — the same native bits are used even if you get the (same) IL from different locations.

  9. Rico Mariani says:

    Frank: The cost isn’t going to be proportional to size necessarily although size will be a leading indicator. It’s a function of how much of the DLL you touch at startup and how you touch it. If you dance all over the place looking at a function here and a function there then you will drag in a lot of pages, most of which will be wasted because you used only a small fraction. If on the other hand you are careful about which methods you touch during startup then you will find that you can use a much smaller slice — it’s more about density than it is about size.

    This isn’t really a new problem; cold startup has been touchy like this for as long as I’ve been programming. We used to do swap tuning of linker overlays in the bad old days, it was really the exact same issue.

  10. Thanks. I asked because the numbers you give are so large — I did not understand how it could be so slow, unless the dlls are huge. 350K is the size of a dll we produce, which is dependent only on System, System.Windows.Forms, and System.Drawing. I am not sure of the cold start, but the warm start seems instantaneous with our test apps. We cannot test without the ngen on the MS libs.

  11. Saurabh said:

    "The follow up question is, does the sharing of ngen images happen if the same dll (strongly signed) is installed at two different locations."

    A better question would be: Why do you want to put the same dll in different locations? Why don’t you put it in the GAC?
