The windows prefetcher

In my last blog I talked about some of the conditions than need to hold for the cold startup formula

ColdStartupTimeMSec = WarmStartupTimeMsec + 4 * NumberOfReads + 20 * NumberMBytes

To be accurate.  I mentioned that if you have overlap between the CPU and the disk then it may not be accurate (although it would be an upper bound).   It turns out that there is a feature of the OS that will actually makes the cold startup formula work all the time, even in those cases I call out above where it might not work.  That feature is the OS prefetcher and is in windows XP and beyond.

If you look at the I/O caused by running a program at cold start, you will find that the disk tends to be accessed randomly.   You need a page here, then a page in some other part of the file, as the CPU executes code from all over the place.  This causes literally thousands of individual Disk accesses, and as we have seen, each disk seek costs us 4msec, so seek time dominates, and adds up to seconds of startup time. 

The observation is that this 'lazy loading' of data from the disk is actually very inefficient because it loads the data in small pages and each page costs something like 4msec to move the disk head.  It would be much better if you could read the data in big chunks, even if you end up reading in data that you did not end up using. 

The prefetcher fixes uses this observation to speed up cold startup. On EVERY program load, the OS keeps track of what data that program needed to be paged in (even if they were not 'hard' page faults).   This data is persisted in binary form in a  *.PF file in c:\windows\prefetch (go look at that directory).    When a process is created, the first thing that is done is that this *.PF file is consulted.   It then issues ALL the file I/O specified in this file, at once and blocks the program until it competes.   These I/O requests are nicely sorted and combined.  If several I/O are adjacent (or even close to one another), a large combined I/O is issued instead (which avoids a seek), the I/O are also nicely ordered so that the disk head moves the smallest possible distance between reads.  This makes the disk I/O as efficient as it can be, and because disk I/O is the slow part of cold startup time, it leads to dramatic improvements in cold startup time (often the prefetcher cuts cold startup time in half). 

Note that the prefecher does not allow the program to begin until all the I/O is complete.  I am speculating they did this because they did not want random I/Os from the executing program to interfere with the beautify optimized disk head movement.  However this has a unfortunate consequence.  What it means is that things like splash screens (which get put up early to alert the user that something is indeed happening), do not get up as quickly as you would like (because NO code in the process runs until ALL the I/O completes).  For this reason, the prefetcher has a limit on the amount of data it will prefetch (on XP this is 16MB).  This keeps the maximum pause caused by the prefetcher to a tolerable amount (By our formula above 16 MB * 20 msec /MB = 320msec + probably about 100 seeks or so which is still < 1 sec). 

One side effect of the prefetcher is that for programs under its data size limit, the there can be NO overlap between CPU and Disk I/O during startup (the program is not allowed to run during the initial priming phase).   So this is yet another reason that tends to keep the formula accurate.

 Finally I just wanted to close by noting that the formula's usefulness is not so much that it is accurate in all cases, but rather that it is accurate enough and very simple, and thus can be used to reason about what can be done to improve cold startup.   The formula states that you can attack cold startup by

  • Attacking warm startup (and parallelizing CPU work is one way do doing this),
  • Attacking the amount of data read of the disk (minimizing the amount of data needed to startup is the way to do this)
  • Attacking the number of seeks (packing data so that it is not scattered about within a DLL does this)

 It attaches real values to each of these components so that for a given scenario you can know how each component is contributing and thus which is the most profitable component to attack.   This is its real value.