This topic is worth discussing in some detail because it pretty common to apply formulas outside the assumptions implicit in them, and simply get wrong conclusions. The most fundamental part of the formula is
- Cold Startup = WarmStartup + DiskIOTime
Technically this is only accurate if WarmStarup time and DiskIOTime don’t overlap in time. For example, if DiskIOTime and WarmStartup time could be done completely concurrently then ColdStartup time would be the maximum, not the sum of the two times, which could be quite different. So the question is how much overlap is there between the disk and the CPU?
Well in one very common case the answer is simple. If the application only has one thread of execution (eg. you did not use System.Thread, or System.Threadpool in your startup path), and you don’t do any significant explicit file I/O in your application on startup, then the times will be additive (the formula above is correct). In this very common scenario, the vast majority of disk time is time spent fetching the program itself from the disk. The operating system does this by doing what is called ‘demand paging’. The idea is that in one operation (the API is called LoadLibrary), the entire file is read into memory, however all the data is not actually read in at that time, but is done ‘lazily’ on demand. Basically whenever the CPU tries to access memory that was supposed to be read in, but is not actually been read in, the OS takes a ‘hard page fault’, stops what it is doing and fetches the needed page from disk (on x86 this happens pages are 4096 bytes).
The upshot of this is that during cold startup the CPU is constantly trying to access some new part of the program which needs to be paged in (it takes a hard page fault). Each time it has to stop, and get data from the disk. Since it really needs that data (typically the next instruction to executed) to proceed, so it simply has to wait until the I/O is done. Thus there is no overlap between the CPU and disk I/O for that thread of execution. Hard page faults typically take 4-8msec to service, which translate into 8 to 16 million instructions on a 2Ghz machine (so a hard page fault is quite significant), which h is why cold startup times can easily be many seconds long.
It is possible that if there were multiple threads of execution during startup, that while one thread is waiting for I/O to complete it can be doing something useful on another thread (thus CPU and I/O can overlap), so this may invalidate the formula above, however as we will see, the even in this case the windows prefecher keeps the formula working.
Even single threaded programs might be able to overlap I/O and CPU if they do a lot of explicit, sequential file reads during startup. The reason is that when a Stream.Read is executed the operating system knows that the program is very likely to want to continue reading even after it has return the data that is actually requested. Thus it issues the disk requests to prefech the next chunks of data from the disk and allows the disk to fetch them while the program operates on the last chunk that it was given. If the program does a lot of computation and explicit sequential reads, then the CPU tends to completely overlap the I/O and the formula is wrong (it is too pessimistic). This can happen for example in compilers, where allot of explicit reads as done as well as lot of computation (although even there, the bulk of the computation happens later, after all the data is read in, and thus does not overlap with I/O).
Finally there is one more interesting case of overlap. In the formula above all disk I/O is assumed to not overlap with itself. This is correct if there is only one disk involved, but if multiple disks are involed obviously overlap is likely to happen. Thus the formula is for the 1-disk case and is only an upper bound for the mulit-disk case. In practice, few applications have control over the hardware configuration of the system, and in fact most client machines have only one disk, so the formula covers what is likely to be the most interesting case, but the assumption should be called out.
The upshot is that the formula is valid for programs single threaded programs that don’t do allot of sequential file I/O. It is an upper bound for all programs, and frankly, tends to be pretty accurate even in those cases, because events tend to conspire to prevent CPU and I/O from overlapping effectively.
In my next blog I will talk about the windows prefetcher, which is some very cool technology that has been in the OS since windows XP, and happens to make the cold startup formula accurate for pretty much any program.