The NT DLL Loader: basic operation

Let's start simply and consider the mythical vertical application (a topic itself for another day).  I'm not going to walk through what a PE is or a DLL or an EXE; if you don't know, follow the link or take a gander around MSDN.

Let's call it ccalc.exe (console calculator: no funky graphics, etc.; it just uses stdin/stdout).  To keep things simple, it only links against the old C runtime library, msvcrt.dll.  msvcrt.dll imports functions from kernel32.dll, and kernel32.dll imports functions from ntdll.dll.

When the process initializes, the loader works out this import graph from the static imports in the PE headers and, since code is still sequential, sets up a DLL initialization list, which is just a list of DLLs to try to initialize, in order.  The initialization is depth first, but other than that it's nothing particularly surprising.  For this example, the init order would be:

    ntdll.dll
    kernel32.dll
    msvcrt.dll

Not exactly rocket science.  ccalc.exe doesn't have an initializer; it has an entry point, where its initialization is done prior to passing control to whatever you like to call your main function.  (If you don't use the C runtime, you can tell the linker to use your own function as the entry point.  If you do use C runtime functions, the C runtime needs to get control first so that it can set up its heap, set up the standard streams and do various other things, like setting up the floating-point control registers.  Then it calls your main() or wmain() function, depending on which C runtime init function you set as the entry point.)
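The depth-first walk over the static imports can be modeled in a few lines.  This is only a toy sketch in Python, not the loader's actual code: the function name `init_order` and the dictionary layout are assumptions made up for illustration, and the import graph is just the ccalc.exe example from above.

```python
def init_order(imports, root):
    """Return DLL names in a depth-first initialization order.

    `imports` maps a module name to its static imports, in import-table
    order (a hypothetical stand-in for what the PE headers describe).
    The root EXE is left out: it has an entry point, not an initializer.
    """
    seen = {root}
    order = []

    def walk(module):
        for dep in imports.get(module, []):
            if dep in seen:
                continue        # already in the loader's tables
            seen.add(dep)       # record on discovery, before walking deeper
            walk(dep)
            order.append(dep)   # appended only after everything it imports

    walk(root)
    return order


imports = {
    "ccalc.exe": ["msvcrt.dll"],
    "msvcrt.dll": ["kernel32.dll"],
    "kernel32.dll": ["ntdll.dll"],
}

print(init_order(imports, "ccalc.exe"))
# ['ntdll.dll', 'kernel32.dll', 'msvcrt.dll']
```

The deepest DLL in the chain comes out first, which is all that "depth first" means here.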

Now let's assume you have two teams working on the calculator - one on the "UI" (command line parsing) and another on the logic.  Maybe the logic folks want to have a separate DLL.  Let's assume that they do and it's called calclogic.dll.  This isn't necessarily a good way to do it but let's say that the Super Genius command line parser folks want to use LoadLibrary()/GetProcAddress() to figure out if a function typed in can be used by the calculator logic component.  (Insert your favorite reason to use DLLs here; there are a million good ones and two million bad ones.)

But the calclogic.dll SubGeniuses cleverly didn't do all the work themselves.  They use bignum.dll to handle all that fancy pants math stuff.  Nice job!

So, ccalc.exe calls LoadLibrary() on calclogic.dll.  Now what happens?

bignum.dll is presumably a bignum package, and it needs a heap since the representations are variable length.  Let's assume that it uses metaheap.dll, which they paid a lot of money for but which ends up really just being a big fat wrapper around GlobalAlloc in kernel32.dll.

When the loader sees the load come in for calclogic.dll, it looks at what it imports and figures out which of those imports aren't already loaded into the process.  So it looks at calclogic.dll and discovers bignum.dll, and then metaheap.dll and then kernel32.dll and then... hey, wait, kernel32.dll's already in the loader's database so we can stop.  After processing the static imports, it appends the new entries to the global initialization list and keeps track of the new entries.  To wit, the global init-order list looks like:

    ntdll.dll
    kernel32.dll
    msvcrt.dll
    metaheap.dll
    bignum.dll
    calclogic.dll

and the init list that it has to process before LoadLibrary() returns is:

    metaheap.dll
    bignum.dll
    calclogic.dll

No surprises yet, eh?
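The dynamic case is the same walk, except that anything already in the loader's database stops the recursion, and only the new entries need initializing before LoadLibrary() returns.  A toy model, with the caveat that `ToyLoader`, its fields and its `load` method are invented for illustration and say nothing about how the real loader is structured:

```python
class ToyLoader:
    """Toy model of the bookkeeping described above; not the real loader."""

    def __init__(self, imports):
        self.imports = imports    # module -> static imports, in table order
        self.loaded = set()       # stand-in for the loader's database
        self.global_init = []     # the global init-order list

    def load(self, name):
        """Model process start or LoadLibrary(); return the new DLLs in init order."""
        new = []
        self._walk(name, new)
        self.global_init.extend(new)   # new entries are appended at the end
        return new

    def _walk(self, module, new):
        if module in self.loaded:
            return                     # already in the database, so stop here
        self.loaded.add(module)
        for dep in self.imports.get(module, []):
            self._walk(dep, new)
        if not module.endswith(".exe"):
            new.append(module)         # the EXE has no initializer


imports = {
    "ccalc.exe": ["msvcrt.dll"],
    "msvcrt.dll": ["kernel32.dll"],
    "kernel32.dll": ["ntdll.dll"],
    "calclogic.dll": ["bignum.dll"],
    "bignum.dll": ["metaheap.dll"],
    "metaheap.dll": ["kernel32.dll"],
}

loader = ToyLoader(imports)
loader.load("ccalc.exe")                 # process start
pending = loader.load("calclogic.dll")   # the LoadLibrary() call

print(pending)
# ['metaheap.dll', 'bignum.dll', 'calclogic.dll']
print(loader.global_init)
# ['ntdll.dll', 'kernel32.dll', 'msvcrt.dll', 'metaheap.dll', 'bignum.dll', 'calclogic.dll']
```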

Now let's add the first twist.  There are cycles in the dependency graphs of many DLLs.  Why?  Well, because there are.  Two teams or two companies were independently working on something which maybe was conceptually part of the same thing but they had to build, test and deliver the two things independently.  Component granularity is surprisingly hard to get right and very rarely has to do with what's best for the components.  Instead it's a combination of what's best for the teams (teams don't like to have multiple DLLs and teams want autonomy w.r.t. a single DLL so that their testers can feel comfortable about what it means to test the component) and what's best for the perf team (one DLL per function would destroy system performance; in a big system, if 10% of the teams "just add one more DLL" for each release, you end up with dozens and dozens over time for no apparent reason other than it was easier than modifying an existing DLL).

What does the loader do when it finds these?  Not much.  I already described the algorithm above, when the dynamic load case reached kernel32.dll again: a DLL that is already in the loader's tables is not walked a second time.  (It's a little more complicated because of the interaction of refcounting and cycles but I'll make a separate post on that.)

If A depends on B and B depends on A, the initialization order depends on which one you found first.  If you loaded A first, it would be added to the tables and then B would be found and added to the loader's tables.  Then when walking B's imports, it would find A again and say that it's over and done with.  Thus if you loaded A, B would be initialized first since it would appear to be deeper in the graph than A.  Similarly, loading B first will result in A being initialized first.
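To see how the starting point decides the outcome for a cycle, here is a toy sketch.  The function name `load_order` and the DLL names are made up, and the only rule it models is "a module already in the tables is not walked again":

```python
def load_order(imports, first):
    """Toy model: initialization order when `first` is the DLL being loaded."""
    seen, order = set(), []

    def walk(module):
        if module in seen:
            return              # found again via the cycle: over and done with
        seen.add(module)        # added to the tables before its imports are walked
        for dep in imports.get(module, []):
            walk(dep)
        order.append(module)    # deeper modules land earlier in the list

    walk(first)
    return order


cycle = {"a.dll": ["b.dll"], "b.dll": ["a.dll"]}

print(load_order(cycle, "a.dll"))   # ['b.dll', 'a.dll']
print(load_order(cycle, "b.dll"))   # ['a.dll', 'b.dll']
```

Neither order is "right" for a cycle; the walk simply has to break it somewhere, and it breaks it at whichever DLL it saw first.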

The implementation is linear/sequential.  Therefore even the order of the imports in your static import tables matters.  I can't guarantee which order is used but assume you have DLL c.dll which links against both a.dll and b.dll.  If a.dll happens to be in the import tables first (assuming that the imports are walked in order), it will be added first and then b would be discovered and we'd decide that b was deeper.  If the linker for some reason reverses the order of the static imports, you'll see the opposite.

Now let's say you're ccalc.exe and you link against c.dll.  The init order might go B -> A -> C.  But now let's say that ccalc decides to link against b.dll itself.  Now maybe the init order is A -> C -> B!
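A toy model makes the sensitivity easy to play with.  The sketch below assumes that a.dll and b.dll still import each other as in the earlier cycle example, that imports are walked in table order, and that a module already in the tables is never walked again; `init_order` and the graph layout are invented for illustration and are not the loader's real algorithm.

```python
def init_order(imports, root):
    """Toy depth-first model; the root EXE itself is not in the result."""
    seen, order = set(), []

    def walk(module):
        if module in seen:
            return                            # already in the tables
        seen.add(module)
        for dep in imports.get(module, []):   # import-table order matters here
            walk(dep)
        if module != root:
            order.append(module)

    walk(root)
    return order


# Assumption: a.dll and b.dll import each other.
cycle = {"a.dll": ["b.dll"], "b.dll": ["a.dll"]}

a_listed_first = init_order(
    {**cycle, "c.dll": ["a.dll", "b.dll"], "ccalc.exe": ["c.dll"]}, "ccalc.exe")
b_listed_first = init_order(
    {**cycle, "c.dll": ["b.dll", "a.dll"], "ccalc.exe": ["c.dll"]}, "ccalc.exe")
exe_links_b = init_order(
    {**cycle, "c.dll": ["a.dll", "b.dll"], "ccalc.exe": ["b.dll", "c.dll"]}, "ccalc.exe")

print(a_listed_first)   # ['b.dll', 'a.dll', 'c.dll']
print(b_listed_first)   # ['a.dll', 'b.dll', 'c.dll']
print(exe_links_b)      # ['a.dll', 'b.dll', 'c.dll']
```

The concrete orders are an artifact of this particular model, and a real loader may well produce different ones.  The point is that a.dll and b.dll swap places even though neither of them changed.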

This leads to the first hazards in DLL initialization order:

The way that cycles are resolved into an initialization order is stable for any given graph, but if the graph changes for some reason, the initialization order may change.

If A calls into B during its DLL_PROCESS_ATTACH, this may have worked for years and years, but the addition or removal of an import by some other DLL in the graph can suddenly change the initialization order and break it.
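The second hazard can be simulated without any real DLLs.  In this toy sketch every name is hypothetical: `a_attach` and `b_attach` stand in for the DLL_PROCESS_ATTACH handling of a.dll and b.dll, and `b_exported_function` stands in for whatever A calls in B.

```python
def simulate(order):
    """Run toy attach handlers in `order`; return None on success or the failure text."""
    b_ready = False   # stands in for private state b.dll sets up in its attach

    def b_attach():
        nonlocal b_ready
        b_ready = True

    def b_exported_function():
        if not b_ready:
            raise RuntimeError("b.dll called before it was initialized")

    def a_attach():
        b_exported_function()   # a.dll calls into b.dll during its own attach

    handlers = {"a.dll": a_attach, "b.dll": b_attach}
    try:
        for name in order:
            handlers[name]()
    except RuntimeError as err:
        return str(err)
    return None


print(simulate(["b.dll", "a.dll"]))   # None
print(simulate(["a.dll", "b.dll"]))   # b.dll called before it was initialized
```

Nothing in a.dll or b.dll differs between the two runs; only the order handed to them does, which is exactly why these bugs show up years after the code was written.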

My team debugged so many of these during Windows XP it wasn't funny.  Sure, we were the obvious targets: who are those crazy folks over there changing the DLL loader?!? Are they mad?  They must be the reason that my code, which has worked for 10+ years, is suddenly broken!  But we got pretty good at recognizing the defective patterns.  (The changes themselves weren't defective - anyone adding or, heaven forbid, actually reducing/removing dependencies would cause these effects.)

I even have recommendations about what (not!) to do during DLL initialization to avoid these problems yourself.  They'll be at the end of the series and I can say right now that I don't think you're going to like them one bit.

Comments (6)

  1. I don’t normally post on weekends, but I just noticed that Michael Grier’s finally started posting his…

  2. Amit says:

    This is very interesting topic, good to see this series coming out. Previously an article on MSDN Magazine, "What Goes On Inside Windows 2000: Solving the Mysteries of the Loader" was very helpful in understanding the guts of the loader.

    What I would also like to see is the detailed executable loading process. The book ‘Windows Internals’ covers parts of it in the section of process startup. The part that is particularly interesting is to understand the details of how/when:

    1. The executable image is mapped

    2. Ntdll is mapped

    3. The initial APC in which some mapping occurs

    and so on…

  3. MGrier says:

    I actually don’t plan to get into process initialization too much, for several reasons:

    1. You can actually figure most of it yourself by using CreateProcess() but creating the process with the initial thread suspended. You can use a usermode debugger then and look at the thread state and see what’s up.

    1a. I usually get some part of it wrong. I myself get confused about what’s done based on the code inside user mode CreateProcess() establishing the frame at the base of the initial thread’s stack vs. the queued APC. I feel confident that I’ll get it wrong. 🙂

    2. Once again, I’m walking a thin line about documenting an implementation’s visible side effects vs. documenting a contract. I’ll certainly give hints about how you can determine the implementation’s visible possibly unintended behaviors but I don’t want to be the one who "documents" how it works and is seen as taking away flexibility. Maybe I’m a chicken here (no offense to chickens!) but I also have no desire to be the first example of someone who described too much technical detail and suffered for it.

    Really, you can figure most of it out yourself using documented techniques and I /can/ help you out with that.

  4. mark lucovsky says:

    great post. you did a good job summarizing this.

    don’t be afraid of getting the initial apc and startup wrong. if you read the code you will understand. it is far simpler than the loader. trust me.

