Why should I even bother to use DLL's in my system?

At the end of this blog entry, I mentioned that when I drop a new version of winmm.dll on my machine, I need to reboot it. Cesar Eduardo Barros asked:

Why do you have to reboot? Can't you just reopen the application that's using the dll, or restart the service that's using it?

It turns out that in my case, it’s because winmm’s listed in the “KnownDLLs” for Longhorn. And Windows treats “KnownDLLs” as special – if a DLL is a “KnownDLL” then it’s assumed to be used by lots of processes, and it’s not reloaded from the disk when a new process is created – instead the pages from the existing DLL are just remapped into the current process.

But that and a discussion on an internal alias got me to thinking about DLL’s in general. This also came up during my previous discussion about the DLL C runtime library.

At some point in the life of a system, you decide that you’ve got a bunch of code that’s being used in common between the various programs that make up the system. 

Maybe that code’s only used in a single app – one app, 50 instances.

Maybe that code’s used in 50 different apps – 50 apps, one instance.

In the first case, it really doesn’t matter if you refactor the code into a separate library or not. You’ll get code sharing regardless.

In the second case, however, you have two choices – refactor the code into a library, or refactor the code into a DLL.

If you refactor the code into a library, then you’ll save in complexity because the code will be used in common. But you WON’T gain any savings in memory – a static library’s code is linked into each executable, so each application will have its own set of pages dedicated to the contents of the library.

If, on the other hand, you decide to refactor the library into its own DLL, then you will still save in complexity, and you get the added benefit that the working set of ALL 50 applications is reduced – the pages occupied by the code in the DLL are shared between all 50 applications.

You see, NT's pretty smart about DLL's (this isn’t unique to NT btw; most other operating systems that implement shared libraries do something similar). When the loader maps a DLL into memory, it opens the file, and tries to map that file into memory at its preferred base address. When this happens, memory management just says “The memory from this virtual address to this other virtual address should come from this DLL file”, and as the pages are touched, the normal paging logic brings them into memory.

If the pages are already in physical memory because another process has the DLL loaded, it doesn't go to disk to get the pages; it just maps the existing pages into the new process. It can do this because the addresses in the image are already correct for the preferred base address, so absolute jumps will work in the new process just like they would in the old. (The relocation fixup table is basically a table within the executable that lists the location of every absolute address in the code for the executable; when an executable is loaded somewhere other than its preferred base, the loader patches up these addresses to reflect the actual base address of the executable.) The pages are backed by the file containing the DLL - if a page containing the code for the DLL is ever discarded from memory, memory management will simply go back to the DLL file to reload it.

If the preferred address range for the DLL isn’t available, then the loader has to do more work. First, it maps the pages from the DLL into the process at a free location in the address space. It then marks all the pages as Copy-On-Write so it can perform the fixups without messing up the pristine copy of the DLL (it wouldn’t be allowed to write to the pristine copy of the DLL anyway). It then proceeds to apply all the fixups to the DLL, which causes a private copy of each page containing fixups to be created, and thus there can be no sharing of the pages which contain fixups.

This causes the overall memory consumption of the system to go up. What’s worse, the fixups are performed every time that the DLL is loaded at an address other than the preferred address, which slows down process launch time.
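To make the fixup step concrete, here's a rough sketch in Python. It is a toy model, not the NT loader: the image is a plain bytearray, the relocation table is a hypothetical list of offsets to 32-bit absolute addresses, and I'm assuming 4 KB pages. It patches each address by the difference between the actual and preferred base, and reports which pages got written to (those are the pages that would turn into private copies).

```python
import struct

PAGE_SIZE = 0x1000  # assumption: 4 KB pages


def apply_fixups(image, reloc_offsets, preferred_base, actual_base):
    """Patch every 32-bit absolute address listed in reloc_offsets.

    Returns the set of page indexes that were written to, i.e. the pages
    that would become private copies under copy-on-write.
    """
    delta = actual_base - preferred_base
    dirty_pages = set()
    if delta == 0:
        return dirty_pages  # loaded at the preferred base: nothing to patch
    for offset in reloc_offsets:
        (address,) = struct.unpack_from("<I", image, offset)
        struct.pack_into("<I", image, offset, (address + delta) & 0xFFFFFFFF)
        # A 4-byte value can straddle a page boundary, so record both ends.
        dirty_pages.add(offset // PAGE_SIZE)
        dirty_pages.add((offset + 3) // PAGE_SIZE)
    return dirty_pages


# A toy three-page image: data, resources, code (page indexes 0, 1, 2).
image = bytearray(3 * PAGE_SIZE)
# Hypothetical code on page 2 holds two absolute addresses that point
# into the data page, computed for a preferred base of 0x40000.
reloc_offsets = [2 * PAGE_SIZE + 0x10, 2 * PAGE_SIZE + 0x40]
struct.pack_into("<I", image, reloc_offsets[0], 0x40000 + 0x20)
struct.pack_into("<I", image, reloc_offsets[1], 0x40000 + 0x80)

# Loading at the preferred base touches nothing; loading elsewhere dirties
# the code page.
untouched = apply_fixups(bytearray(image), reloc_offsets, 0x40000, 0x40000)
relocated = apply_fixups(image, reloc_offsets, 0x40000, 0x50000)
print(sorted(untouched), sorted(relocated))
```

The point of returning the dirty page set is that it is exactly the cost being described: every page in that set can no longer be shared with other processes.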

One way of looking at it is to consider the following example. I have a DLL. It’s a small DLL; it’s only got three pages in it. Page 1 is data for the DLL, page 2 contains resource strings for the DLL, and page 3 contains the code for the DLL. Btw, DLL’s this small are, in general, a bad idea. I was recently enlightened by some of the Office guys as to exactly how bad this is; at some point I’ll write about it (assuming that Raymond or Eric don’t beat me to it).

The DLL’s preferred base address is at 0x40000 in memory. It’s used in two different applications. Both applications are based at 0x10000 in memory; the first one uses 0x20000 bytes of address space for its image, the second one uses 0x40000 bytes for its image.

When the first application launches, the loader opens the DLL and maps it at its preferred address. It can do this because the first app uses between 0x10000 and 0x30000 for its image. The pages are marked according to the protections in the image – page 1 is marked copy-on-write (since it’s read/write data), page 2 is marked read-only (since it’s a resource-only page) and page 3 is marked read+execute (since it’s code). When the app runs, as it executes code in the 3rd page of the DLL, the page is brought into memory. The instant that the DLL writes to its data segment, the first page of the DLL is forked – a private copy is made in memory and the modifications are made to that copy.
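If you want to see copy-on-write behavior for yourself, here's a small sketch using Python's mmap module. This is only an analogy for what happens to the DLL's data page: the temporary file stands in for the DLL on disk, and its contents are made up. ACCESS_COPY asks the operating system for a copy-on-write mapping, so writes land in private memory and the file is never touched.

```python
import mmap
import os
import tempfile

# Create a small stand-in for the DLL file on disk (contents are made up).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"pristine data page")
    path = f.name

with open(path, "rb") as f:
    # ACCESS_COPY requests a copy-on-write mapping: writes affect this
    # process's memory only and are never written back to the file.
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    view[0:8] = b"modified"          # this write forks a private copy
    private_contents = bytes(view[:])
    view.close()

# Re-read the file to confirm the on-disk copy is still pristine.
with open(path, "rb") as f:
    file_contents = f.read()

os.remove(path)
print(private_contents, file_contents)
```

The mapping sees the modified bytes while the file keeps the original ones, which is the same split each process gets for the DLL's data page.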

If a second instance of the first application runs (or another application runs that also can map the DLL at 0x40000), then once again the loader maps the DLL at its preferred address. And again, when the code in the DLL is executed, the code page is needed. But this time the page doesn’t have to be fixed up, so memory management simply maps the physical memory that already contains the page (from the first instance) into the new application’s address space. When the DLL writes to its data segment, a private copy is made of the data segment.

So we now have two instances of the first application running on the system. The space used for the DLL is consuming 4 pages (roughly, there’s overhead I’m not counting). Two of the pages are the code and resource pages. The other two are two copies of the data page, one for each instance.

Now let’s see what happens when the second application (the one that uses 0x40000 bytes for its image) launches. The loader can’t map the DLL at its preferred address (since the second application occupies from 0x10000 to 0x50000). So the loader maps the DLL into memory at (say) 0x50000. Just like the first time, it marks the pages for the DLL according to the protections in the image, with one huge difference: Since the code pages need to be relocated, they’re ALSO marked copy-on-write. And then, because it knows that it wasn’t able to map the DLL at its preferred address, the loader patches all the relocation fixups. These cause the page that contains the code to be written to, and so memory management creates a private copy of the page. After the fixups are done, the loader restores the page protection to the value marked in the image. Now the code starts executing in the DLL. Since it’s been brought into memory already (when the relocation fixups were done), the code is simply executed. And again, when the DLL touches the data page, a new copy is created for the data page.
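The address arithmetic in this example is easy to check with a few lines of Python. This is a sketch under stated assumptions: 4 KB pages, a process containing only the application image and this one DLL, and a made-up fallback policy of placing the DLL right after the application image (the real loader picks whatever free range it likes).

```python
PAGE_SIZE = 0x1000  # assumption: 4 KB pages


def overlaps(start_a, size_a, start_b, size_b):
    """True if [start_a, start_a + size_a) and [start_b, start_b + size_b) intersect."""
    return start_a < start_b + size_b and start_b < start_a + size_a


def choose_base(app_base, app_size, dll_preferred, dll_size):
    """Return (load_address, needs_fixups) for a DLL in a one-image process."""
    if not overlaps(app_base, app_size, dll_preferred, dll_size):
        return dll_preferred, False
    # Hypothetical fallback policy: first address past the application image.
    return app_base + app_size, True


DLL_SIZE = 3 * PAGE_SIZE  # the three-page DLL from the example

first_app = choose_base(0x10000, 0x20000, 0x40000, DLL_SIZE)
second_app = choose_base(0x10000, 0x40000, 0x40000, DLL_SIZE)
print(hex(first_app[0]), first_app[1])    # first app: DLL fits at 0x40000
print(hex(second_app[0]), second_app[1])  # second app: DLL is pushed elsewhere
```

The first application ends at 0x30000, so the DLL's preferred range is free; the second one ends at 0x50000, which swallows 0x40000 and forces the relocation.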

Once again, we start a second instance of the second application. Now the DLL’s using 5 pages of memory – there are two copies of the code page, one for the resource page, and two copies of the data page. All of which are consuming system resources.

One thing to keep in mind is that the physical memory page that backs the resource page in the DLL is going to be kept in common among all the instances, since there are no relocations to the page, and the page contains no writable data - thus the page is never modified.

Now imagine what happens when we have 50 copies of the first application running. There are 52 pages in memory consumed by the DLL – 50 pages for the DLL’s data, one for the code, and one for the resources.

And now, consider what happens if we have 50 copies of the second application running. Now we get 101 pages in memory, just from this DLL! We’ve got 50 pages for the DLL’s data, 50 pages for the relocated code, and still the one remaining for the resources. Nearly twice the memory consumption, just because the DLL wasn’t rebased properly.
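The page counts in this walkthrough follow a simple pattern, which can be written down as a tiny model. It's only a sketch of the three-page example DLL and bakes in the same simplifications as the text: every instance writes to the data page, every relocated load gets its own copy of the code page, and overhead isn't counted. The function and parameter names are mine.

```python
def pages_in_memory(instances, at_preferred_base):
    """Physical pages used by the three-page example DLL across `instances` processes."""
    if instances == 0:
        return 0
    data_pages = instances       # each process forks its own data page
    resource_pages = 1           # never written, so always shared
    # Code is shared only when no fixups had to be applied.
    code_pages = 1 if at_preferred_base else instances
    return data_pages + resource_pages + code_pages


for count in (2, 50):
    print(count,
          pages_in_memory(count, at_preferred_base=True),
          pages_in_memory(count, at_preferred_base=False))
```

In general the cost is n + 2 pages when the DLL loads at its preferred base and 2n + 1 pages when it doesn't, so the penalty grows with every additional process.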

This increase in physical memory isn’t usually a big deal when it happens only once. If, on the other hand, it happens a lot, and you don’t have the physical RAM to accommodate this, then you’re likely to start to page. And that can result in “significantly reduced performance” (see this entry for details of what can happen if you page on a server).

This is why it's so important to rebase your DLL's - it means the code pages in your DLL can be shared across processes instead of being copied into each one. This reduces the time needed to load your process, and means your process working set is smaller. For NT, there’s an additional advantage – we can tightly pack the system DLL’s together when we create the system. This means that the system consumes significantly less of the application’s address space. And on a 32-bit processor, application address space is a precious commodity (I never thought I’d ever write that an address space that spans 2 gigabytes would be considered a limited resource, but...).

This isn’t just restricted to NT by the way. Exchange has a script that’s run on every build that knows what DLLs are used in what processes, and it rebases the Exchange DLL’s so that they fit into unused slots regardless of the process in which the DLL is used. I’m willing to bet that SQL Server has something similar.
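I can't show the Exchange script, but the core idea of a rebasing pass can be sketched in a few lines. To be clear, this is my own illustration and not how any real product does it: the DLL names, sizes, and starting address are hypothetical, and the only hard fact it relies on is that Windows places images on 64 KB boundaries. It hands out non-overlapping base addresses by packing the DLLs downward from a chosen top address.

```python
ALLOCATION_GRANULARITY = 0x10000  # Windows places images on 64 KB boundaries


def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def plan_bases(dll_sizes, top_address):
    """Assign each DLL a non-overlapping base, packing downward from top_address.

    dll_sizes maps a DLL name to its image size in bytes; top_address must
    itself be a multiple of the allocation granularity.
    """
    bases = {}
    next_top = top_address
    for name in sorted(dll_sizes):  # sorted so the layout is reproducible
        next_top -= round_up(dll_sizes[name], ALLOCATION_GRANULARITY)
        bases[name] = next_top
    return bases


# Hypothetical DLL names and sizes, purely for illustration.
sizes = {"alpha.dll": 0x23000, "beta.dll": 0x8000, "gamma.dll": 0x10000}
bases = plan_bases(sizes, 0x60000000)
for name, base in bases.items():
    print(name, hex(base))
```

In a real build, addresses like these would then be applied to the binaries, for example through the linker's /BASE option. A real script also has to know which DLLs can end up in the same process, which this sketch ignores by giving every DLL its own slot.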

Credits: Thanks to Landy, Rick, and Mike for reviewing this for technical accuracy (and hammering the details through my thick skull). I owe you guys big time.