Rethinking the way DLL exports are resolved for 32-bit Windows


Over the past few days we’ve learned how 16-bit Windows exported and imported functions from DLLs, and that the way functions are exported from 32-bit DLLs matches the 16-bit method reasonably well. But the way 16-bit Windows imported functions simply doesn’t work in the 32-bit world.

Recall that in 16-bit Windows, the fixups for an imported function are threaded through the code segment. This worked great in 16-bit Windows because there was a single address space: Code segments were shared globally, and once a segment was loaded, every process could use it. But 32-bit Windows uses separate address spaces. If the fixups were threaded through the code segment, then loading a code page from disk would necessarily entail modifying it to apply the fixups, which prevents the page from being shared by multiple processes. Even if the fixup table were kept external to the code segment, you would still have to fix up the code pages to establish the jump targets. (With sufficient cleverness, you could manage to share a page if all the fixups on it happened to agree exactly with those of another process, but the bookkeeping for this would get rather messy.)

But beyond just being inefficient, the idea of applying import fixups directly to the code segment is downright impossible on some processors. The Alpha AXP has a “call direct” instruction, but it is limited to functions that are at most 128KB away. If you want to call a function that is farther away, you have to load the destination address into a temporary register and call through that register. And as we saw earlier, loading a 32-bit value into a register on the Alpha AXP is a two-step operation whose encoding depends on whether bit 15 of the value you want to load is set or clear. Since this is an imported function, we have no idea at compile or link time whether the target function’s address will have bit 15 set or clear.
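To see why bit 15 matters, here is a minimal C sketch of the arithmetic. It is not taken from any real toolchain. It assumes the usual Alpha AXP idiom of an LDAH followed by an LDA, each of which adds a sign-extended 16-bit immediate (LDAH shifts its immediate left by 16 first). The struct and function names are made up for illustration, and only the low 32 bits of the register are modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of 16-bit immediates for an LDAH/LDA sequence. */
struct alpha_imm_pair {
    uint16_t high; /* immediate for LDAH (shifted left 16, sign-extended) */
    uint16_t low;  /* immediate for LDA (sign-extended) */
};

/* Split a 32-bit value into the two immediates. */
struct alpha_imm_pair split_address(uint32_t value)
{
    struct alpha_imm_pair p;
    p.low = (uint16_t)(value & 0xFFFFu);
    /* If bit 15 is set, LDA's sign extension effectively subtracts
       0x10000, so the high half must be one larger to compensate. */
    p.high = (uint16_t)((value >> 16) + ((value >> 15) & 1u));
    return p;
}

/* Sign-extend a 16-bit immediate the way LDA and LDAH do. */
static int32_t sign_extend16(uint16_t x)
{
    return x >= 0x8000u ? (int32_t)x - 0x10000 : (int32_t)x;
}

/* Simulate LDAH followed by LDA, keeping only the low 32 bits. */
uint32_t recombine(struct alpha_imm_pair p)
{
    uint32_t reg = (uint32_t)sign_extend16(p.high) << 16;
    reg += (uint32_t)sign_extend16(p.low);
    return reg;
}

/* Usage: the split must survive a round trip for any value. */
void check_roundtrip(uint32_t value)
{
    assert(recombine(split_address(value)) == value);
}
```

Note that 0x12345678 (bit 15 clear) needs a high immediate of 0x1234, while 0x1234ABCD (bit 15 set) needs 0x1235. The high immediate cannot be chosen until the full address is known, and for an imported function that only happens at load time.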

(And the Alpha AXP was hardly the only architecture that restricted the reach of direct calls. The Intel ia64 can make direct calls to functions up to 4MB away, and the AMD x86-64 and Intel EM64T architectures can reach up to 2GB away. This sounds like a lot until you realize that they are 64-bit processors with 16 exabytes of address space. Once again, we see that the x86 architecture is the weirdo.)
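To put a number on that 2GB reach: on x86-64, a direct near call carries a signed 32-bit displacement measured from the end of the five-byte instruction. The helper below is a hypothetical sketch (not a Windows API, and the addresses are arbitrary) that checks whether a target could be reached that way.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: can a 5-byte "call rel32" located at call_site
   reach target? The displacement is relative to the next instruction. */
int reachable_by_rel32(uint64_t call_site, uint64_t target)
{
    uint64_t next_instruction = call_site + 5;
    /* Unsigned subtraction wraps; reinterpret the difference as signed. */
    int64_t displacement = (int64_t)(target - next_instruction);
    return displacement >= INT32_MIN && displacement <= INT32_MAX;
}

/* Usage with arbitrary example addresses. */
void rel32_self_check(void)
{
    /* A neighbor 4KB away is reachable. */
    assert(reachable_by_rel32(0x140001000ULL, 0x140002000ULL));
    /* A target far across the 64-bit address space is not. */
    assert(!reachable_by_rel32(0x140001000ULL, 0x7FFE00000000ULL));
}
```

Since the loader decides where an imported DLL lands, nothing at link time guarantees that the target falls inside that window.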

Both of the above concerns made it undesirable (or impossible) for import fixups to modify code. Instead, import fixups have to apply to data. Rather than applying a fixup at each location where an imported function is used, a single fixup is applied to an entry in a table of function pointers. This means that calls to imported functions are really indirect calls through the function pointer. On an x86, this means that instead of call ImportedFunction the generated code says call [__imp__ImportedFunction], where __imp__ImportedFunction is the name of the variable that holds the function pointer for that imported function.

This means that resolving imported functions is a simple matter of looking up the target addresses and writing the results into the table of imported function addresses. The code itself doesn’t change; it just reads the function address at runtime and calls through it.
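As a rough illustration of the idea, the following C sketch models an import table as an array of function-pointer slots that a pretend loader fills in by name. Everything here (the struct names, the export list, resolve_imports, call_import) is hypothetical and lives inside a single program; the real loader works from the import and export tables stored in the module images.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*binary_fn)(int, int);

/* Hypothetical stand-ins for functions exported by another module. */
static int exported_add(int a, int b) { return a + b; }
static int exported_mul(int a, int b) { return a * b; }

/* One export: a name and the address it resolves to. */
struct export_entry { const char *name; binary_fn address; };

static const struct export_entry export_table[] = {
    { "Add", exported_add },
    { "Mul", exported_mul },
};

/* One import: the name wanted and the slot that receives the address. */
struct import_entry { const char *name; binary_fn slot; };

static struct import_entry import_table[] = {
    { "Mul", NULL },
    { "Add", NULL },
};

/* Pretend loader: look up each imported name and write the address
   into its slot. Returns the number of imports left unresolved. */
int resolve_imports(void)
{
    size_t n_imports = sizeof import_table / sizeof import_table[0];
    size_t n_exports = sizeof export_table / sizeof export_table[0];
    int unresolved = 0;

    for (size_t i = 0; i < n_imports; i++) {
        import_table[i].slot = NULL;
        for (size_t j = 0; j < n_exports; j++) {
            if (strcmp(import_table[i].name, export_table[j].name) == 0) {
                import_table[i].slot = export_table[j].address;
                break;
            }
        }
        if (import_table[i].slot == NULL)
            unresolved++;
    }
    return unresolved;
}

/* Calling code: never patched. It reads the slot at run time and
   calls through it, much like call [__imp__ImportedFunction]. */
int call_import(size_t index, int a, int b)
{
    assert(import_table[index].slot != NULL);
    return import_table[index].slot(a, b);
}
```

After resolve_imports() returns 0, call_import(0, 3, 4) goes through the "Mul" slot and call_import(1, 3, 4) goes through the "Add" slot. Only the data in import_table was written; the code of call_import is identical no matter where the targets ended up.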

With that simple backgrounder, we are equipped to look at some of the deeper consequences of this design, which we will do next time.

Comments (17)
  1. LarryOsterman says:

    A minor nit: you’ll only see call [__imp__ImportedFunction] if ImportedFunction is marked __declspec(dllimport) in the .h file.

    Otherwise you’ll see:

       call ImportedFunction

    ImportedFunction:

       jmp [__imp__ImportedFunction]

    [More on this on Monday. -Raymond]
  2. Tom says:

    I was going to mention the same thing as Larry, but work got in the way.  :)

  3. Jules says:

    Question: is the table of function pointers writable by the application?  Is there an easy way to (a) find it and (b) determine how it is laid out?

    I can see … err … applications for this. :)

  4. Gabe says:

    Being the only CISC one of the bunch, x86 may be an outlier among the Windows architectures, but it was pretty much the norm when it came out. All of the other architectures (MIPS, Alpha, PPC, IA64) are RISC or post-RISC, and none of them even existed when Windows first came out.

  5. Boogie says:

    I think you see a jump table in debug builds, too — I’m not talking about imported calls, but everyday library calls.

    [That’s a compiler-specific thing. Has nothing to do with DLL imports and exports. -Raymond]
  6. Marcel says:

    Boogie: just a guess, but that sounds like the work of an incremental linker.

  7. josh says:

    Jules:  You CAN use this table to redirect dll imports.  (I think it’s this table…)

  8. Jens Bäckman says:

    This question might fit with the whole DLL theme you’re running now, Raymond…

    We have a bunch of inhouse-developed C++ classes that utilize large parts of the STL and follow quite a few design patterns. They build and work without a hitch in their original Unix environment. It took me about five minutes to get them to compile in Windows too, and all testcases worked the first try. Everything was wonderful.

    Except for the fact that I had done a static link. After trying to build a DLL with these classes, I got bizarre crashes when deleting objects. A minute of source code inspection told me that I was not allowed to let a factory class in the DLL allocate an object, and then delete it in the main program. Adding a function in the DLL that deleted any pointer you sent to it worked fine, though.

    Any experience or explanation for this behaviour?

  9. CN says:

    Jules: Adding to what josh said, I think this was what the unofficial WMF patch did to patch the calls in-memory.

  10. KiwiBlue says:

    Jens, it seems like your .dll is linked with static CRT. In such case the library and main program have two separate heaps; memory allocated in .exe can’t be freed by .dll and vice versa.

  11. Chris Becke says:

    Jens: The DLL and the Application (EXE) are using different instances of a C-Runtime. Which means, simply, that the heap managers used by the EXE and Dll are incompatible.

    To resolve the problem, make sure that both projects use the same CRT (by, for example, setting both to use msvcrt.dll), or – as you have done – ensuring that things are always deallocated using the same heap manager that allocated them originally.

  12. James Risto says:

    Ok, at the risk of proving that a little knowledge is dangerous: is it not true that 32-bit DLLs need no fixups, because they are loaded at the same address in each process’s virtual address space? Ok, I guess if two DLLs collide, then fixups are still needed.

  13. Neil says:

    Is the import table preloaded with the preferred addresses of the target functions?

  14. Mike Dimmick says:

    James Risto: If the DLL can load at its preferred base address, then no fixups are required. If it cannot, any absolute addresses still need to be altered, and this work is done by the loader. The x86 has poor support for program counter-relative addressing, so many more absolute addressing operations are needed, and many more fixups are generated for this architecture.

    The addresses of fixups are kept in the .reloc section of the image. To see the size of fixups (relocations) use dumpbin /headers. To see the fixups themselves, run dumpbin /relocations.

    Neil: by default, no, but you can do this by running the bind tool. If the loader detects that the version of the exporting DLL matches the version recorded in the header by ‘bind’, it will skip checking the import table. I think binding becomes less important in Windows Vista due to Address Space Layout Randomization (see http://blogs.msdn.com/michael_howard/archive/2006/05/26/608315.aspx) although the suggestion that there are 256 possible layouts might make it possible to simply set the bits appropriately on a bound image rather than have to actually search the export table.

  15. KJK::Hyperion says:

    A little bit of trivia for fans of the most obscure corners of Windows…

    UNIX executables for Windows (yes, Virginia, there are UNIX executables for Windows) compiled with GCC with the -fPIC flag (Position Independent Code) use, instead, the AT&T standard for relocations: a table (the GOT, Global Offset Table) is kept of all pointers used in the code, so any relocations affect only a contiguous block of data rather than sparse pages of code.

    It is a lot more efficient, and effectively allows any DLL to be relocated at will, but sadly on x86 it has quite some side effects…

    Like, for example, all non-leaf functions have one less register to work with (specifically, EBX), to cache the address of the GOT, which would be bad enough…

    Also, all functions with non-static linkage need to have a short prolog that calculates the position of the GOT relative to the current EIP. Program-counter-based addressing isn’t supported on x86, so the prolog has to emulate it and is quite convoluted – although it can be skipped for calls within the same module, because the address is cached in EBX.

    The alternative approach would be "far" pointers, which can take two forms: "fat" pointers encoding both the address of the GOT (or a fixed offset from it, like the image’s base address) and the address (or offset) of the function, or indirect pointers, which actually point to a structure containing all the required information. Windows employed the former under 16-bit x86 (only differently, because it was a segmented memory architecture), and the latter under MIPS, Alpha and IA64

    Aaand yes. The funny part is that they are PE executables, not ELF. After all, PE is just COFF in drag, and UNIX has run on COFF executables for decades. Did I mention they also support UNIX-style dynamic linking?

  16. Neil says:

    Thanks, Mike.

    One other question I hope someone can answer:

    In 16-bit Windows you could import functions in your .DEF file (which I frequently did, because I was trying to target 3.1 with a 3.0 compiler). Is this no longer possible in 32-bit Windows?

  17. Yuhong Bao says:

    Actually, both x86 and x64 allow near calls up to 2 GB (32-bit and 64-bit) or 32 KB (16-bit) away.

    Neil:

    It still can be done.
