Calling an imported function, the naive way

An import library resolves symbols for imported functions, but it isn't consulted until the link phase. Let's consider a naive implementation where the compiler is blissfully unaware of the existence of imported functions.

In the 16-bit world, this caused no difficulty at all. The compiler generated a far call instruction and left an external record in the object file indicating that the address of the function should be filled in by the linker. At that time, the linker realizes that the external symbol corresponds to an imported function, so it takes all the call targets, threads them together, and creates an import record in the module's import table. At load time, those call entries are fixed up and everybody is happy.

Let's look at how a naive 32-bit compiler would deal with the same situation. The compiler would generate a normal call instruction, leaving the linker to resolve the external. The linker then sees that the external is really an imported function, and, uh-oh, the direct call needs to be converted to an indirect call. But the linker can't rewrite the code generated by the compiler. What's a linker to do?

The solution is to insert another level of indirection. (Warning: The information below is not literally true, but it's "true enough". We'll dig into the finer details later in this series.)

For each function a DLL exports, the import library defines two external symbols. The first is for the entry in the imported functions table, which takes the name __imp__FunctionName. Of course, the naive compiler doesn't know about this fancy __imp__ prefix. It merely generates the code for the instruction call FunctionName and expects the linker to produce a resolution.

That's what the second symbol is for. The second symbol is the longed-for FunctionName, a one-line function that consists merely of a jmp [__imp__FunctionName] instruction. This tiny stub of a function satisfies the external reference and in turn generates an external reference to __imp__FunctionName, which is resolved by the same import library to an entry in the imported function table.

When the module is loaded, then, the import is resolved to a function pointer and stored in __imp__FunctionName, and when the compiler-generated code calls the FunctionName function, it calls the stub, which trampolines (via the indirect jmp) to the real function entry point in the destination DLL.

Note that with a naive compiler, if your code tries to take the address of an imported function, it gets the address of the FunctionName stub, since a naive compiler simply asks for the address of the FunctionName symbol, unaware that it's really coming from an import library.

Next time, we'll look at the dllimport declaration specifier and how a less naive compiler generates code for an imported function.

Comments (14)
  1. Jim Dodd says:

    Well, no one else has commented so I’ll start. Thanks for this article. And for all the others so far in the DLL series. I thought I knew a lot about DLLs but found that I didn’t know as much as I thought. I was especially interested in this article because I’ve been attempting to extend the compiler that we use for our embedded language in our line of battery-powered data loggers. Up to this point, we’ve limited ourselves to allowing customers just one source file and we’ve embedded the "library" of functions they can call in the device itself. I was thinking that we should only load the functions the customer needs for the program and also allow customers to create their own libraries of functions in multiple files. While there are lots of books and classes on compilers, I’m having trouble finding good sources for writing a linker. Ours would be very primitive and you’ve helped me see a direction to take. Keep up the great blogging, Raymond. And, as I write this, it looks like Floyd Landis has wrapped up the Tour – almost.

  2. waleri says:

    I don’t understand. Does that mean that a direct JMP or CALL instruction DOES NOT cause a queue reload? Where is the next instruction loaded from, then?

  3. Myria says:

    Yes.  The processor knows the target of the jump long before it gets there, so it automatically preloads the instructions at the target of the jump.

    It’s the same way branch prediction works, except that it knows for sure that the branch will occur.

  4. Myria says:

    When PE was designed, this way was the best way to implement it.  __declspec(dllimport) makes the compiler do the indirection where possible, and the thunks handle where it isn’t possible.

    The problem is that times have changed.  An indirect call is extremely slow on modern processors, because it causes a full instruction queue reload.  (Keep that in mind next time you are deciding whether you need virtual functions.)

    The way that things should have been done is to use the existing trampoline stubs, except make ntdll’s Ldr* stuff modify the jmp’s themselves instead of simply an import table.  Then the code becomes "call near thunk_SendMessageW".  At thunk_SendMessageW is "jmp near SendMessageW".  The PE loader would modify the bytes after the E9 to point to the correct address.  For security, Ldr* would mark this region as PAGE_EXECUTE_READ after it’s done modifying.

    This is highly specific to x86-32, because x86-64 and PowerPC can’t do a direct jump to anywhere in the address space.  I don’t know IA64 so I have no idea with that one.

    It’s too late to have this at the ntdll level, but such a system could be implemented with a combination of compiler, linker, and crt0 code.  Or ntdll’s Ldr* could have a new option for that kind of import table, and crt0 could do it itself if ntdll didn’t support it.

  5. Yosi says:

    An indirect jump doesn’t flush the instruction pipe. Where did you hear that nonsense? It will initiate a write-back fifo access (or cache access) to bring in the target address, but this has nothing to do with the instruction queue.

    Actually, the queue will get stalled in case of a cache miss, since main memory access time is very slow compared to the core clock speed.

  6. Eggman says:

    Yosi is right; on all x86es since at least the Pentium 1, indirect branches like the ones discussed here are predicted to go to the same address as last time. The Pentium M and newer CPUs (Core) have a more sophisticated mechanism for indirect calls (in order to handle virtual calls that go to different addresses at different times), but that would go unused here.

    And as a side note, since there seems to be some confusion here too: jmp, bcc, call, etc. are all branch-predicted, so even a jmp gets a "prediction" even though it always jumps.

  7. Dean Harding says:

    It would be pretty silly of processor-designers to NOT optimize their CPUs for some of the most common cases – virtual functions and DLL-calls.

    Optimization (at this level, at least) is a two-way street — software designers optimize their code for the CPU and hardware designers optimize the CPU for whatever code executes on it.

  8. Shyguy says:

    > (Keep that in mind next time you are deciding whether you need virtual functions.)

    Wrong way.

    Keep that in mind next time you’re profiling an already working application which is performing too slow.

    Breaking good design because "in some future it might be slow" is the wrong way around; you will end up avoiding virtuals altogether, when less than 1% of virtual calls will impact performance in any serious way.

  9. BryanK says:

    Dean: But "virtual functions" and "DLL calls" are *not* something the processor designers can optimize for.  Those are several levels of abstraction above the processor.  The code to actually do them is generated by the C++ compiler and library loader, respectively; the choices that Microsoft’s C++ compiler and library loader made are not the only possible choices.

    (Nitpicky?  Well, yes; why do you ask?  :-P)

  10. Chris Becke says:

    BryanK: “But “virtual functions” and “DLL calls” are *not* something the processor designers can optimize for.”

    er, Why not? In software, we profile first, to get an idea of where the software might benefit from optimization.

    I’m not too sure that hardware engineers aren’t under similar economic constraints to spend their time efficiently.

  11. waleri says:

    I am probably stupid or something, but considering instructions like

    jmp/call $addr

    jmp/call $[addr]

    Do you think the CPU even bothers to *predict* such a jump?

    Is call [$addr] *significantly* slower than call $addr, when *both* the addr variable itself and the destination it points to are outside cached memory?

    Yes, it will be slower because an extra read will take place, but I doubt it is like 10 times slower or something.

    I think it is all about what "significantly" means in this case…

  12. difference says:

    Are DLL calls always slower than ordinary calls inside a binary?

    [That question is too vague to be answerable. I can invent some really inefficient intra-binary calling conventions. -Raymond]
  13. difference says:

    > Are DLL calls always slower than ordinary calls inside a binary?

    > [That question is too vague to be answerable. I can invent some really inefficient intra-binary calling conventions. -Raymond]

    I was assuming VS and other common Windows “standards”, without constructing your own calling convention.

    Another question: How feasible would it be for the linker to embed a DLL (its code & data) into an exe (calling it like a lib) to gain performance?

    [It still depends on what compiler and linker flags you used. If you have a suggestion for a future topic, use the suggestion box. -Raymond]
Comments are closed.
