Understanding the classical model for linking, groundwork: The algorithm


The classical model for linking goes like this:

Each OBJ file contains two lists of symbols.

  1. Provided symbols: These are symbols the OBJ contains definitions for.

  2. Needed symbols: These are symbols the OBJ would like the definitions for.

(The official terms for these are exported and imported, but I will use provided and needed to avoid confusion with the concepts of exported and imported functions in DLLs, and because provided and needed more clearly captures what the two lists are for.)

Naturally, there is other bookkeeping information in there. For example, for provided symbols, not only is the name given, but also additional information on locating the definition. Similarly, for needed symbols, in addition to the name, there is also information about what should be done once its definition has been located.

Collectively, provided and needed symbols are known as symbols with external linkage, or just externals for short. (Of course, by giving them the name symbols with external linkage, you would expect there to be things known as symbols with internal linkage, and you'd be right.)

For example, consider this file:

// inventory.c

extern int InStock(int id);

int GetNextInStock()
{
  static int Current = 0;
  while (!InStock(++Current)) { }
  return Current;
}

This very simple OBJ file has one provided symbol, Get­Next­In­Stock: That is the object defined in this file that can be used by other files. It also has one needed symbol, In­Stock: That is the object required by this file in order to work, but which the file itself did not provide a definition for. It's hoping that somebody else will define it. There's also a symbol with internal linkage: Current, but that's not important to the discussion, so I will ignore it from now on.

OBJ files can hang around on their own, or they can be bundled together into a LIB file.

When you ask the linker to generate a module, you hand it a list of OBJ files and a list of LIB files. The linker's goal is to resolve all of the needed symbols by matching them up to a provided symbol. Eventually, everything needed will be provided, and you have yourself a module.

To do this, the linker keeps track of which symbols in the module are resolved and which are unresolved.

  • A resolved symbol is one for which a provided symbol has been located and added to the module. Under the classical model, a symbol can be resolved only once. (Otherwise, the linker wouldn't know which one to use!)

  • An unresolved symbol is one that is needed by the module, but for which no provider has yet been identified.

Whenever the linker adds an OBJ file to the module, it goes through the list of provided and needed symbols and updates the list of symbols in the module. The algorithm for updating this list of symbols is obvious if you've been paying attention, because it is a simple matter of preserving the invariants described above.

For each provided symbol in an OBJ file added to a module:

  • If the symbol is already in the module marked as resolved, then raise an error complaining that an object has multiple definitions.

  • If the symbol is already in the module marked as unresolved, then change its marking to resolved.

  • Otherwise, the symbol is not already in the module. Add it and mark it as resolved.

For each needed symbol in an OBJ file added to a module:

  • If the symbol is already in the module marked as resolved, then leave it marked as resolved.

  • If the symbol is already in the module marked as unresolved, then leave it marked as unresolved.

  • Otherwise, the symbol is not already in the module. Add it and mark it as unresolved.

The algorithm the linker uses to resolve symbols goes like this:

  • Initial conditions: Add all the explicitly-provided OBJ files to the module.

  • While there is an unresolved symbol:
    • Look through all the LIBs for the first OBJ to provide the symbol.

    • If found: Add that OBJ to the module.
    • If not found: Raise an error complaining of an unresolved external. (If the linker has the information available, it may provide additional details.)

That's all there is to linking and unresolved externals. At least, that's all there is to the classical model.

Next time, we'll start looking at the consequences of the rules for classical linking.

Sidebar: Modern linkers introduce lots of non-classical behavior. For example, the rule

  • If the symbol is already in the module marked as resolved, then raise an error complaining that an object has multiple definitions.

has been replaced with the rules

  • If the symbol is already in the module marked as resolved:
    • If both the original symbol and the new symbol are marked __declspec(selectany), then do not raise an error. Pick one arbitrarily and discard the other.

    • Otherwise, raise an error complaining that an object has multiple definitions.

Another example of non-classical behavior is dead code removal. If you pass the /OPT:REF linker flag, then after all externals have been resolved, the linker goes through and starts discarding functions and data that are never referenced, taking advantage of another non-classical feature (packed functions) to know where each function begins and ends.

But I'm going to stick with the classical model, because you need to understand classical linking before you can study non-classical behavior. Sort of how in physics, you need to learn your classical mechanics before you study relativity.

Comments (23)
  1. CPDaniel says:

    This first linkers I used baack in the early 80's were even simpler than this classical model:  they would only make a single pass through the LIB files and if anything was undefined, they'd indicate an error.  It was necessary with these linkers to include a module in a library more than once in order to handle circular dependencies.  This was in the CP/M-80 days when the most popular languange compiler around was Digital Research Pascal/MT+.

  2. Adam Rosenfield says:

    Another non-classical behavior: weak symbols vs. strong symbols.  Symbols are strong by default and can be marked as weak with the attribute((weak)) syntax in GCC and GCC-like compilers (I'm unsure if any of the Microsoft compilers support a similar feature).

    If the symbol is already in the module marked as resolved:

    • If both the original symbol and the new symbol are marked attribute((weak)), then do not raise an error.  Pick one arbitrarily and discard the other.

    • If one of the original symbol and the new symbol is marked attribute((weak)), then do not raise an error.  Pick the strong symbol and discard the weak symbol.

    • Otherwise, raise an error complaining that an object has multiple definitions.

  3. Danny Moules says:

    @Adam You may find this interesting RE MSVC behaviour: stackoverflow.com/…/gcc-style-weak-linking-in-visual-studio (There's a neat answer using an undocumented linker pragma).

  4. Michiel says:

    While we're still at groundwork, let's make it explicit: "symbols" to a classical linker are just functions and variables. Constants already get tricky.

    Personally the most surprising thing was the fact that linkers link OBJ files, not functions and variables. And even with /Gy and /OPT:REF linking still doesn't work by starting at main() and recursively adding to that.

  5. Joshua says:

    Isn't there something in the classical model where a later .LIB cannot reference an .OBJ in an earlier .LIB?

  6. Matt says:

    What is the use-case for __select_any? I've written C/C++ for years, and I can't for the life of me think of a good reason for it, other than to introduce non-deterministic bugs to your build.

    [The C++ compiler uses them for inline functions whose addresses are taken and template methods. -Raymond]
  7. Matt says:

    Ok – according to the MSDN documentation:

    "A global data item can normally be initialized only once in an EXE or DLL project. selectany can be used in initializing global data defined by headers, when the same header appears in more than one source file. selectany is available in both the C and C++ compilers."

    So basically it's so you can do "int MyExternVar = 0;" in a header and not get the linker to explode with errors telling you that every C file is defining overlapping "MyExternVar"s.

  8. Joshua says:

    @Matt: I wouldn't have dared use it for variables as the compiler is allowed to direct-bind within an .OBJ file, but for functions declared in header files nothing can really go wrong.

  9. dave says:

    "symbols" to a classical linker are just functions and variables. Constants already get tricky.

    From what I recall, that's not true.  A constant is just an absolute symbol, a variable is a relocatable symbol.

    (Based on vague memories of the linkers I used in the mid-1970s, which I think are old enough to be called classical)

    Isn't there something in the classical model where a later .LIB cannot reference an .OBJ in an earlier .LIB?

    No, that's just a limitation in some implementations.  Not the DEC linkers I used in the mid-1970s.

  10. Matt says:

    @Evan: In MSVC __inline or __forceinline functions don't get external linkage – they get immediately inlined during compilation of the function. I use __forceinline functions in headers all the time and never even thought that selectany might exist.

  11. waleri says:

    I was always wondering why the linker first links everything and then eliminates unused instead doing it the other way around – starting from a known entry point and then adding only referred segments.

    As for the "Packed functions" and "selectany", I believe an OBJ file with multiple segments and/or common linkage were always supported, the only question was why compilers didn't use it.

    [Tune in tomorrow! -Raymond]
  12. alegr1 says:

    "selectany" symbols are necessary for:

    a) member functions defined in the class definition body, when inlining is disabled for debugging;

    b) vtable;

    c) identical non-inline template functions (and member functions of template classes) generated in different files;

    Also, ICF – identical COMDAT folding – needs to be disabled for a debug build, because it makes setting breakpoints in different functions with identical code impossible.

  13. Joshua says:

    a) member functions defined in the class definition body, when inlining is disabled

    These used to end up static.

    b) vtable;

    It went with the first virtual function.

    c) identical non-inline template functions (and member functions of template classes) generated in different files;

    Believe it or not these used to end up static too, which is why old geezers were wary of templates.

  14. alegr1 says:

    > a) member functions defined in the class definition body, when inlining is disabled

    These used to end up static.

    And how would you set breakpoints in them?

    > b) vtable;

    It went with the first virtual function.

    …which can be defined in the class definition body.

  15. Evan says:

    One more comment on "inline" for the moment.

    C++03 standard, 7.1.2 [Function specifiers], footnote #79: "The 'inline' keyword has no effect on the linkage of a function". Para 4: "If a function with external linkage is declared inline in one translation unit, it shall be declared inline in all translation units in which it appears; no diagnostic is required. An inline function with external linkage shall have the same address in all translation units. A static local variable in an extern inline function always refers to the same object. A string literal in an extern inline function is the same object in different translation units."

    So I stand by my earlier statement that C++03 'inline' will leave functions with external linkage, and because of the other requirements of that paragraph you need something like 'selectany'.

    What MSVC does is a somewhat different matter.

  16. Joshua says:

    >> a) member functions defined in the class definition body, when inlining is disabled

    >These used to end up static.

    And how would you set breakpoints in them?

    Dunno but my debugger did.

    > b) vtable;

    >It went with the first virtual function.

    …which can be defined in the class definition body.

    I remember that yielding an error on my old compiler as soon as any virtual function was defined in the body and the header file was included in two .CPP files.

  17. jon w says:

    Some linkage models also keep a list of required symbols per symbol rather than per OBJ. This leads to less spurious dependency growth. You can get this behavior in gcc by passing -ffunction-sections.

    Other linkage models require the symbol to be found in a container of a specific name (spwcific dynlib) which cuts down on sprouts breakage when unrelated libraries introduce clashing symbol names.

    Both of these are, IMO, significant improvements over the classic Unix model.

  18. Jonathan says:

    I've used __declspec(selectany) many times when I needed a module-wide global var. Easier than having it once in a .h, and actually declared in a .c[pp] (or doing macro-trickery to achieve that). Just need to be careful not to include the .h in multiple modules, since each one will get its own instance of the global var, which may not be what you want.

  19. Evan says:

    @Matt:

    The thing that made selectany necessary I suspect was templates. (I'm familiar with a mechanism of similar description to selectany but am not familiar enough with MSVC specifically to know if that's what the exact mechanism it uses in the following situation.) Basically the only way that templates work (ignoring 'export' which almost no one supports anyway) is by putting the template definitions in headers. Each compilation unit in which the template is instantiated will have the symbols. If I have foo.cpp with a call to 'vector<int>::push_back(int)' and bar.cpp with a call to 'vector<int>::push_back(int)', then the compiler will instantiate that function in both compilation units, generating code in both. Under the classical model, at link time: poof, duplicate definition.*

    There are other places it gets used — in particular, functions defined in headers (e.g. because you want them inlined) — but templates are what basically forced the linker's hand and left little choice.

    * The reason that this is basically how templates *have* to work is the following. Let's suppose you are defining some 'template<typename T> class MyVector' in MyVector.cpp, and instantiating it with 'MyVector<string>' and 'MyVector<int>' in Stuff.cpp. (There would be a non-defining declaration in MyVector.hpp that's included from Stuff.cpp.) The compiler never knows what it needs to do! When compiling 'MyVector.cpp' it knows what the C++ code for the MyVector template is, but it doesn't know what types it must instantiate it with. (Each instantiation gets separate generated code.) When compiling 'Stuff.cpp' it knows what types 'MyVector' has been instantiated with, but it doesn't know what the C++ code is and so can't do the instantiation! The only resolutions are (1) to require that the C++ code and instantiations are co-located (which is usually done by putting the definitions of the template into the header file), or (2) to have the compiler produce a list of instantiations which should be done at link time. The latter puts a lot of burden on the linker and will make links take potentially a long time.

  20. Evan says:

    @Matt:

    I'm not positive about this (there is conflicting information), but in C++, functions marked 'inline' still have external linkage. I didn't test, but from the MSDN docs it seems that MSVC is non-compliant in this regard (assuming that I'm correct). I also can't speak to __inline/__forceinline. This can arise if the compiler refuses to inline the function for some reason, or if the address of the function is taken: my understanding is that the address of even an inline function 'foo' is guaranteed to be the same across compilation units. For the same reasons that templates force 'selectany', so does this: the compiler must generate code for 'foo' in each compilation unit in which its address is taken, but it must merge those definitions at link time.

    Even if the above is untrue for C++98/03, it is true in C++11.

    In C, this is not true, and 'inline' functions have internal linkage.

    However, I believe that even in MSVC there is a case where this occurs, which is member functions defined inline in a class ('class C { int random() { return 9; } };'). What the standard says about the linkage of *these* functions is much less conflicted, and they get external linkage — and thus multiple definitions require collapsing duplicated symbols.

  21. MCC says:

    Cool topic – looking forward to the rest.  Thanks Raymond!

  22. Matt says:

    @Evan: "One more comment on "inline" for the moment."

    I wasn't talking about "inline", I was talking about "__inline" – which is MSVC specific and isn't required to adhere to C++ standards.

  23. Evan says:

    @Matt:

    I get that; I was just trying to say how the C++ standard basically necessitates a feature like selectany.

Comments are closed.