Linker throughput

Hello, my name is Chandler Shen, and I am a developer on the Visual C++ Shanghai team.

We have made some changes in the upcoming Visual C++ 2010 release to improve the performance of the linker. I will first give a brief overview of the linker and of how we analyzed the bottlenecks in the current implementation. Then I will describe the changes we made and their impact on linker performance.

Our Focus

 

We targeted linker throughput for full builds of large-scale projects, because that scenario matters most for linker throughput scalability. Incremental linking and smaller projects will not benefit from the work I describe in this blog.

Brief Overview of Linker

 

Traditionally, the work done by the linker can be split into two phases:

1. Pass1: collecting the definitions of symbols (from both object files and libraries).

2. Pass2: fixing up references to symbols with their final addresses (actually Relative Virtual Addresses) and writing out the final image.
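
The two passes above can be sketched with a toy model. This is only an illustration, not how the Visual C++ linker is implemented: the object-file dictionaries, the `IMAGE_BASE_RVA` constant, and the `pass1`/`pass2` function names are all assumptions made for the example.

```python
# Toy model of a two-pass link. The object layout is hypothetical and far
# simpler than a real COFF object file.
IMAGE_BASE_RVA = 0x1000  # assumed RVA of the first contribution


def pass1(objects):
    """Collect symbol definitions and assign each one an RVA."""
    symbol_table = {}
    next_rva = IMAGE_BASE_RVA
    for obj in objects:
        for name, size in obj["defines"].items():
            if name in symbol_table:
                raise ValueError(f"duplicate definition of {name}")
            symbol_table[name] = next_rva
            next_rva += size
    return symbol_table


def pass2(objects, symbol_table):
    """Fix up every reference with the final RVA of its target."""
    fixups = []
    for obj in objects:
        for name in obj["references"]:
            if name not in symbol_table:
                raise KeyError(f"unresolved external symbol {name}")
            fixups.append((obj["name"], name, symbol_table[name]))
    return fixups


objects = [
    {"name": "main.obj", "defines": {"main": 0x20}, "references": ["helper"]},
    {"name": "util.lib", "defines": {"helper": 0x10}, "references": []},
]
table = pass1(objects)
fixups = pass2(objects, table)
print(table)
print(fixups)
```

Note that `main.obj` refers to `helper` before the object that defines it has been seen, which is why references can only be fixed up after every definition has been collected.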

Link Time Code Generation (LTCG)

If /GL (Whole Program Optimization) is specified when compiling, the compiler generates object files in a special format that contains intermediate language. When the linker encounters such object files, Pass1 becomes a two-phase procedure. First, the linker calls into the compiler to collect the definitions of all public symbols from these object files and builds a complete public symbol table. Then the linker supplies this symbol table to the compiler, which generates the final machine instructions (code generation).
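
The ordering of the two LTCG phases can be sketched as follows. This is a rough model under stated assumptions: `collect_public_symbols`, `generate_code`, and the dictionary layout of the intermediate-language objects are hypothetical names invented for illustration, not the actual interface between the linker and the compiler.

```python
# Toy model of Pass1 under LTCG. All names and structures are hypothetical.
def collect_public_symbols(il_objects):
    """Phase 1: gather public symbol definitions from every IL object."""
    table = {}
    for obj in il_objects:
        for name in obj["publics"]:
            table[name] = obj["name"]  # remember which object defines it
    return table


def generate_code(il_objects, symbol_table):
    """Phase 2: code generation with the complete symbol table available."""
    emitted = []
    for obj in il_objects:
        for func, callees in obj["il"].items():
            for callee in callees:
                if callee not in symbol_table:
                    raise KeyError(f"unresolved external symbol {callee}")
            emitted.append(f"{obj['name']}:{func}")
    return emitted


il_objects = [
    {"name": "a.obj", "publics": ["main"], "il": {"main": ["helper"]}},
    {"name": "b.obj", "publics": ["helper"], "il": {"helper": []}},
]
symbols = collect_public_symbols(il_objects)  # must cover all objects first
machine_code = generate_code(il_objects, symbols)
```

The point of the sketch is the dependency: code generation for `a.obj` needs to know about `helper`, which is defined in `b.obj`, so the first phase has to finish for every object before the second phase can begin.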

Debug Information

During Pass2, in addition to writing the final image, the linker also writes debug information into a PDB (Program Database) file if the user specifies /DEBUG (Generate Debug Info). Some of this debug information, such as the addresses of symbols, is not known until link time.

Bottlenecks

In this section, I will show how we analyzed some test cases to identify the performance bottlenecks.

Test Cases

To reach an objective conclusion, we chose four real-world projects of different scales as test cases. Their names are omitted; they are referred to here as proj1, proj2, proj3 and proj4.

Table 1 Measurements of test cases

                 Proj1    Proj2    Proj3    Proj4
Files   Total       55       27      168     1066
        .obj         4        6        7      882
        .lib        51       21      161      184
Symbols           6026    22436    69570   110262

In Table 1, “symbols” is the number of entries in the symbol table that the linker uses internally to store information about all external symbols. Note that “proj4” is much bigger than the others.
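
To put "much bigger" into numbers, the following sketch works directly from the figures in Table 1. The dictionary layout and variable names are just one convenient way to hold the data; only the numbers come from the table.

```python
# Figures copied from Table 1.
table1 = {
    "proj1": {"obj": 4, "lib": 51, "total": 55, "symbols": 6026},
    "proj2": {"obj": 6, "lib": 21, "total": 27, "symbols": 22436},
    "proj3": {"obj": 7, "lib": 161, "total": 168, "symbols": 69570},
    "proj4": {"obj": 882, "lib": 184, "total": 1066, "symbols": 110262},
}

# Sanity check: .obj and .lib counts add up to the total for every project.
for name, row in table1.items():
    assert row["obj"] + row["lib"] == row["total"], name

# Symbol count of each project relative to the smallest one (proj1).
symbol_ratio = {
    name: row["symbols"] / table1["proj1"]["symbols"]
    for name, row in table1.items()
}

# Fraction of input files that are .obj files rather than libraries.
obj_share = {name: row["obj"] / row["total"] for name, row in table1.items()}

for name in table1:
    print(f"{name}: {symbol_ratio[name]:.1f}x symbols, "
          f"{obj_share[name]:.0%} .obj files")
```

Besides having roughly 18 times as many symbols as proj1, proj4 is also the only test case whose inputs are mostly .obj files; the other three consist mainly of libraries.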

Test Environment

The test machine was configured as follows:

- Hardware
  - CPU: Intel Xeon 3.20 GHz, 4 cores
  - RAM: 2 GB
- Software: Windows Vista 32-bit

Results

To minimize the effect of the environment, each test case was run five times. All times are in seconds.

Table 2 and Table 3 show that, for each test case, one run (usually the first, marked in red) takes much longer than the others, while another run (marked in green) may take much less time. There are two reasons for this:

- The OS caches a file’s content in memory for the next read (called Prefetch on Windows XP and SuperFetch on Windows Vista).

- Most modern hard disks cache a file’s content for the next read.
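
One way to see the caching effect is to compare the first run with the average of the later ones. The sketch below uses the first four Proj1 runs from Table 2; treating run 1 as the cold-cache run is an assumption based on the description above, and the helper name `cold_vs_warm` is made up for the example.

```python
from statistics import mean

# First four Proj1 runs from Table 2 (non-LTCG, /DEBUG on), in seconds.
proj1_runs = [
    {"pass1": 4.437, "pass2": 2.328},
    {"pass1": 0.266, "pass2": 1.218},
    {"pass1": 0.265, "pass2": 1.188},
    {"pass1": 0.265, "pass2": 1.219},
]


def cold_vs_warm(runs, phase):
    """Return (cold time, mean warm time, cold/warm ratio) for one phase."""
    cold = runs[0][phase]  # assumed to be the cold-cache run
    warm = mean(run[phase] for run in runs[1:])
    return cold, warm, cold / warm


for phase in ("pass1", "pass2"):
    cold, warm, ratio = cold_vs_warm(proj1_runs, phase)
    print(f"{phase}: cold {cold:.3f}s, warm {warm:.3f}s, ratio {ratio:.1f}x")
```

For these runs the cold-run penalty is far larger for Pass1 than for Pass2, which fits the explanation that the slow run is dominated by reading input files that are not yet cached.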

 

Comparing Table 2 with Table 3, we can see that Pass2 takes much less time when /DEBUG is off. This indicates that most of the Pass2 time is spent writing the PDB file.

Table 2 Test results of non-LTCG with /DEBUG on (times in seconds)

Project  Run   Pass1   Pass2   Total
Proj1    1     4.437   2.328   6.765
         2     0.266   1.218   1.484
         3     0.265   1.188   1.453
         4     0.265   1.219   1.484