Dependency tracking in builds

Happy Valentine's Day, everyone! Today I'm going to talk about the important problem of successfully tracking dependencies in builds in an automatic manner.

In short, there are two kinds of build systems: the ones where all dependencies are tracked successfully and incremental rebuilding is available at all times, and the ones where you have to rebuild everything whenever you change anything remotely significant. There is nothing in between - as soon as a build system fails to track dependencies in one circumstance, developers will become paranoid and insist on costly full rebuilds in the future. After all, the consequences of a partial rebuild are insidiously difficult to diagnose, resulting in broken builds, symbol mismatches, exceptions while loading assemblies, mixtures of interacting new and old code, and other nonsense that we don't want to waste our time debugging. This means that a dependency tracking system has to be very close to perfectly reliable, which can make it very difficult to design and implement successfully.

Although the problem is a very old one, with tools like GNU make dating back decades, I believe in most project environments dependency tracking is not solved to the level of reliability that is needed for this kind of trust. Behind all of these failed dependency-tracking systems is one cardinal sin:

Dependencies cannot be maintained by hand.

I'm not merely saying maintaining dependencies by hand is difficult or even error-prone, but that it's damn near impossible. In C/C++ in particular, adding a #include statement to a header file implicitly adds a dependency to every source file that includes that header, directly or indirectly. Even worse is removing a #include: some of the files that depended on it may no longer need it, while others still will, depending on what they directly or indirectly include. Waste all the time you want on this - you won't get it right. Even in C#, Java, or Visual Basic, which thankfully have no concept of separate header and source files and load an entire project for compilation at once, interproject dependencies can become just as complex in large solutions.

Even worse are systems which require you to explicitly specify not only the dependencies but the build order. Often hierarchically structured solutions of many projects will use pretty good dependency tracking within each project, but revert to this "build these things in this order" model for the whole tree. The result is that related modules in different parts of the tree cannot be built together incrementally, and highly parallelized building becomes impossible. Scalability demands that we create some level of abstraction for solution-level builds, but that doesn't mean we have to sacrifice incremental building altogether.

The alternative to maintaining something by hand is of course maintaining it automatically. The code is already quite explicit about what depends on what - whether in the form of #include statements, import statements, assembly references, or other means. The fundamental principle of avoiding redundancy tells us that we should be generating dependency information from these sources. For example, the following buried treasure from the GNU make manual details how to use the -M flag of gcc together with some (rather esoteric) script to automatically track header dependencies in C:

The practice we recommend for automatic prerequisite generation is to have one makefile corresponding to each source file. For each source file `name.c' there is a makefile `name.d' which lists what files the object file `name.o' depends on. That way only the source files that have changed need to be rescanned to produce the new prerequisites.

Here is the pattern rule to generate a file of prerequisites (i.e., a makefile) called `name.d' from a C source file called `name.c':

 
    %.d: %.c
            @set -e; rm -f $@; \
             $(CC) -M $(CPPFLAGS) $< > $@.$$$$; \
             sed 's,\($*\)\.o[ :]*,\1.o $@ : ,g' < $@.$$$$ > $@; \
             rm -f $@.$$$$

https://www.gnu.org/software/make/manual/html_mono/make.html#SEC51

The concept here is that the dependencies for each source file are stored in a separate file which itself depends on that source file. If the source file is modified, the ".d" file containing that source file's dependencies is rebuilt before it is loaded by make. In this way the dependencies can never be wrong or out of date for even a moment. This concept could be generalized to a pluggable framework that enables you to determine build-time dependencies of arbitrary files using arbitrary code.
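
As a concrete (if toy) illustration, here is what such a pluggable framework might look like in Python. Everything here is invented for the sketch - the SCANNERS registry, the scanner decorator, the quoted-#include-only regex - and a real implementation would also cache each file's result in a sidecar file, just as the ".d" trick above does, so that only changed files get rescanned.

    import os
    import re

    # Registry of scanner functions keyed by file extension
    # (a hypothetical design, not any existing tool's API).
    SCANNERS = {}

    def scanner(ext):
        def register(fn):
            SCANNERS[ext] = fn
            return fn
        return register

    @scanner(".c")
    @scanner(".h")
    def scan_c(path):
        """Return the files directly #included (quoted form only) by a C file."""
        deps = []
        here = os.path.dirname(path)
        with open(path) as f:
            for line in f:
                m = re.match(r'\s*#\s*include\s+"([^"]+)"', line)
                if m:
                    deps.append(os.path.join(here, m.group(1)))
        return deps

    def dependencies(path):
        """Dispatch to the scanner registered for this file type, if any."""
        scan = SCANNERS.get(os.path.splitext(path)[1], lambda p: [])
        return scan(path)

Adding support for a new file type is then just a matter of registering another function; the build engine itself never needs to know what a #include or an import statement looks like.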

We can play similar tricks to dynamically track the dependencies of projects in a solution. Although this is one level of abstraction higher, it's really just the same thing with source files replaced by projects. We analyze the projects to determine their dependencies at build-time, set up a build order, and then locally build each project using its own local dependency structure. This enables automatic dependency tracking to scale effectively. With a solution like this, you can simply sync up the latest changes from the repository and do an incremental build from root in minimum time.
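
Here's a hedged sketch of the project-level analysis, assuming a made-up project file format in which each project declares its references on lines like "reference OtherProject"; a real system would read .vcproj/.csproj files or their equivalents instead.

    import os
    import re

    def project_references(project_file):
        """Extract referenced project names from one project file.
        Assumes a made-up format with lines like: reference OtherProject"""
        refs = []
        with open(project_file) as f:
            for line in f:
                m = re.match(r"\s*reference\s+(\S+)", line)
                if m:
                    refs.append(m.group(1))
        return refs

    def solution_graph(project_files):
        """Map each project name to the set of projects it depends on."""
        return {
            os.path.splitext(os.path.basename(path))[0]: set(project_references(path))
            for path in project_files
        }

Feed that graph to a scheduler like the ones described below, and each project can be built locally - with its own file-level dependency tracking - as soon as its references are up to date.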

Dependency tracking isn't just useful for builds either - it can often be used to evaluate the impact of a change. For example, if I change a header file, a good dependency tracking system should be able to name for me all the affected source files in the project, as well as any other project that might be affected. In the end I might decide to go with a less destabilizing change instead. Dependency information can also be incorporated into version control systems, where it can help evaluate whether concurrent changes made by other people will break what you're working on right now. This is largely unexplored territory.
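
Impact analysis is really just a reachability query over the dependency graph. A small sketch, with invented file names, of computing everything downstream of a change:

    from collections import deque

    def affected_by(changed, dependents):
        """Return every node that transitively depends on `changed`.
        `dependents` maps each node to the nodes that directly depend on it."""
        seen = set()
        queue = deque([changed])
        while queue:
            node = queue.popleft()
            for d in dependents.get(node, ()):
                if d not in seen:
                    seen.add(d)
                    queue.append(d)
        return seen

    # Touching config.h forces everything downstream of it to rebuild.
    dependents = {"config.h": ["util.c", "main.c"], "util.c": ["util.o"],
                  "main.c": ["main.o"], "util.o": ["app.exe"], "main.o": ["app.exe"]}
    print(sorted(affected_by("config.h", dependents)))
    # ['app.exe', 'main.c', 'main.o', 'util.c', 'util.o']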

At Microsoft, a new build system called MSBuild has recently been developed and released with Visual Studio 2005. Currently used for almost all of the Visual Studio build, it will hopefully in time replace older ad hoc systems based on batch files across the company. One of its advantages is its incremental building support, described here:

https://msdn2.microsoft.com/en-us/library/ms171483.aspx

Although MSBuild is a sophisticated system, even its support for incremental building is relatively limited, focusing on dependencies within a project between files that have direct input-to-output mappings.

Build scheduling algorithms

Let's talk a little bit about the technology behind build scheduling - a simple problem, but one with a number of interesting generalizations. To simplify the discussion, we'll refer to both inputs and targets/outputs as "nodes". Thus, source files, header files, binary outputs, object files, resource files, and so on are all nodes. We can describe this as a directed graph where each vertex is a node and each edge from A to B says that "B depends on A". In simple terms, this is just a drawing where we have a circle for each node and an arrow pointing from each node to nodes that depend on it. This is called the dependency graph, and is the fundamental data structure used in build scheduling (as well as other types of scheduling).
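
In code, the dependency graph can be represented as nothing more than a map from each node to its dependents. A tiny example, with invented file names:

    # An edge A -> B means "B depends on A"; each key maps to its dependents.
    dependency_graph = {
        "parser.h":     ["parser.c", "main.c"],
        "parser.c":     ["parser.o"],
        "main.c":       ["main.o"],
        "parser.o":     ["compiler.exe"],
        "main.o":       ["compiler.exe"],
        "compiler.exe": [],
    }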

The simplest scheduling problem is sequential scheduling, where we have a one-processor machine and we need to figure out what order to build things in. In this case, we can use topological sorting: we start by building all the things with no dependencies, then proceed to build the things whose only dependencies are things we have already built, and so on. There are efficient algorithms for determining this order in advance.
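
Kahn's algorithm is the classic way to do this. A minimal sketch, reusing the map-of-dependents representation from above:

    from collections import deque

    def topological_order(dependents):
        """Return a build order, or raise if the graph has a cycle.
        `dependents` maps each node to the nodes that depend on it."""
        # Count how many unbuilt dependencies each node still has.
        remaining = {node: 0 for node in dependents}
        for deps in dependents.values():
            for d in deps:
                remaining[d] = remaining.get(d, 0) + 1

        ready = deque(n for n, count in remaining.items() if count == 0)
        order = []
        while ready:
            node = ready.popleft()
            order.append(node)
            for d in dependents.get(node, ()):
                remaining[d] -= 1
                if remaining[d] == 0:
                    ready.append(d)
        if len(order) != len(remaining):
            raise ValueError("dependency cycle detected")
        return order

    print(topological_order({"a.h": ["a.c"], "a.c": ["a.o"], "a.o": ["app"], "app": []}))
    # ['a.h', 'a.c', 'a.o', 'app']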

In large solutions, however (like, say, Windows), building everything on one machine could take weeks or months. In this case, we'd like to use many machines, each with many processors, to build different parts of the product simultaneously. However, they still have to respect the build-time dependencies between components. This is where things get interesting: we run into many of the same problems as scheduling instructions on microprocessors, and even into some NP-complete problems. A couple of common heuristics, the Decreasing-Time Algorithm and the Critical-Path Algorithm, are described here:

https://www.ctl.ua.edu/math103/scheduling/scheduling_algorithms.htm

Both of these require an estimate of how long each step will take, which is not too difficult to generate automatically from metrics like lines of code. The concept behind the Critical-Path Algorithm is that if a long chain of dependencies is waiting on a particular node to be built, that's a good reason to build that node first, even if it doesn't take long to build itself. Finding a truly optimal parallel schedule is an NP-complete problem, but these heuristics themselves run quickly - the longest paths they rely on are cheap to compute on an acyclic dependency graph - and they usually come close enough in practice.
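
To make the critical-path idea concrete, here is a hedged sketch of a list scheduler in Python: each node's priority is its own estimated time plus the longest chain of dependents hanging off it, and whenever a machine goes idle it takes the highest-priority ready node. This is only an illustration of the heuristic, not the exact algorithm from the page above, and it assumes the graph is acyclic and at least one machine is available.

    import heapq

    def critical_path_schedule(dependents, duration, machines):
        """Critical-path list scheduling (sketch; assumes an acyclic graph).
        `dependents` maps node -> nodes that depend on it;
        `duration` maps node -> estimated build time."""
        assert machines > 0
        # Priority = a node's own time plus the longest chain of dependents below it.
        priority = {}
        def longest(node):
            if node not in priority:
                tails = [longest(d) for d in dependents.get(node, ())]
                priority[node] = duration[node] + (max(tails) if tails else 0)
            return priority[node]
        for node in dependents:
            longest(node)

        # Count how many unbuilt dependencies each node still has.
        remaining = {n: 0 for n in dependents}
        for deps in dependents.values():
            for d in deps:
                remaining[d] = remaining.get(d, 0) + 1
        ready = [n for n, count in remaining.items() if count == 0]

        idle = list(range(machines))
        running = []          # heap of (finish_time, machine, node)
        clock, schedule = 0, []
        while ready or running:
            # Hand the highest-priority ready nodes to the idle machines.
            ready.sort(key=lambda n: priority[n], reverse=True)
            while idle and ready:
                node = ready.pop(0)
                machine = idle.pop()
                schedule.append((clock, machine, node))
                heapq.heappush(running, (clock + duration[node], machine, node))
            # Jump to the next completion and release anything it was blocking.
            finish, machine, node = heapq.heappop(running)
            clock = finish
            idle.append(machine)
            for d in dependents.get(node, ()):
                remaining[d] -= 1
                if remaining[d] == 0:
                    ready.append(d)
        return clock, schedule

    # A made-up four-project solution built on two machines.
    deps = {"base.lib": ["ui.exe", "svc.exe", "tests.exe"],
            "ui.exe": ["tests.exe"], "svc.exe": ["tests.exe"], "tests.exe": []}
    times = {"base.lib": 4, "ui.exe": 3, "svc.exe": 2, "tests.exe": 1}
    makespan, plan = critical_path_schedule(deps, times, machines=2)
    print(makespan)  # 8: base.lib first, ui.exe and svc.exe in parallel, then tests.exe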

Note that one can't just throw arbitrary hardware at a scheduling problem; there is a limit beyond which additional machines simply sit idle, and well before that point the benefit of each extra machine shrinks rapidly. Part of the challenge of parallel scheduling is finding just the right number of systems to maximize cost-benefit.

There are even fancier variants of this where you have machines that build at different speeds, machines that can build some types of files but not others, and so on. I won't get into these here, but this discussion should make one thing clear: a developer doesn't want to waste precious development time figuring all this stuff out. Build order should always be determined by build technology that can be relied upon to be correct, to do efficient incremental building, and to scale up to large builds and large build clusters well, all automatically.

So why do we keep using hacked-up internal build systems made of Perl scripts and anarchy? I don't know, but if you decide to make something better, please come sell it to my group!