Rejuvenating the Microsoft C/C++ Compiler

Our compiler is old.  There are comments in the source from 1982, which was when Microsoft was just starting its own C compiler project.  The comments of that person (Ralph Ryan) led me to a paper he published in 1985 called “The C Programming Language and a C Compiler”.  It is an interesting read and some of what he describes is still reflected in the code today.  He mentions that you can compile C programs with two floppy drives and 192K of RAM (although he recommends a hard drive and 256K of RAM).  Being able to run in that environment meant that you couldn’t keep a lot of work in memory at a time.  The compiler was designed to scan programs and convert statements and expressions to IL (intermediate language) as quickly as possible and write them to disk without ever having an entire function in memory at one time.  In fact, the compiler will start emitting IL for an expression before even seeing the end of the expression.  This meant you could compile programs that were quite large on a pretty small machine.

Note: Our compiler consists of two pieces (a front-end and a back-end).  The front-end reads in source code, lexes, parses, does semantic analysis and emits the IL.  The back-end reads the IL and performs code generation and optimizations.  The use of the term “compiler” in the rest of this post pertains only to the front-end.

For C code (especially K&R C), this approach worked well.  Remember, you didn’t even need to have prototypes for functions.  Microsoft added support for C++ in C/C++ 7.0, which was released in 1992.  It shared much of the same code as the C compiler and that is still true today.  Although the compiler has two different binaries (c1.dll and c1xx.dll) for C and C++, there is a lot of source code shared between them.

At first, the old design of the compiler worked OK for C++.  However, once templates arrived, a new approach was needed.  The method chosen was to do some minimal parsing of a template and then capture the whole template as a string of tokens (very similar to how macros are handled in the compiler).  Later, when a template is instantiated, that token stream is replayed through the parser and the template arguments are substituted.  This approach is the fundamental reason why our compiler has never implemented two-phase lookup.

The design of our compiler also made it unsuitable for other purposes where you wanted to retain more information about a program.  When we added support for static analysis (/analyze) in the compiler, it was added to the same code base as the actual compiler, but the code was under #if blocks and we generated separate binaries (c1ast.dll and c1xxast.dll).  Over time, this resulted in more than 6,000 #if preprocessor blocks.

The static analysis tools built an AST for an entire function by capturing pieces as the regular compiler did its parsing.  However, this captured AST was fundamentally different from the data structures the real compiler uses, which often led to inconsistencies.  Also, as new language features were added, most had to be implemented twice: once for the compiler and again for static analysis.

About three years ago we embarked on a project to finally perform a major overhaul of our compiler codebase.  We wanted to fix problems we have had for a long time and we knew new features such as constexpr were going to need a different approach.  The goal was to fundamentally change the way our compiler parses and analyzes code.

We quickly decided on a few key tenets to guide our development.  The most important tenet is that all rejuvenation work that we do will be done in the same development branch as features.  We don’t want to “go dark” and have two divergent codebases that are difficult to reintegrate.  We also want to see value quickly, and in fact, we need value quickly.

The first phase of this work has finally shipped in Visual Studio 2015.  We have changed a lot of the guts in the compiler’s internal implementation, although not much is directly visible.  The most visible change is that c1ast.dll and c1xxast.dll are no longer present.  We now handle all compilation for static analysis with the same binary we use for code generation.  All 6,000+ #if blocks are gone, replaced by fewer than 200 runtime checks for analysis.  This large change is why code analysis was disabled in some of the RC builds of the C++ compiler: we ripped out the #if blocks and then had to build the new infrastructure in their place.

The result of this is that we now generate a full tree for functions and can use that same data structure to generate code or to perform static analysis.  These same trees are used to evaluate constexpr functions as well, which is a feature we just shipped.  We also now track full source position information (including column) for all constructs.  We aren’t currently using column information but we want to be able to provide better diagnostics in the future.

As we make these changes, we strive to provide as much backward compatibility as we can while fixing real bugs and implementing new features in our compiler.  We have an automated system called Gauntlet that consists of over 50 machines that builds all versions of the compiler and runs many tests across all flavors of the 32-bit, 64-bit, and ARM architectures, including cross compilers.  All changes must pass Gauntlet before being checked in.  We also regularly run a larger set of tests and use our compiler on “real world code” to build Visual Studio, Office, Windows, Chrome, and other applications.  This work flushes out additional compatibility issues quickly.

Looking forward, we are continuing to invest in improving our compiler.  We have started work on parsing templates into an AST (abstract syntax tree), which will yield some immediate improvements in our support for expression SFINAE and our parsing of “qualified names”.  We will continue this investment with the goal of making the compiler fully standards compliant.  That said, we are also very interested in improving our support for Clang.  In fact, there is a presentation at CppCon on using the Clang front-end with our code generator and optimizer.  Here is the link to that session.

–Jim Springfield