My name is Alex Thaman and I am a Senior Test Lead on the Visual C++ compiler team at Microsoft. The focus of this blog is testing of the compiler backend, where I’ve spent a good portion of my time here. For those not aware, this is the part of the compiler that takes an intermediate representation as input, performs optimizations, and generates code.
I will walk you through the compiler backend testing domain, the kinds of bugs that the backend compiler deals with and how we go about testing the backend compiler.
The Compiler Testing Domain
Compiler testing exists in a different domain than many other kinds of application or system testing. Here are a few attributes of compilers that inform how we think about testing:
· Compilers have short-lived execution times, like most other command-line tools. By “short” I mean that the compiler runs, does some work, outputs some files, and exits.
· There is no user interaction during execution.
· Compilers execute in phases: each phase applies a transformation to its input, and its output becomes the input of the next phase, which makes the phases interdependent. This also means it can be difficult to construct test cases that reach specific code paths, especially in the later compiler phases.
· Many compilers do analysis at the whole-program level, so data about the entire input set may be sitting in memory and may all be operated on at once.
· With some exceptions, compiler outputs cannot easily be verified by inspection or by other test tools; they need to actually be executed. One reason is that there are many correct outputs in terms of machine instructions; another is that the output can change from day to day and still be correct; and lastly, the output is very large, which compounds the first two issues.
Below are the categories of bugs that we deal with on a day-to-day basis in the compiler:
· Compiler crashes (also known as an ICE or Internal Compiler Error) – This is simply some kind of failure during execution of the compiler
· Compiler hangs – Some kind of infinite loop in the compiler. Because the compiler backend is single-threaded, there is no possibility of application-level deadlocks
· Incorrect error/warning output – This can be either an error/warning that fired when it should not have, or an error/warning that should have fired but did not. The latter case is interesting because it is often not so much a bug as a limitation of the feature, at least in the case of warnings. Some warnings require extensive code analysis to fire in all the cases that they should. We do make efforts to ensure that we are giving customers the best possible information when they have done something incorrect.
· Bad code generation – This is a result of incorrect compiler output and is by far the most devastating bug of any kind. There are two classes of bad code generation:
a) Bad code generation that leads to a crash in the application – This is the less problematic of the two cases, both because of its effects and because of its discoverability. The effect is an application crash, which in many cases is resolved with a restart and does not corrupt data. It is (typically) easy to pinpoint the bad code generation because the crash gives you a call stack, and you can see in front of you what got corrupted.
b) “Silent” bad code generation – These are the worst kind of bugs, not only because they can result in data loss but because tracking them down is in some cases extremely difficult: it’s not always easy to find what got corrupted, or where. You can imagine that this problem is worse for a multithreaded app – we have seen silent bad codegen bugs where a variable’s volatile attribute was not honored in a certain loop. Sometimes silent bad codegen is the result of what should have been a crash due to overwriting invalid memory, except that the memory location happens to contain valid data. An example would be if you set i = 100 but, instead of writing to i, the generated code overwrites j. This is just an example – it never manifests in so simple a form.
· Compiler throughput issues – Issues that affect the amount of time the compiler takes to compile code
· Code quality issues – Issues that affect the performance of the compiled application
· Compiler feature correctness issues – This class of bugs involves the compiler generating correct code, but not doing what a particular feature specifies should be done. An example here would be not adding a security cookie when the /GS switch is passed. In this case, the code would execute just fine but would not have the same buffer overrun security protection that the user would expect
· Other peripheral behavioral issues – There are things related to compiling that can be affected by compiler bugs. The biggest example is debugging information, where the result is typically seen in the form of the debugger not doing what you would expect
One last thing to note: the most interesting testing space for the backend is optimized code specifically. A non-optimizing compiler does not do all that much work; though we do test /Od compilation, we don’t spend a lot of time on it and it generates far fewer bugs. The same goes for monitoring /Od build times and /Od code generation. Note that Debug build times in an end-to-end scenario are measured quite often, but the compiler backend generally contributes only a small portion of that time.
How Do We Test?
Now that I’ve explained the compiler testing space and the kinds of bugs we deal with, I’ll explain how we actually test the backend.
We create A LOT of tests, really A LOT of tests :-). We’re talking on the order of hundreds of thousands of small tests. To understand why, try to think of how you might test exception handling (EH). You might come up with simple cases involving a test throwing an exception and catching it, throwing and not catching, simple nested exceptions, etc. These are pretty basic, and would constitute tests that a developer would run before every check-in just to make sure the product works at a fundamental level. We also have a much larger suite of tests to verify that our compiler’s EH code generation is ready for production. Without going into too many details, we have to ensure that throwing various kinds of objects (including ones with copy constructors and destructors that get called during stack unwinding), dealing with weird control flow around EH (what happens when you have a goto from a handler to outside the try?), etc. all work as the user would expect. You can see that this matrix of cases can explode. I have not actually counted, but I would guess that we have a few thousand tests that involve EH.
To add to this matrix, the compiler also has a few switches that have big effects on what is done with the code. The most interesting “set” of switches is /Od, /O1, /O2, /O2 /GL, /O2 /GL /link /ltcg:pgu. We run this matrix as a permutation of most of our tests. We have many other switches, but those are tested in a more localized fashion since applying them broadly to most tests is not as interesting as the optimization controls. The last big dimension of our matrix is /clr – almost all of the tests that we have that are supported under /clr are tested with this switch as well.
Given the infinite set of inputs, there isn’t a systematic way to test everything in the compiler in an efficient way. However the C++ compiler has a big advantage in terms of testing – people have already written test cases for us! Anyone who writes code has a test case. As a result we rely heavily on what we call “real world code” (RWC), which are just real under-development applications. You can be assured that the C++ compiler we ship to you in Visual Studio 2010 has already successfully built Visual Studio itself, Windows, SQL, Office, and many other large software applications. Another advantage is that this code is under active development, which means that every day is a new test case as the developers churn on the code. We frequently release our compiler to internal developers and fix the bugs that we get from these developers. What is released in Visual Studio 2010 has been through an extensive grind within Microsoft.
Performance testing is a critical part of what we do. For the optimizing backend, this tells us how well our features are working. Because the output of the compiler involves performance, there are actually three forms that performance takes on the backend team.
1. The first is what we call “code quality”. This is typically the speed of the code that is generated. We use a diverse set of performance tests to cover various kinds of code that we might be compiling in the real world. In most cases we are hyper-sensitive to noise, and even 1% noise on these benchmarks makes it difficult for us to see how we are doing, because true performance regressions can often come in 1% increments.
2. The second is “code size”. This is actually closely tied to code quality. Code size alone is only somewhat interesting in that we don’t want to generate very large images, but it often correlates to code quality. Optimizations will often trade more code for faster code, the two easiest examples being the inliner (which will reduce call overhead and provide additional optimization opportunities) and the loop unroller. One disadvantage we have in measuring code quality is that it requires execution of the code in question. Code size can be measured by just building. We will not always try to reduce code size for every change we make, but we will watch it as an indicator of how we are changing things. We will typically measure code size on the benchmarks that we already use for code quality plus some of our real-world code.
3. The third is “throughput”. This is a measure of how fast the compiler runs, i.e. the build time. This is most interesting to watch when optimizations are turned on because this constitutes the bulk of the execution time in the compiler. We realize that build time is important to you and this we keep an eye on this metric to make sure that you remain productive.
With all of that said, because there are just so many code patterns and possibilities, we still can’t catch everything with those efforts alone. This requires us to start getting more creative with the testing. One area that has shown the most promise is running compiler stress modes. There are two classes of stress modes that we have:
1. Ones that actually change the input in some way, such as adding a try/finally around the body of every function, marking every variable as volatile, etc.
2. Ones that change heuristics for optimizations/analysis, such as inlining every function instead of only the ones that give a benefit.
Stress modes are extremely effective in taking existing tests we do have (including real world code) and creating new and interesting cases out of them. There are two main challenges with stress modes:
1. Not increasing the size of the matrix too dramatically – specifically, determining which tests should be run with each stress mode.
2. Making sure a stress mode does something that is legal. That is, a test run under a stress mode might fail only because the stress mode did something that makes the code incorrect. Depending on the case, this could be considered a bug in the stress mode, or an incompatibility between the test and the stress mode.
These aren’t major barriers, just things to keep in consideration, as they require that we be selective about which tests we use with our stress modes.
Auto Test Generation
In certain cases, we can use test-case generation tools to assist us in testing parts of the compiler. One of our team members created a generator for exception handling tests. Because the matrix is so large for exception handling, and because the cases are easy to construct by modeling the EH code as a tree, we were able to create many different cases on the fly by generating trees whose nodes involved various constructs that later turned into C++ exception-handling code.