In the exploration of quality, it is important to understand where software testing came from, where it is today, and where it is heading. We can then compare this trajectory to the goal of ensuring quality and see whether a correction is necessary or if we’re going the right direction.
I have been involved in software testing for the past 16 and one half years. To give you some perspective, when I started at Microsoft, we were just shipping Windows 98. This gives me a lot of perspective on the history of software testing. What I give below is my take that that history. Others may have experienced it differently.
There are three major waves of software testing and we’re beginning to approach a 4th. The first wave was manual testing. The second wave was automated testing. The third wave was that of tooling. It is important to note that each wave does not fully supplant the previous wave. There is still a need for manual testing even in the tooling phase. There is a need for automated testing even in the coming 4th wave.
The first wave was manual testing. Sometimes this is also called exploratory testing. It is often carried out by people carrying the Quality Assurance (QA) or Software Test Engineer (STE) title. Manual testing is just what it sounds like. It is people sitting in front of a keyboard or mouse and using the product. In its best form it is freeform and exploratory in nature: a tester trying to understand the user and carrying out operations trying to break the software. This is where the lore of testing comes from. The uber-tester who can find the bug no one else can imagine. In its worst form, this is the rote repetition of the same steps, the same levels, over and over again. This is also the source of legends, but not good ones. At its best, this form of testing is highly connected to quality. It is all about the user and his (or her) experience with the product.
Manual testing can produce great user experiences. As I understand it from friends who have gone there, this is the primary method of testing at Apple. The problem is, manual testing doesn’t scale. It can also be mind-numbing. In the era of continuous integration and daily builds, the same tests have to be carried out each day. It becomes less about exploring and more about repeating. Manual testing is great for finding the bugs initially, but it is a terribly inefficient regression testing model.
It gets even worse when it comes time for software maintenance. At Microsoft, we support our software for a long time. Sometimes a really long time. Windows XP shipped in 2001 and is just now becoming unsupported. Consider for a moment how many testers it would take to test XP. I’ll just make up a number and say it was 500. It wasn’t, but that’s good enough for a thought exercise. Every time you release a fix for XP, you need 500 people to run through all the tests to make sure nothing was broken. But the 500 original testers are probably working on Vista so you need an additional 500 people for sustained engineering. Add Windows 7, Windows 8, and Windows 8.1 and you now need 2,500 people testing the OS. Most are running the same regression tests every time which is not exciting so you end up losing all your good people. It just doesn’t work.
This leads us to the second wave. The first wave involved hundreds of people pressing the same buttons every day. It turns out that computers are really good at repetitive tasks. They don’t get bored and quit. They don’t get distracted and miss things. Thus was born test automation. Test automation in its most basic form is writing programs to do all of the things that manual testers do. They can even do things testers really can’t. Manual testing is great for a user interface. It’s hard to manually test an SDK. It turns out that it is easy to write software that can exercise APIs. This magic elixir allowed teams to cover much more of the product every day. We fully drank the Kool-Aid.
We set off to automate everything. There was nothing automation couldn’t do and so all STEs were let go. Everyone become a Software Design Engineer in Test (SDET). This is a developer who, rather than writing the operating system, writes tests for the automation. Some of this work is mundane: calling the Foo API with a 1, a 25, and a MAX_INT. Other parts can be quite challenging. Consider how you would test the audio playback APIs. It is not enough to merely call the APIs and look at the return code. How do you know the right sound was played, at the right volume, and without crackling? Hint: it’s time to break out the FFT.
Not everything is kittens and roses in the world of automation. Machines are great at doing what they are told to do. They don’t take breaks or demand higher pay. However, they do only what they are told to do. They will only report bugs they are told to look for. One of my favorite bugs to talk about involved media player. In one internal build, every time you clicked next track on a CD (remember those things?) the volume would jump to maximum. While a test could be concocted to look for this, it never would be. Test automation happily reported a pass because indeed the next track started playing. It turns out that once you have run a test application for the first time, it is done finding new bugs. It can find regressions, but it can’t notice the bug it missed yesterday.
The points toward the second problem with test automation. It requires very complete specifications. Because the tests can’t find any bugs they weren’t programmed to find, they need to be programmed to find everything. This requires a lot more up-front planning so the tests can cover the full gamut of the system under test. This heavy reliance upon specifications begins to distance testing from the needs of the user and thus we move away from testing quality and toward testing adherence to a spec.
The third problem with test automation is that it can generate too much data. Machines are happy to churn out results every day and every build. I knew teams whose tests would generate millions of results for each build. Staying on top of this becomes a full time job for many people. Is this failure a bug in the test? Is it a bug in the product? Was it an environmental issue (network down, bad installation, server unavailable)?
One other problem that can happen is that the work can grow faster than test developers can keep up. It is easy for a developer to write a little code which creates a massive amount of new surface area. Consider the humble decorator pattern. If I have 4 UI objects in the system, the tester needs to write 4 sets of tests. Now if the developer creates a decorator which can apply to each of the objects, he only has to write one unit of code to make this work. This is the advantage of the pattern. However, the tester has to write 4 sets of tests. The test surface is growing geometrically compared to the code dev is writing. This is unsustainable for very long.
This brings us to the third wave of testing. This wave involves writing software that writes tests. I call this the tooling phase. Rather than directly writing a test case, it is possible to write a tool that, given some kind of specification, can emit the relevant test cases automatically. Model Based Testing is one form of this tooling. The advantage of this sort of tooling is that it can adapt to changes. Dev added one decorator to the system? Add one new definition in your model and tests just happen.
There are some downsides to the tooling approach. In fact, there are enough downsides that I’ve never experienced a team that adopted it for all or even most of their testing. They probably exist, but they aren’t common. At most, this tooling approach was used to supplement other testing. The first downside is the oracle problem. It is easy enough to create a model of the system under test and generate hundreds or thousands of test cases. It is another thing entirely to understand which of these test cases pass and which fail. There are some problem domains where this is a tractable problem. Each combination or end state has an easily discernible outcome. In others, it can be exceptionally difficult without re-creating all of the logic of the system under test. The second is that the failures can be very hard to reason about in terms of the user. When the Bar API gets this and that parameter while in this state, it produces this erroneous result. Okay. But when would that ever happen in the real world?
Tooling approaches can solve the static nature of testing mentioned above. Because it is mathematically impractical to do a complete search of the state space of any non-trivial application or API, we are always limited to a subset of all possible states for testing. In traditional automation, this subset it fixed. In the tooling approach, the subset can be modified each time with random seeds, longer exploration times, or varying weights. This means each run can expose new bugs. This can be used to good effect. Given some metadata about an API and rules on how to call it, a tool can be created to automatically explore the API surface. We did this to good effect in Windows 8 when testing the Windows Runtime surface.
Sometimes it can have unexpected and even comical outcomes. I recall a story told to me my a friend. He wrote a tool to explore the .Net APIs and left it to run overnight. The next morning he came in to reams of paper on his desk. It turns out that his tool had discovered the print APIs and managed to drain every sheet of paper from every printer in the building. At Microsoft every print job has a cover sheet with the alias of the person doing the printing so his complicity was readily apparent. Some kind soul had gathered all of his print jobs and placed them in his office.
The tooling approach to testing exacerbates two of the problems of automation. It creates even more test results which then have to be understood by a human. It also moves the testing even further away from our definition of quality. Where is the fitness for a function taken into account in the tooling approach?
There is a problem developing in the trajectory of testing. We, as a discipline, have moved steadily further from the premise of quality. We’ll examine this in more detail in the next post and start considering a solution in the one after that.