Test failures


For more than a decade, Microsoft (and the whole software industry) has been evolving how it tests software. The goals have been to increase quality, reduce cost, and be more responsive to the market. The areas of focus have been automation, telemetry, changing the role of testers, and reducing the number of testers. So, how’s it going? Not great.

We’ve certainly increased automation, enhanced telemetry, and changed the role and count of testers. For some teams, this has substantially enhanced responsiveness, reduced costs, and even increased quality. However, other teams have struggled, and customers have experienced embarrassing issues that presumably would have been caught by good fundamental testing.

Do we need to bring back the old ways of testing? No; for better or worse, that time has passed. Even teams that used to have more testers than developers shipped embarrassing bugs to customers. Now, some teams are responsive to customers daily, which simply wasn’t possible with week- or month-long test passes. The question isn’t, “Do we need to go back to old-world testing?” The question is, “How do teams achieve high quality in the new world?”

How did I get here?

When I joined Microsoft 20 years ago, there were three primary engineering disciplines: program management (PM), development, and test. The test team was typically about the same size as the dev team. While dev teams wrote some unit tests, and test teams automated system tests, much of the testing was manual. The belief was that using the product like a customer was the only way to validate the actual customer experience adequately. Since there are a wide variety of customers, teams needed a wide variety of test cases, so test passes were long and remarkably detailed.

Over the next 10 years, reliance on automation increased in order to make test passes faster and more repeatable. This effort reached its pivotal moment at Microsoft in 2005, when the Test Leadership Team changed the software test engineer (STE) role to software development engineer in test (SDET). Groups also started experimenting with dev to test ratios, continuous integration, test-driven development, and combined engineering teams (developers and testers on the same team).

Unfortunately, writing robust and reliable automation is difficult, especially when the product being tested is constantly changing. Also, automation often fails to validate the actual customer experience, allowing some bugs to avoid detection. (The old-timers were right!) However, instead of bringing back old testers, companies across the industry augmented automation with instrumented customer previews—telemetry and analytics. Even Windows, Microsoft’s oldest and largest group, decided in 2013 to change SDETs to software engineers who use telemetry and analytics to ensure the effectiveness of designs and the quality of customer scenarios. That’s quite a shift.

Eric Aside

Read How we test software at Microsoft for details on older methods of Microsoft testing.

Also, I should note that some teams did bring back manual testers (often vendors) for legacy areas that were particularly difficult to automate and instrument, and many SDETs changed their focus to development. All developers need to recommit to unit testing and component testing, which they should have been doing all along, but now those tests carry greater weight.

Where did we go wrong?

To understand where the shift to telemetry and analytics has failed to detect embarrassing product problems, you need to return to first principles. Remember, the old-timers were right: The only way to detect customer issues adequately is to use the product like a customer (including eccentric, careless, and disagreeable customers). Relying on data from customer previews should work in theory, but in practice it can go wrong in seven ominous ways.

Audience: Is the preview audience a fair reflection of the full audience? If it’s too small to cover the broad range of usage patterns or biased in favor of certain usage patterns, the usage data will have holes that bugs can easily crawl through.

Usability: Is the preview code good enough to use for key customer scenarios across the entire preview audience? If the code isn’t usable, preview customers won’t stick with it, and the information you receive will be inadequate or skewed.

Frequency: Is your team releasing code to preview audiences frequently enough that it can analyze feedback, debug, fix, and re-release its code many times before general release? Just like the traditional code-test-fix cycle, using customer previews for testing requires frequent iteration.

Measures: Are measures of scenario success and failure being captured? If the outcomes of scenarios aren’t captured, good or bad, then your team can’t detect problems and improve. This is the most common mistake people make—they don’t think through the measures of success (and failure), don’t know what data to capture, and don’t write the proper telemetry and feedback mechanisms.

Visibility: Are the measures of success and failure visible and apparent to the engineering team? All the pretty charts in the world won’t do any good without the engineering team seeing and understanding the data frequently (daily or weekly).

Repair: Are the issues found actually repaired? Teams sometimes ignore rather than investigate data that seems rare or can’t be easily reproduced. Remember, every member of the preview audience represents thousands or even millions of members of the general audience. Ignore failures at your peril.

Acknowledgement: Does your organization acknowledge your preview audience members, letting them know the value of their contributions and the improvements you’re making as a result? Since preview audiences are typically made up of volunteers, and your data is only as good as their engagement, you need to ensure they feel heard, important, and powerful.
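To make the Measures point concrete, here is a minimal sketch of capturing scenario outcomes as telemetry events. All names (the function, the scenario labels, the event fields) are hypothetical illustrations, not any product's actual telemetry API; in a real system the event would go to a telemetry pipeline rather than being returned as a string.

```python
import json
import time

def log_scenario_event(scenario, outcome, details=None):
    """Record the outcome of a customer scenario as a telemetry event.

    Capturing both success and failure is the point: without explicit
    outcome events, the team can't tell a scenario that worked from one
    the customer abandoned partway through.
    """
    event = {
        "scenario": scenario,   # e.g., "share_document"
        "outcome": outcome,     # "success", "failure", or "abandoned"
        "timestamp": time.time(),
        "details": details or {},
    }
    return json.dumps(event)  # stand-in for sending to a telemetry pipeline

# Record one successful and one failed attempt at the same scenario.
print(log_scenario_event("share_document", "success"))
print(log_scenario_event("share_document", "failure",
                         {"error": "recipient_not_found"}))
```

The design choice worth noting: the success event is just as important as the failure event, because a scenario that customers silently stop attempting looks identical to a healthy one if only errors are logged.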

The fix is in

Most of the common problems with using customer previews for issue detection are straightforward to resolve.

  • Audience too small or biased? Expand or otherwise change the audience. If your team can’t alter the audience due to confidentiality or some other issue, you’re going to need dedicated testers.
  • Not providing acknowledgement to your preview audience? Track improvements in a weekly newsletter. Maybe even let people vote on their top ideas and issues.
  • Frustrated by failure data that seems rare or is difficult to reproduce? Enhance your telemetry to pinpoint the failure and provide you the information you need. Does that sound too hard? Work for our competitors.
  • Analytic charts not frequently viewed? Review them at daily standup, put them on a monitor in the hall, or send out regular reports.
  • Only releasing two or three previews? It’s hard to increase your team’s cadence, but you have to do it, or resort to rehiring a testing staff.
  • Didn’t think through the measures of success and failure? Stop being an idiot. Yeah, it’s hard—being a good engineer is hard. Learn, grow, and be great at measurement.

The trickier problem is usability—it seems like a catch-22. After all, the purpose of the customer preview is to find bugs, but you can’t release the preview until it’s sufficiently bug-free to be usable. How does your team get started?

Trilogy of the rings

To bootstrap usability, your team releases to preview audiences in rings. As product issues are found and resolved, you publish to larger and larger rings.

Ring zero: A handful of customers who are involved in usability studies, while the product is still under design. This audience is often neglected or even skipped, which is tragic since the best time to validate usability is before the product is built.

Ring one: Product team members who “dogfood” early builds and fix enough initial issues to make the product viable for customer validation.

Ring two: Self-selected power users who are understanding and patient enough to deal with some problems, while enjoying the status and advantage that comes with early information. Once serious issues of data loss, security, and reliability are resolved, the product should be ready for the next ring.

Ring three (and more): Enthusiasts who want the privilege and prestige of being early adopters, but can’t tolerate data loss or other serious issues associated with earlier rings. Depending on the size of the customer base, there might be four or even five rings in order to catch issues with rare configurations or usage patterns.
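The ring progression above can be sketched as a simple gating rule: each ring has exit criteria that must be met before a build is promoted to the next, larger audience. The ring names, audience sizes, and thresholds below are illustrative assumptions, not any team's actual policy.

```python
# Hypothetical ring definitions: each successive ring is a larger audience
# with stricter exit criteria. A limit of None means the ring is about
# learning (usability studies), not bug counts.
RINGS = [
    {"name": "ring0_usability", "audience": 10,      "max_blocking_bugs": None},
    {"name": "ring1_dogfood",   "audience": 500,     "max_blocking_bugs": 5},
    {"name": "ring2_power",     "audience": 20_000,  "max_blocking_bugs": 0},
    {"name": "ring3_early",     "audience": 500_000, "max_blocking_bugs": 0},
]

def next_ring(current_index, blocking_bugs):
    """Promote to the next ring only if the current ring's exit criteria pass."""
    limit = RINGS[current_index]["max_blocking_bugs"]
    if limit is not None and blocking_bugs > limit:
        return current_index  # hold: serious issues must be fixed first
    return min(current_index + 1, len(RINGS) - 1)

print(next_ring(1, blocking_bugs=3))  # 2: dogfood exit criteria met, promote
print(next_ring(2, blocking_bugs=1))  # 2: data-loss/reliability bugs block promotion
```

The key property this models is the article's point about ring two: serious issues of data loss, security, and reliability must be resolved (a blocking-bug count of zero) before the build reaches audiences that can't tolerate them.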

Eric Aside

Read more about customer previews in Quality is in the eye of the customer and using preview data to drive better products in Data-driven decisions.

What do we do now?

For engineers who’ve been around a while, the new world of less manual testing and more customer previews may seem quite different. Instead of improving designs and code before customers see them, teams are improving the product after release. But when you realize that preview audiences are basically acting like a diverse and distributed testing team, the iteration cycle is pretty similar to what we’ve always done—especially on teams that kept quality high throughout the cycle.

What is different is the role of testers. Instead of being heavy users and customer proxies, testers today are being asked to ensure our customer previews actually result in high quality products, with focus on the following activities:

  • Confirm the preview audience has the right number and mix of members.
  • Ensure the product hits the right level of usability for each ring, including testing the initial two rings along with the product team.
  • Raise concerns when previews are too infrequent, and help drive a fast cadence. (All team members should do this.)
  • Help define metrics for scenario success and failure, and validate that the right telemetry is in the code.
  • Do the analysis, produce the reports, and ensure that actionable data is always available, understandable, and in the face of the product team.
  • Ensure the top concerns and suggestions from the customer preview audience are acknowledged and receiving attention. (Again, a good thing for all team members.)
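For the analysis and reporting activity in the list above, a minimal sketch of turning raw scenario telemetry into the success rates a team would review at standup. The event shape and scenario names are hypothetical, chosen only to illustrate the aggregation.

```python
from collections import Counter

def scenario_success_rates(events):
    """Summarize (scenario, outcome) telemetry events into per-scenario
    success rates—the kind of actionable number an engineering team
    should be seeing daily or weekly."""
    totals, successes = Counter(), Counter()
    for scenario, outcome in events:
        totals[scenario] += 1
        if outcome == "success":
            successes[scenario] += 1
    return {s: successes[s] / totals[s] for s in totals}

events = [
    ("share_document", "success"),
    ("share_document", "failure"),
    ("open_file", "success"),
    ("open_file", "success"),
]
print(scenario_success_rates(events))  # {'share_document': 0.5, 'open_file': 1.0}
```

A report like this only has value if it is in front of the product team regularly, which is the Visibility point from earlier: the aggregation is trivial; the discipline of reviewing it is not.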

Remember, testing hasn’t gone away. It’s become more central and inclusive than ever.

Eric Aside

I talk more about the evolution and value of testing in Test don’t get no respect.

A better place

Unit testing and automation still have their essential places in validating code check-ins, ensuring safe code movement across branches, and providing a good starting point for the first ring of preview. In addition, there will always be aspects of our products and services that require special consideration regarding privacy, confidentiality, security, power consumption, performance, reliability, and scale. These aspects typically require dedicated testing, just like they did in the old days.

However, the bulk of our products are best tested by real customers in real environments doing real activities. For the customers’ benefit and ours, customer previews are the most efficient and effective means of assuring quality. Millions of our customers are happy to help shape and improve our products. It’s our job as engineering teams to respect their contributions, be responsive to their feedback, and use their input to produce high-quality products with engaging and delightful experiences.

And if you happen to be a longtime tester who is nostalgic about the old days, take heart in the growing maturity and modernization of your field. While you may need some new skills in data science, metrics, and continuous delivery, your old skills at smelling dicey components and areas, knowing how systems can fail, and guiding a strong testing effort are as invaluable as ever. The growth of your field corresponds to personal growth and ever-increasing value and quality for our customers.

Comments (1)

  1. Matt Gertz says:

    It’s certainly been the biggest culture change I’ve been involved with here in over 21 years. Part of it is everything you articulate above, and another part of it is remembering in the heat of the moment (for we are never outside the heat of the moment anymore) to actually do just that. It’s easy to ask “So, who’s going to make sure this gets tested thoroughly?” but seemingly harder to say “Making sure this gets tested thoroughly is my responsibility.”

    It’s a fractal problem. Writing code that works on your machine is a fixed cost; validating that code against the sea of machine platforms, configurations, and operating systems is nightmarishly larger — the matrix is multidimensional, and gets larger when convolved with other features from other teams. Leveraging a customer pool as you mention is a very good thing to help cover that. But the biggest challenge, IMO, is just being realistic about what your team can deliver at high quality. If you insist on wedging more new features into a release and sacrifice testing rigor to meet that goal, don’t be shocked when your various testing rings report back that all of your features are mediocre. We indulged ourselves in that in the past by having a waterfall-type stabilization period that was just as long (or longer) than the coding phase, wherein QA did its thing and devs then had to mentally page themselves backward six months in order to address what they found. Nowadays: no QA, no waterfall, just regular deliveries and no way to escape increasing quality debt if you get carried away with features and don’t pay attention to the test plan. It forces some tough cuts; it forces you to focus on the highest priority issues here and now.
