Hi Jon DeVaan here.
Steven wrote about how we organize the engineering team on Windows which is a very important element of how work is done. Another important part is how we organize the engineering project itself.
I’d like to start with a couple of quick notes. First is that Steven reads and writes about ten times faster than I do, so don’t be too surprised if you see about that distribution of words between the two of us here. (Be assured that between us I am the deep thinker :-). Or maybe I am just jealous.) Second is that we want do want to keep sharing the “how we build Windows 7” topics since that gives us a shared context for when we dive into feature discussion as we get closer to the PDC and WinHEC. We want to discuss how we are engineering Windows 7 including the lessons learned from Longhorn/Vista. All of these realities go into our decision making on Windows 7.
OK, on to the tawdry bits.
Steven linked last time to the book Microsoft Secrets, which is an excellent analysis of what I like to call version two of the Microsoft Engineering System. (Version one involved index cards and “floppy net” and you really don’t want to hear about it.) Version two served Microsoft very well for far longer than anyone anticipated, but learning from Windows XP, the truly different security environment that emerged at that time and from Longhorn/Vista, it became clear that it was time for another generational transformation in how we approach engineering our products.
The lessons from XP revolve around the changed security landscape in our industry. You can learn about how we put our learning into action by looking at the Security Development Lifecycle, which is the set of engineering practices recommended by Microsoft to develop more secure software. We use these practices internally to engineer Windows.
The comments on this blog show that the quality of a complete system contains many different attributes, each of varying importance to different people, and that people have a wide range of opinions about Vista’s overall quality. I spend a lot of time on core reliability of the OS and in studying the telemetry we collect from real users (only if they opt-in to the Customer Experience Improvement Program) I know that Vista SP1 is just as reliable as XP overall and more reliable in some important ways. The telemetry guided us on what to address in SP1. I was glad to see one way pointed out by people commenting about sleep and resume working better in Vista. I am also excited by the prospect of continuing our efforts (we are) using the telemetry to drive Vista to be the most reliable version of Windows ever. I add to the list of Vista’s qualities successfully cutting security vulnerabilities by just under half compared to XP. This blog is about Windows 7, but you should know that we are working on Windows 7 with a deep understanding of the performance of Windows Vista in the real world.
In the most important ways, people who have emailed and commented have highlighted opportunities for us to improve the Windows engineering system. Performance, reliability, compatibility, and failing to deliver on new technology promises are popular themes in the comments. One of the best ways we can address these is by better day-to-day management of the engineering of the Windows 7 code base—or the daily build quality. We have taken many concrete steps to improve how we manage the project so that we do much better on this dimension.
I hope you are reading this and going, “Well, duh!” but my experience with software projects of all sizes and in many organizations tells me this is not as obvious or easily attainable as we wish.
Daily Build Quality
Daily quality matters a great deal in a software project because every day you make decisions based on your best understanding of how much work is left. When the average daily build has low quality, it is impossible to know how much work is left, and you make a lot of bad engineering decisions. As the number of contributing engineers increases (because we want to do more), the importance of daily quality rises rapidly because the integration burden increases according to the probability of any single programmer’s error. This problem is more than just not knowing what the number of bugs in the product is. If that were all the trouble caused then at least each developer would have their fate in their own hands. The much more insidious side-effect is when developers lack the confidence to integrate all of the daily changes into their personal work. When this happens there are many bugs, incompatibilities, and other issues that we can’t know because the code changes have never been brought together on any machine.
I’ve prepared a graph to illustrate the phenomenon using a simple formula predicting the build breaks caused by a 1 in 100 error rate on the part of individual programmers over a spectrum of group sizes (blue line). A one percent error rate is good. If one used a typical rate it would be a little worse than that. I’ve included two other lines showing the build break probability if we cut the average individual error rate by half (red line) and by a tenth (green line). You can see that mechanisms that improve the daily quality of each engineer impacts the overall daily build quality by quite a large amount.
For a team the size of Windows, it is quite a feat for the daily builds to be reliable.
Our improvement in Windows 7 leveraged a big improvement in the Vista engineering system, an investment in a common test automation infrastructure across all the feature teams of Windows. (You will see here that there is an inevitable link between the engineering processes themselves and the organization of the team, a link many people don’t recognize.) Using this infrastructure, we can verify the code changes supplied by every feature team before they are merged into the daily build. Inside of the feature team this infrastructure can be used to verify the code changes of all of the programmers every day. You can see in the chart how the average of 40 programmers per feature team balances the build break probability so that inside of a feature team the build breaks relatively infrequently.
For Windows 7 we have largely succeeded at keeping the build at a high level of quality every day. While we have occasional breaks as we integrate the work of all the developers, the automation allows us to find and repair any issues and issue a high quality build virtually every day. I have been using Windows 7 for my daily life since the start of the project with relatively few difficulties. (I know many folks are anxious to join me in using Windows 7 builds every day—hang in there!)
For fun I’ve included a couple pictures from our build lab where builds and verification tests for servers and clients are running 24x7:
Whew! That seems like a wind sprint through a deep topic that I spend a lot of time on, but I hope you found it interesting. I hope you start to get the idea that we have been very holistic in thinking through new ways of working and improvements to how we engineer Windows through this example. The ultimate test of our thinking will be the quality of product itself. What is your point of view on this important software engineering issue?