It breaks my heart and sickens my stomach to witness the tremendous productivity and quality gains of Lean Software Development practices at Microsoft: feature crews in Office, scrum teams in Xbox, and improvement teams in SQL Server, to name a few. These Lean approaches yield less incomplete work, higher-quality builds throughout the product cycle, earlier feedback on features, and timelier cross-discipline communication.
Feature crews are small, cross-discipline teams each tasked with a single feature, or closely related small features, to design, spec, develop, and test together from start to finish. Improvement teams are quite similar to feature crews, but have a different name to emphasize their work on infrastructure improvements, as well as features.
Why is my heart broken by greater Lean productivity? Why is my stomach in knots over higher Lean quality? Because there’s an insidious, deceptive, soul-crushing beast lurking in many of Microsoft’s Lean Software Development efforts, slowing them down, driving poor engineering practices, lengthening our ship cycles, and reducing the feature set we deliver to customers. What is this horrifying, hellacious, heathen of a heartbreaker? It goes by the name “stabilization.”
Stabilization periods between development milestones don’t just add superfluous time to schedules. Stabilization periods also make products worse with fewer features and lower quality, while also dragging down morale and encouraging crappy code. Oh, my forlorn heart. Oh, my aching stomach.
We don't need you anymore
Legacy McLoser says, “You’ve lost it this time, Mr. Wright. Stabilization periods are essential for integration quality.” No, they’re not—at least not when using Lean methods, like feature crews, improvement teams, scrum, or kanban.
In traditional Waterfall methodology, the codebase isn’t functional or complete during development periods. Stabilization between development periods provides an opportunity to run integration and system testing, fix bugs, and get the codebase close to a shippable state. Without stabilization periods, integration and system issues build up to the point that past Microsoft projects have failed or slipped years. Thus, stabilization is required when running traditional Waterfall.
In Lean methods, a cross-discipline team works on a feature from start to finish. The feature enters the codebase integrated, complete, and fully tested (or at least it should—more on this later). The codebase is always kept close to a shippable state.
Integration issues that occasionally arise in Lean projects resemble sustained engineering issues far more than the sustained insanity of cleanup during Waterfall stabilization. Teams should deal with sustained engineering issues as they go. Therefore, stabilization for Lean methods is unnecessary and quite harmful, as I describe below, because it demoralizes your best teams and encourages your worst.
For more on the basics of Lean Software Development, please read my 2004 column on the subject, “Lean: more than good pastrami” (chapter 2).
Shake and bake
Note that I’m only talking about stabilization periods between development periods. I’m not talking about the lengthy stabilization time at the end of a long product cycle. That “bake time” for release candidates is necessary for traditional, packaged products like Office and SQL Server to catch edge cases that take time to emerge, regardless of whether projects use a Waterfall or Lean approach.
Properly designed services can be gradually rolled out, instantly rolled back, and iteratively improved after they are released. Thus, properly designed services don’t need “bake time” at the end of a project.
Packaged products have limited opportunity to roll out gradually, instantly roll back, or iteratively improve once they are released to a wide audience. Thus, packaged products (and critical trust services, like handling credit cards) need “bake time” to minimize risk.
I talk about properly designing services with gradual rollout and instant rollback in my column, “There’s no place like production.”
Get a clue
Stabilization periods between development milestones aren’t just unnecessary, they are harmful. Why? Imagine a feature crew committed to delivering every feature completed and fully tested. We’ll call this feature crew “the clue crew.”
What does the clue crew do during stabilization? Sure, there might be a few integration issues, but certainly not enough to keep the clue crew busy the whole time. Mostly, the clue crew fixes other teams’ bugs and works secretly on more features.
Meanwhile, there’s another feature crew, “the poo crew,” which is sloppy. This crew takes on too much work and doesn’t deliver completed and fully tested features. Instead, its features are full of bugs and lack fleshed-out tests. The poo crew uses the stabilization period to clean up its mess—with the help of the clue crew.
What happens next? Naturally, management notices how the poo crew ended up completing more officially booked features than the clue crew—even though the poo crew could only finish its features with the extra few weeks and the help of the clue crew. Management chastises the clue crew for its lack of ambition and encourages it to book more work, like the poo crew, the following development period.
What’s the matter here?
Maybe you think the poo crew is the smart one. After all, it appears to be more productive than the clue crew. If you are that superficial and stupid, please take your way-back machine to the 1990s, and get back to wasting valuable time and effort producing buggy products using traditional Waterfall.
Hopefully you recognize that the poo crew wouldn’t have completed as much over the entire development and stabilization period if it had been forced to fix its own bugs and produce tested, sustainable features. Likewise, the clue crew could have produced more tested, sustainable features had it not needed to spend time fixing poo crew bugs, and instead used the stabilization time to complete more official features.
However, as long as there is a stabilization period, poo crews will have permission to leave their features incomplete and clue crews will be punished for writing quality code by being slapped with poo crew bugs and chastised for lacking ambition. In other words, stabilization periods provide the means to demoralize your best teams and encourage your worst. Soon every team is writing poo, and stabilization periods become essential in a self-fulfilling, Waterfall way.
Why is the Lean clue crew consistently more productive with higher quality than the poo crew?
- Because they test their features early, before bad designs are baked into the code.
- Because they fix bugs early, while the code is still fresh.
- Because they work tightly across disciplines, while designs are still iterating, improving communication and causing less rework.
Are we done yet?
What’s the alternative to stabilization periods? Hard definitions of “done” for features, enforced by strong release management. I talked about “done” definitions for features in Cycle time—the soothsayer of productivity. Here they are again:
- All updated designs and code are reviewed (includes security, etc.).
- All automated tests are written and passed (includes security, etc.).
- No ship-stopping bugs exist (includes unacceptable fit-and-finish).
- All monitoring and feedback is in place.
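To make the gate concrete, here is a minimal sketch of how release management might check those four definitions before a feature integrates. Everything in it is an assumption for illustration: the `FeatureStatus` fields, the function names, and the two sample features are hypothetical and not part of any real tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureStatus:
    """Hypothetical readiness snapshot for one feature branch."""
    name: str
    reviews_complete: bool      # designs and code reviewed, security included
    tests_passing: bool         # automated tests written and passed
    ship_stopping_bugs: int     # counts unacceptable fit-and-finish too
    monitoring_in_place: bool   # monitoring and feedback are wired up


def unmet_done_criteria(feature: FeatureStatus) -> List[str]:
    """Return the done definitions this feature has not yet satisfied."""
    unmet = []
    if not feature.reviews_complete:
        unmet.append("updated designs and code are not all reviewed")
    if not feature.tests_passing:
        unmet.append("automated tests are not all written and passing")
    if feature.ship_stopping_bugs > 0:
        unmet.append("ship-stopping bugs remain")
    if not feature.monitoring_in_place:
        unmet.append("monitoring and feedback are not in place")
    return unmet


def may_integrate(feature: FeatureStatus) -> bool:
    """A feature enters the release branch only when nothing is unmet."""
    return not unmet_done_criteria(feature)


# Illustrative inputs only; the values are made up.
clue = FeatureStatus("clue-crew-feature", True, True, 0, True)
poo = FeatureStatus("poo-crew-feature", True, False, 3, False)

for feature in (clue, poo):
    verdict = "integrate" if may_integrate(feature) else "blocked"
    print(feature.name, verdict, unmet_done_criteria(feature))
```

Returning the list of unmet definitions, rather than a bare yes or no, tells a blocked crew exactly what it still owes before it gets credit for the feature.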
Typically, feature crews, improvement teams, and scrum teams will develop in a branch off the release code branch and not integrate back into the release branch with the new feature until the done definitions are met. There are two rules for working in branches:
- Never be more than one branch away from the release branch, unless you enjoy waiting weeks for fixes to integrate up and down the branches. There are several variations on being one branch away—a branch per feature, per crew, per team, or per group. Any can work if you forward integrate daily from the release branch, build your branch at least daily, and never create branches of branches.
- Never stay on a branch for more than a month, unless you enjoy merge conflicts and integration hell. If your feature is larger than a month’s worth of effort, break down the feature into a series of smaller, testable pieces, satisfy the “done” definitions for each piece, and then reverse integrate back into the release branch.
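The two branch rules, plus the daily forward integration they depend on, can be sketched as a simple policy check. This is a hedged illustration, not a real source-control API: the `Branch` record, the `"release"` branch name, and the 30-day reading of “a month” are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

RELEASE_BRANCH = "release"            # assumed name of the release branch
MAX_BRANCH_AGE = timedelta(days=30)   # "a month", approximated as 30 days
MAX_SYNC_GAP = timedelta(days=1)      # forward integrate at least daily


@dataclass
class Branch:
    """Hypothetical record describing a working branch."""
    name: str
    parent: str                       # the branch this one was created from
    created: date
    last_forward_integration: date    # last pull of changes from release


def policy_violations(branch: Branch, today: date) -> List[str]:
    """Return the branch rules this branch currently breaks."""
    violations = []
    if branch.parent != RELEASE_BRANCH:
        violations.append("more than one branch away from release")
    if today - branch.created > MAX_BRANCH_AGE:
        violations.append("on the branch for more than a month")
    if today - branch.last_forward_integration > MAX_SYNC_GAP:
        violations.append("forward integration is stale")
    return violations


# Illustrative dates only.
today = date(2024, 3, 1)
healthy = Branch("feature-x", "release", date(2024, 2, 10), date(2024, 3, 1))
risky = Branch("spike", "feature-x", date(2024, 1, 1), date(2024, 2, 20))

for branch in (healthy, risky):
    print(branch.name, policy_violations(branch, today))
```

A check like this could run alongside the daily branch build, so a crew hears about a stale or nested branch long before integration hell arrives.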
Being in a stable relationship
With proper enforcement of “done” and smart use of branches, the poo crew can’t integrate its unstable, unfinished, unsustainable ugliness into the release branch. It doesn’t get credit for features it didn’t really finish.
Meanwhile, the clue crew integrates its features regularly and spends its time adding value to the release branch instead of stabilizing some other team’s poo. There’s no need for stabilization in this model because the release branch is always stable with the latest completed features. If there is an integration issue, it’s handled like any other issue with released code: the crew responsible stops work as needed and fixes the issue (see Sustained engineering idiocy for details).
What happens to all those weeks of stabilization time between development milestones? You can either cut them from the schedule and ship earlier or use the time to create and deliver more customer value.
Hasta la vista, baby
Teams still need to plan regularly, coordinate, and synchronize. Packaged products still need significant “bake time” for release candidates at the end of a product cycle. However, stabilization’s days are numbered.
If you haven’t switched to a Lean approach, do so today. Lean is far more efficient and yields higher quality than Waterfall, it scales to huge organizations, and there are enough variations of Lean that one is sure to fit your team.
If you have switched to a Lean approach, do away with stabilization periods. By removing this ugly old vestige of inefficient, low-quality development, and by insisting on only completed features in the release branch, you make your product better, your team better, and your customers happier. May stabilization rest in peace.