Sustained engineering idiocy

Plumbing channels waste water into a series of larger and larger pipes till it is expelled. That’s because sewage flows downstream, which explains the quality of goods that test, operations, and sustained engineering teams receive. After all, they are downstream of design and development.

I’ve written about pushing quality upstream for testers in “Feeling testy” (chapter 4), and making services resilient to failure for operations using the five Rs in “Crash dummies.” Like most engineers, I’ve neglected sustained engineering (SE), also known as the engineering sewage treatment center. No, on second thought, that analogy implies that what we release to customers has been cleansed. SE is more akin to environmental cleanup after an oil spill—thankless, difficult, and messy.

Imagine what must go through the minds of those cleanup crews as they wash oil from the feathers of sea birds. Naturally, there’s empathy for the birds (customers). There’s frustration at the inevitability of mistakes that lead to tragedy (buggy software). And there’s a palpable desire to have the jackasses who caused the spill be forced to take your place (the engineers who let the bugs slip by).

You make the call

Should the engineers who design, construct, and test the code be the same engineers who fix the bugs found after release? This is the quintessential question of SE.

If the engineers who built the release fix the post-release bugs, you typically get better fixes, the engineers feel the pain of their mistakes, and the fixes can be integrated into the next release. Then again, the next release may not happen because its engineers are being randomized.

If you have a dedicated SE team, you build up knowledge of the code base outside the core engineering team, you can potentially pay a vendor to sustain old releases, and you don’t distract or jeopardize progress on new releases. Then again, SE teams get little love, their fixes can be misinformed, you duplicate effort, and the core engineering team isn’t held accountable for their mistakes.

Tough call, huh? Nope, not at all. While both models can work, having the engineers who build the release also fix post-release bugs is far better. Only idiots believe a lack of accountability leads to long-term efficiency and high quality. Of course, the world is full of idiots, but I digress.

Someone’s got to take responsibility

Yes, a dedicated SE team can work, but long term it will only cause grief for team members and customers. Why? Because you can mitigate the distraction that post-release fixes cause the core team, but you can’t mitigate the problems inherent in a dedicated SE team.

Let’s go through those dedicated SE team problems again.

§  Little love. What would it take for the dedicated SE team to be appreciated as much as the core engineering team? A disaster, right? And what would it take on a day-to-day basis? Non-stop disasters. In other words, the conditions for loving the SE team are undesirable.

§  Misinformed fixes. To get a fix right, recognizing all the implications of a change, you need to deeply understand the impacted portion of the code base. Even if we fantasize that the core engineering team has that level of depth, the core team is always considerably larger than the SE team, so the SE team has no hope of truly appreciating the impact of its fixes. Reality is only worse. Sure, you can have the SE team consult with the core team, but doing that all the time defeats the purpose.

§  Duplicate effort. Whenever you have two teams fixing issues in the same code you duplicate effort, by construction. You’ve got two teams learning the same code, debugging the same code, changing the same code, and testing the same code. There’s no getting around it, unless you neglect to incorporate the fixes into the next release, which is even worse.

§  Accountability for mistakes. The whole point of the dedicated SE team is to avoid derailing the core engineering team, protecting them from dealing with fixes. The core team doesn’t correct their mistakes in the old code, and doesn’t learn to prevent those mistakes from recurring in the new code. What’s worse is that there’s no reinforcement of good and bad behavior. Conscientious heroes don’t get to write more quality code, while careless villains fix past mistakes. Thus, we can never expect to improve. A great recipe for joyful competitors and sorrowful customers.

What do I do now?

In contrast, there’s plenty you can do to avoid jeopardizing future releases while the core engineering team fixes prior mistakes. Let’s run through the relentless, randomizing requests and resolve them.

§  Triviality. How do you avoid wasting the core team’s time with issues that aren’t software bugs, or have trivial workarounds? You have a small dedicated team triage the issues. Note this team isn’t a development team. It’s purely an evaluation team that determines which issues are worth fixing. That way, only worthwhile work is passed on to the core team.

§  Prioritization. How do you balance bug fixes for the last release with work on the new release? You have the dedicated evaluation team prioritize the fixes. There are four buckets: immediate fix (the rare “call the VP now” issue); urgent fix (next scheduled update); clear fix (next service pack or update); and don’t fix. These buckets send clear signals to the core team about which bugs to fix at what time.

§  Unpredictability. How do you make inherently unpredictable post-release issues easy for the core team to schedule around? You make them regular events. Deploy one update per month. The urgent fixes each month are queued up by the evaluation team. The core team sets aside the necessary time each month and the fixes are designed, implemented, tested, and deployed on a predictable schedule. This is just as good for customers as it is for the core engineering team. Everyone likes predictability.

In addition, the evaluation team can create virtual images for easy debugging by the core team, improve the update experience for customers, and reflect customer needs and long-term sustainability features back into future releases.
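If it helps to picture the triage-and-batch flow described above, here’s a minimal sketch in Python. All names (the `Bucket` values, `TriageQueue`, `monthly_update`) are hypothetical illustrations of the process, not an actual tool:

```python
from dataclasses import dataclass, field
from enum import Enum

class Bucket(Enum):
    """The four triage buckets (names are illustrative)."""
    IMMEDIATE = 1   # rare "call the VP now" issue -- fix right away
    URGENT = 2      # goes into the next scheduled monthly update
    CLEAR = 3       # next service pack or major update
    DONT_FIX = 4    # not worth the core team's time

@dataclass
class Issue:
    title: str
    bucket: Bucket

@dataclass
class TriageQueue:
    """Hypothetical evaluation-team queue feeding the core team."""
    issues: list = field(default_factory=list)

    def triage(self, issue: Issue) -> None:
        # The evaluation team assigns a bucket before anything
        # reaches the core engineering team.
        self.issues.append(issue)

    def monthly_update(self) -> list:
        """Pull the urgent fixes batched for the next monthly release."""
        batch = [i for i in self.issues if i.bucket is Bucket.URGENT]
        self.issues = [i for i in self.issues if i.bucket is not Bucket.URGENT]
        return batch
```

The point of the sketch is the shape of the process: triage happens continuously, but the core team only sees a predictable batch once a month.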

Eric Aside

Of course, it isn’t as simple as a small evaluation team prioritizing issues. There’s a bunch of orchestration and system support necessary to make SE run smoothly. That part is unavoidable. What is avoidable is duplicating effort, uninformed fixes, and ignoring accountability.

This won’t hurt a bit

See, it’s not that complicated. You save on staff. You get better fixes. You catch similar issues in advance. You achieve predictability. And you ensure the core engineering team is accountable for quality and learns from its mistakes. All it costs is a relatively tiny dedicated team to manage the monthly update process by evaluating and prioritizing issues. Even that team feels valued due to their differentiated and important role and their direct engagement with solving customer problems.

Yes, sewage flows downstream and no one likes cleaning it up. However, by putting some simple processes in place, you can reduce the sewage and have those responsible mop up the mess. To me that smells like justice.

Eric Aside

What do you do if you are stuck on a dedicated SE team and are experiencing little love, misinformed fixes, duplicate effort, and no accountability from the core team? Here are a few ideas:

§  Create a rotational program with the core team. Everyone spends a month or two a year on the SE team. It’s not ideal, but I’ve already established that point.

§  Measure your efficiency and effectiveness, perhaps by the average time to resolve issues for each bucket, the regression rate, team morale, and customer acceptance of fixes (a balanced scorecard). Optimize, publish your results, and show the core engineering team how great work gets done.

§  You ship updates once a month—celebrate once a month.
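To make the scorecard idea above concrete, here’s a minimal sketch of one such metric: average time to resolve issues per bucket. The function name and the bucket labels are hypothetical, mirroring the four buckets described earlier:

```python
from collections import defaultdict

def average_resolution_days(resolved_issues):
    """Average days-to-resolve per triage bucket.

    resolved_issues: iterable of (bucket_name, days_to_resolve) pairs.
    Returns a dict mapping bucket name -> average days.
    """
    totals = defaultdict(lambda: [0.0, 0])  # bucket -> [sum, count]
    for bucket, days in resolved_issues:
        totals[bucket][0] += days
        totals[bucket][1] += 1
    return {b: total / count for b, (total, count) in totals.items()}
```

Tracked month over month alongside regression rate and customer acceptance, a number like this is what lets an SE team publish results and show the core team how great work gets done.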

Comments (5)

  1. Carol Anne says:

    On the other hand, having the original design/implement team perform SE means that the same faulty presuppositions they hold about design, about the software’s purpose, and about the users, will blind them to the same things that lie at the root of the software flaws.

    Einstein said, “We can’t solve problems by using the same kind of thinking we used when we created them.”

    What’s required is a "blended" team, consisting of some people from the original team (’cause they know "where the bodies are buried," and the underlying design assumptions) and some fresh and new (’cause they’re able to see with "fresh eyes" through the faulty assumptions made during design and implementation).

  2. Will Lees says:

    It makes sense for the core team to maintain versions n-1, n, and n+1, since the current devs are working on that code. Since devs change and code bases get rewritten, proficiency with the old ways gets lost and forgotten. Who here remembers how to kernel debug Win95? Who would want to invest their skill base learning the hard and primitive old ways? There’s money to be made supporting NT4, but the dev team doesn’t have the inclination for that mission.

  3. Scott Cheney says:

    This argument makes the assumption that the same engineers are working on the product after it goes to market. Also, product design often morphs between releases, especially when accommodating new infrastructure or other changes in architecture. The net result is that the product team has less knowledge than an SE team is likely to have. The fixes the product team makes to the older versions do not necessarily translate to better knowledge of the current version.

    While I agree with the idea that the product team needs to be accountable for keeping the cost of SE at a minimum, I think there are other ways to accomplish this than by having the product team double as the SE team.  SE is part of the product life cycle and the product team should be involved.  In my experience SE teams are often isolated, only interacting with the product team for high priority or high risk fixes.  Instead, I think the SE team needs to be an integral part of the product team.  

    To me, the SE debate is a lot like the Test team debates that have been ongoing over the years. One result of that debate was to better define the role of Test and how to properly integrate that role into the development process. Here we are asking whether the SE team should be separate or not exist at all. Instead, I would question what aspects of SE should be done by the product team and what aspects should be done by a SE team. I think we probably need to rethink SE and make it a more integral part of our product development process. I see your suggestions as a step in that direction. For example, is the small dedicated team you mention really an aspect of the SE team that should be done with or by the product team?

  4. Yvette O'Meally says:

    I’ve seen several different SE org structures work pretty well. Their success often depends on the life stage of the product and the commitment of the teams. This was the subject of a pretty interesting discussion thread on the SE Cabinet about a year ago. Various product groups with both dedicated and virtual/distributed SE teams chimed in with their opinions. There is no one model that is clearly the best; they all have their pros and cons. The major points made were that SE has a lifespan and takes different forms over its lifecycle. A dedicated team tends to produce the most predictable and best quality work in terms of hotfixes. However, with a dedicated team it takes a more structured effort to drive improvements in the core team for future product.

    To help address that, some teams rotate core people through SE, others include SE team members as participants on core feature teams, some focus on supportability/serviceability reviews etc.  I agree with the previous comments that the key is strong relationship between the new development and the sustaining functions however the organization is structured.

  5. Stefan O says:

    Here, we have a dedicated Corrective Action (you called it SE) team separate from the core design team. Part of our process involves giving and getting feedback between the teams. The CA team gets notified of a problem and evaluates it. Part of the evaluation is, will the problem end with this fix or is it still being propagated in the new design. If it is serious, the CA team consults with the core team to do just as you describe – prevent the problem from continuing in the new design and get input on the fix. Yes, there is duplication of effort, but working together we don’t waste too much time. Also, you underestimate the skill set involved with CA, which is quite different from new design. CA knows how to implement a change that minimizes impact and cost, something often lost on the new design people. Of course my experience is manufacturing, not software, but the ideas are similar.