Prerequisites to Data Driven Quality

A previous post introduced the concept of data driven quality. Moving from traditional, up-front testing to data driven quality is not easy. You cannot take just any product and start using this method. In addition to cultural changes, there are several technical prerequisites the product must meet: early deployment, friction-free deployment, partial deployment, high speed to fix, limited damage, and access to telemetry about what users are doing.

Early deployment means shipping before the software is 100% ready. In the past, companies shipped beta software to limited audiences. This was a good start, but betas happened only once or twice per product cycle. For instance, Windows 7 had two betas and Windows 8 had one, both in roughly three-year product cycles. To use data to really understand the product's quality, shipping needs to happen much more often. That means shipping with lower quality. The exact level of stability can be determined per product, but it need not be very high if the rest of the prerequisites are met. Ken Johnston has a stimulating post about the concept of Minimum Viable Quality.

Friction-free deployment means a simple mechanism for getting the bits in front of users: seamless installation. The user shouldn't have to know they have a new version unless it looks different. Google's Chrome browser pioneered this model. It just updates in the background. You don't have to do anything to get the latest and greatest version, and you don't have to care.
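A silent updater can be as simple as a background loop that checks, downloads, and stages new builds without ever prompting the user. The sketch below is only illustrative; check_for_update, download_package, and apply_update are hypothetical placeholders, not any real update API.

```python
# A minimal sketch of a silent background updater. check_for_update,
# download_package, and apply_update are hypothetical placeholders,
# not a real update API.
import threading
import time

CHECK_INTERVAL_SECONDS = 6 * 60 * 60  # check a few times a day

def check_for_update(current_version: str):
    """Ask the update service whether a newer build exists (placeholder)."""
    return None  # e.g. {"version": "1.2.4", "url": "..."} when one is available

def download_package(update_info: dict) -> str:
    """Download the new build quietly in the background (placeholder)."""
    return "/tmp/update.pkg"

def apply_update(package_path: str) -> None:
    """Stage the new build so the next launch runs it, with no prompt."""
    pass

def update_loop(current_version: str) -> None:
    while True:
        update = check_for_update(current_version)
        if update:
            apply_update(download_package(update))
        time.sleep(CHECK_INTERVAL_SECONDS)

# Run the updater off the main thread so the user never waits on it.
threading.Thread(target=update_loop, args=("1.2.3",), daemon=True).start()
```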

Because we may be pushing software that isn't 100% ready, deployment needs to be limited in scope. Software that is not yet fully trusted should not be given to everyone all at once. What if something goes wrong? Services do this with rings of deployment. First the changes are shown to only a small number of users, perhaps hundreds or low thousands. If everything looks right, they are shown to more, maybe tens of thousands. As the software proves itself stable with each group, it is deemed worthy of pushing outward to a bigger ring.
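One common way to implement rings is to hash a stable user ID into a bucket and compare it against the cumulative share of users the current ring covers. This is just a sketch; the ring names, sizes, and hashing scheme are illustrative assumptions rather than details from any particular product.

```python
# A minimal sketch of ring-based rollout. Ring names, sizes, and the
# hashing scheme are illustrative assumptions, not any specific product.
import hashlib

# Cumulative share of the user population covered at each ring.
RINGS = [
    ("ring0_canary", 0.001),   # hundreds to low thousands of users
    ("ring1_early",  0.01),    # tens of thousands
    ("ring2_broad",  0.20),
    ("ring3_all",    1.00),
]

def bucket(user_id: str) -> float:
    """Map a stable user ID to a repeatable value in [0, 1]."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF

def in_rollout(user_id: str, current_ring: int) -> bool:
    """True if this user receives a build promoted up to current_ring."""
    return bucket(user_id) <= RINGS[current_ring][1]

# Example: see how many of a sample population each ring would reach.
for ring_index, (name, share) in enumerate(RINGS):
    reached = sum(in_rollout(f"user-{i}", ring_index) for i in range(100_000))
    print(f"{name}: target {share:.1%}, reaches {reached} of 100,000 sample users")
```

Because the hash is stable, a user who saw the build in an inner ring keeps it as the rollout widens, and promotion is just a matter of pointing at the next ring.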

If something goes wrong, it is important to fix it quickly. This can be a fix for the issue at hand (roll forward) or reversion to the last working version (roll back). The important thing is not to leave users in a broken state for very long. The software must be built with this functionality in mind. It is not okay to leave users facing a big regression for days. In the best case, they should get back to a working system as soon as we detect the problem. With proper data models, this could happen automatically.
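Automating that last step can be as simple as comparing the candidate build's health signal against the last known-good baseline and reverting when it regresses too far. The helpers below (error_rate, set_active_version) and the thresholds are assumptions for illustration; a real system would be backed by its telemetry pipeline and deployment service.

```python
# A minimal sketch of automated rollback. error_rate and set_active_version
# are hypothetical placeholders for the telemetry and deployment systems,
# and the thresholds are illustrative.
BASELINE_ERROR_RATE = 0.002   # last known-good version's rate
MAX_REGRESSION = 2.0          # tolerate at most 2x the baseline

def error_rate(version: str) -> float:
    """Fraction of sessions hitting an error, from telemetry (placeholder)."""
    return 0.01

def set_active_version(version: str) -> None:
    """Tell the deployment system which build to serve (placeholder)."""
    print(f"now serving {version}")

def check_health(candidate: str, last_known_good: str) -> None:
    # Roll back as soon as the candidate regresses past the threshold;
    # roll forward again once a fix has proven itself in the inner rings.
    if error_rate(candidate) > BASELINE_ERROR_RATE * MAX_REGRESSION:
        set_active_version(last_known_good)

check_health(candidate="2.0.1", last_known_good="2.0.0")
```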

Deployment of lower quality software means that users will experience bugs. Total experienced damage is a function of both duration and magnitude. Given the previous two prerequisites, the damage will be limited in duration, but it is also important to limit the damage in magnitude. A catastrophic bug which wipes out your file system or keeps a machine from booting does its harm immediately, no matter how short-lived it is. Rolling back doesn't repair the damage. Your dissertation is already corrupted. Pieces of the system which can have catastrophic (generally data loss) repercussions need to be tested differently and held to a higher quality bar before being released.
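One rough way to picture the relationship, as an illustration rather than a formal model:

```latex
\text{damage} \approx \int_{t_{\text{release}}}^{t_{\text{fix}}} \text{magnitude}(t)\,dt
```

For a recoverable bug, shrinking the exposure window shrinks the total. For an irreversible bug like data loss, the magnitude term is effectively unbounded, so no window is short enough; that is why these pieces need the higher bar up front.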

Finally, it must be easy to gather telemetry about what the user is doing. The product must be capable of generating telemetry, and the system must also be capable of consuming it. The product must be modified to make generating telemetry simple, usually in the form of a logging library. This library must be lightweight. It is easy to overwhelm a system's performance with too slow a library and too many log events. The library must also be capable of throttling. There is no sense in mounting a denial of service attack on your own datacenter just because a lot of users adopted the software.
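A client-side logging library along these lines might combine sampling with a simple rate limit before batching events for upload. This is a minimal sketch; the class, event names, and limits are assumptions, not a real telemetry SDK.

```python
# A minimal sketch of a lightweight, throttled telemetry client. The class,
# event names, sampling rate, and limits are illustrative assumptions.
import json
import random
import time
from collections import deque

class TelemetryLogger:
    def __init__(self, sample_rate: float = 0.1, max_events_per_minute: int = 60):
        self.sample_rate = sample_rate        # only a fraction of events are reported
        self.max_events_per_minute = max_events_per_minute
        self.recent = deque()                 # send times within the last minute
        self.buffer = []                      # batched events awaiting upload

    def log(self, event_name: str, **properties) -> None:
        # Client-side sampling keeps per-user overhead and total volume low.
        if random.random() > self.sample_rate:
            return
        # Rate limiting keeps an event storm from flooding the datacenter.
        now = time.time()
        while self.recent and now - self.recent[0] > 60:
            self.recent.popleft()
        if len(self.recent) >= self.max_events_per_minute:
            return
        self.recent.append(now)
        self.buffer.append({"event": event_name, "time": now, **properties})

    def flush(self) -> None:
        # A real client would batch-upload here; this just shows the payload size.
        payload = json.dumps(self.buffer)
        self.buffer.clear()
        print(f"uploading {len(payload)} bytes of telemetry")

logger = TelemetryLogger(sample_rate=1.0)
logger.log("file_open", duration_ms=12)
logger.flush()
```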

The datacenter must be capable of handling the logs. The scale of success can make this difficult. The more users there are, the more data will need to be processed, which can overwhelm network connections and analysis pipelines. The more data involved, the more computing power is necessary. Network pipes must be big. Storage requirements go up. Processing terabytes or even petabytes of data is not trivial. The more data, the more automated the analysis must become to keep up.
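Keeping up usually means reducing raw events into aggregates as early as possible, so that analysis works over counters rather than individual log lines. The sketch below shows that shape in miniature, with illustrative event fields.

```python
# A minimal sketch of collapsing a raw event stream into per-version counters
# so analysis scales with the number of distinct keys, not the number of events.
# The event fields are illustrative assumptions.
from collections import Counter
from typing import Iterable

def aggregate(events: Iterable[dict]) -> Counter:
    """Reduce raw events to (version, event, outcome) counts."""
    counts = Counter()
    for e in events:
        counts[(e["version"], e["event"], e["outcome"])] += 1
    return counts

sample = [
    {"version": "2.0.1", "event": "file_open", "outcome": "ok"},
    {"version": "2.0.1", "event": "file_open", "outcome": "error"},
    {"version": "2.0.0", "event": "file_open", "outcome": "ok"},
]
for key, count in aggregate(sample).items():
    print(key, count)
```

At real scale this reduce step runs in parallel across many machines, but the shape of the computation is the same.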

With these pieces in place, a team can begin to live the data driven quality lifestyle. There is much more than just the technology to think about, though. The very mindset of the team must change if the fourth wave of testing is to take root. I will cover these cultural changes next time.