Costs, Modelling, and Managing Risk

Here's a little piece of mail I sent out to some folks today discussing some root causes of performance problems generally. I've written about basically all of this before, but here it is in summary form, perhaps mostly to prove that what I tell my colleagues here really is the same as what I tell you guys.

----

We do very little to encourage people to understand the properties of their algorithms before they start coding.  Generally this is a much bigger problem than any machine level anomalies  you might encounter.  The “real costs” people need to know are more often at a much higher level than the machine.

The big two problems are almost invariably:

  1. The developer is using an algorithm that is fundamentally unsuitable for the task at hand
  2. The developer has taken a dependency on a technology that is fundamentally too costly in the relevant context

Those are broad categories but I say them that way to emphasize that the problem is almost certainly not something like “your algorithm requires 35 TLB slots and you only have 32 available”.

That said there is a prescription for success, and it is not “code it all up and then measure the heck out of it.”  It’s too late by then and teaching such a practice teaches despair.

Engineering is about achieving predictable results for predictable costs.  Notwithstanding that basic truth we rarely set out to predict anything in any kind of reasonable way.  What I’ve been trying to teach for the last 3+ years now is a fairly simple process:

  1. Decide what level of performance you are looking for in rough terms – do you want an “A+” or is a “C-” good enough?  An “F” is never acceptable by definition.
  2. Understand what that grade of performance looks like in the terms your customer thinks about
  3. Consider the limits the metrics above place on your consumption of resources (cycles, disk reads, network round trips, whatever is likely to be relevant)
  4. Postulate an algorithm and then take steps to cost it out in terms of the resource(s) in (3) before you code it all.  Be as detailed as is necessary but not gratuitously detailed.
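
To make steps 3 and 4 concrete, here is a minimal back-of-the-napkin sketch in Python. Everything in it is an assumption made up for illustration: the unit costs, the 200 ms budget, the two candidate plans, and the names `Plan`, `estimated_cost_ms` and `fits_budget`. Substitute numbers you have measured in your own context.

```python
from dataclasses import dataclass

# Illustrative unit costs in milliseconds. These are assumptions, not
# measurements; replace them with figures from your own hardware.
ASSUMED_UNIT_COST_MS = {
    "disk_read": 8.0,
    "network_round_trip": 40.0,
    "million_cycles": 0.5,
}


@dataclass
class Plan:
    """A postulated algorithm, described only by the resources it consumes."""
    name: str
    usage: dict  # resource name -> count per customer-visible operation


def estimated_cost_ms(plan, unit_costs=ASSUMED_UNIT_COST_MS):
    """Step 4: cost the plan out in terms of the resources from step 3."""
    return sum(unit_costs[resource] * count
               for resource, count in plan.usage.items())


def fits_budget(plan, budget_ms, unit_costs=ASSUMED_UNIT_COST_MS):
    """Return (fits, margin_ms); a negative margin means over budget."""
    cost = estimated_cost_ms(plan, unit_costs)
    return cost <= budget_ms, budget_ms - cost


if __name__ == "__main__":
    # Steps 1-2: a hypothetical goal, expressed as the customer sees it.
    budget_ms = 200.0

    candidates = [
        Plan("one query per item",
             {"network_round_trip": 10, "disk_read": 10, "million_cycles": 4}),
        Plan("single batched query",
             {"network_round_trip": 1, "disk_read": 10, "million_cycles": 6}),
    ]
    for plan in candidates:
        ok, margin = fits_budget(plan, budget_ms)
        verdict = "fits" if ok else "over budget"
        print(f"{plan.name}: {estimated_cost_ms(plan):.0f} ms, "
              f"{verdict} (margin {margin:.0f} ms)")
```

With these made-up numbers the per-item plan blows the budget on round trips alone, and you find that out before writing any of the real code. The model is deliberately crude; add detail only where the margin is thin.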

The idea is to control risk.  If you are shooting for a C- chances are you can very easily demonstrate that you’ll be able to meet the goal because it should be an easy goal to hit.  A few quick calculations on the back of a napkin will do the job.  Contrariwise if you are shooting for an A+ (world class performance) chances are that your margins are razor thin and you will be testing the limits of the hardware.  You will want to spend a considerable amount of time trying out things and perhaps creating proof-of-concept implementations, models, etc.  It will be cost effective to do so under strenuous requirements.

The bottom line is that you should know, very early in your cycle, that you are substantially likely to succeed. 

All of this plays directly into having a basic understanding of elementary framework costs and architecture costs.  It’s not that hard to get the facts you need via experiment. 
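
As a rough illustration of getting such facts by experiment, the sketch below uses Python's standard `timeit` module to put a number on one elementary operation. The helper name `cost_per_call_ns` and the two operations compared (membership tests on a list and on a set) are stand-ins I picked for the example; measure whatever framework calls your own design actually leans on.

```python
import timeit


def cost_per_call_ns(stmt, setup="pass", number=100_000, repeat=5):
    """Estimate the cost of one execution of `stmt` in nanoseconds.

    Takes the minimum over several runs, since the slower runs mostly
    reflect interference from the rest of the machine.
    """
    times = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    return min(times) / number * 1e9


if __name__ == "__main__":
    # Hypothetical experiment: what does a lookup cost in each container?
    setup = "items = list(range(1000)); lookup = set(items)"
    list_ns = cost_per_call_ns("999 in items", setup)
    set_ns = cost_per_call_ns("999 in lookup", setup)
    print(f"list membership: {list_ns:.0f} ns per call")
    print(f"set membership:  {set_ns:.0f} ns per call")
```

The absolute numbers will differ from machine to machine; what you want out of an experiment like this is the order of magnitude to plug into your cost estimates.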

I get very worried when people say things like “Productivity and cleanliness always trump performance.”   Productivity is about creating product.  A “clean” design which fundamentally fails to address performance requirements is not an example of a productive enterprise; it is a looming disaster.  A developer productively engaged in creating a failure is uninteresting.

I like to teach that it is best to consider the entire cycle from a risk management perspective.  Complex design incurs risk; significant unknowns incur risk; unmodeled security threats, unwritten code, and messy code all incur risk.

At any given stage you take the steps most needed to best control the remaining risks with the time you have (including the risk of not finishing).  Keep in mind that a messy yet performant design has merely introduced a different category of risks, maybe worse than the performance risk was in the first place.  Risk is the ultimate equalizer and importantly it teaches balance in approach.

After all, overdoing your performance work is just another way to fail.