What does “Robust” mean?

Back in the days of NT OS/2, one of the things that was absolutely drilled into the development team was robustness.  I even went so far as to write “Robustness” on my whiteboard in one-foot-high letters as a daily reminder.

The team distributed mugs with “INDUSTRIAL STRENGTH” on them (to indicate that NT, unlike previous MS operating systems) needed to be robust enough to work in mission critical environments.

One of the problems with this, of course, is that “robustness”, like “policy” and “session”, is one of those words that really has no meaning.  Or rather, it has so many meanings that it has no effective meaning.

The problem with “robustness” is that its definition is situational – the very qualities that make a system robust depend on how and where it’s deployed.  Thus it is meaningless to consider robustness without first describing the scenario.

I first learned this lesson in college, in my software engineering class (15-413 at Carnegie-Mellon).  When the instructor (whose name escapes me at the moment) was discussing the concept of “robust engineering”, he gave the following examples.

An ATM needs to be robust (ATMs were relatively new back then, so this was a ‘trendy’ example).  It would be VERY, VERY bad if an ATM were to drop a transaction – if you withdrew money and the machine crashed after updating your account ledger but before giving you the money, you would lose money.  Even worse, if the machine crashed after giving you the money but before updating your account ledger, the bank would be out of money.  So it’s critical that for a given transaction, the ATM not crash.  On the other hand, it’s not a big deal for an ATM to be down for days at a time – customers can simply find another ATM.
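The ATM problem above is really a transaction-ordering problem: there are two steps (debit the ledger, dispense the cash), and a crash between them leaves money unaccounted for.  The classic fix is a write-ahead journal: record your intent durably before doing anything, so that on restart you can tell whether a withdrawal was left half-done.  Here’s a minimal sketch of that idea – all names are hypothetical, and a real ATM would obviously involve far more than this:

```python
# Sketch of write-ahead journaling for a two-step withdrawal.
# The journal entry is made durable BEFORE the ledger/cash steps,
# so a crash between "debit" and "dispense" is detectable on restart.

import json
import os

JOURNAL = "atm_journal.json"  # hypothetical journal location

def begin_withdrawal(account: str, amount: int) -> None:
    # Durably record intent before touching the ledger or the cash drawer.
    with open(JOURNAL, "w") as f:
        json.dump({"account": account, "amount": amount, "state": "debited"}, f)
        f.flush()
        os.fsync(f.fileno())  # force the record to stable storage

def complete_withdrawal() -> None:
    # Only after the cash has actually been dispensed do we discard the record.
    os.remove(JOURNAL)

def recover():
    # On restart: a leftover journal entry means the account was debited but
    # the cash may never have come out -- refund it and flag a human audit.
    if os.path.exists(JOURNAL):
        with open(JOURNAL) as f:
            pending = json.load(f)
        os.remove(JOURNAL)
        return pending  # caller refunds pending["amount"]
    return None
```

The key design point is the ordering: the journal write is fsync’d before either real step happens, so the failure mode shifts from “money silently lost” to “pending transaction found and reconciled” – which is exactly the robustness property the ATM scenario demands.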

On the other hand, the phone network also needs to be robust (this was soon after the AT&T breakup, so once again, it was a ‘trendy’ example).  It’s not a problem if the phone network drops a phone call, because the caller will simply reestablish the connection.  By the way, this doesn’t hold true for some calls – for instance, the robustness requirements for 911 are different from normal calls due to their critical nature.  On the other hand, it would be disastrous for the phone network to be down for more than a couple of minutes at a time.  Similarly, the land line to my house is externally powered – which means that even if the power grid goes down, I can still use the phone to call for help if I need it.

So which is more robust?  The answer is that they BOTH are, given their operating environments.  The robustness criteria for the two systems are orthogonal – the criteria that define robustness for an ATM are meaningless for a phone network.

I’m not even sure that there are any universal robustness principles – things like “don’t crash” really are meaningless when you think about them – the reality is that EVERYTHING crashes – my VCR crashes, my stereo crashes, all electronics crash (given the right circumstances – most (but not all) of the electric appliances in my house don’t work when the power to my house goes away).  Robustness in many ways is like the classic definition of pornography: “I shall not today attempt further to define the kinds of material… but I know it when I see it.”

The last time I tossed out a dilemma like this one (Measuring testers by test metrics doesn’t) I got a smidge of flak for not proposing an alternative mechanism for objectively measuring a tester’s productivity, so I don’t want to leave this discussion without providing a definition of robustness that I think works…

So, after some thought, I came up with this:

A program is robust if, when operating in its expected use scenarios, it does not cause unexpected behaviors.  If the program DOES fail, it will not corrupt its operating data.

I’m not totally happy with that definition, because it seems really wishy-washy, but I’ve not come up with anything better.  The caveat (when operating in its expected use scenarios) is necessary to cover the ATM and phone network cases above – the ATM’s expected use scenarios involve reliable transactions but not continual operation; the phone network’s expected use scenarios are just the opposite – they involve continual operation but not reliable transactions (phone calls).  One of the other problems with it is “unexpected behaviors” – that’s almost TOO broad – it covers things like UI glitches that might not properly be considered relevant from a robustness standpoint (but they might – if the application was a rendering application, then rendering issues affect robustness).

The second sentence was added to cover the “don’t force the user to run chkdsk (or fsck) on reboot” aspect – if you DO encounter a failure, your system will recover.  There are even weasel words in that clause – I’m saying it shouldn’t corrupt “operating data” without defining operating data.  For example, NTFS considers its filesystem metadata to be operating data, but the user’s file data isn’t considered to be operating data.  Exchange, on the other hand, considers the users’ email messages to be its operating data (as well as the metadata).  So the robustness criteria for Exchange (or any other email system) include the users’ data, while the robustness criteria for NTFS don’t.
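One common way applications protect their own operating data – without paying for full transactional storage – is the write-a-new-copy-then-rename pattern: a crash leaves you with either the complete old version or the complete new one, never a half-written mix.  Here’s a minimal sketch of that idea (illustrative only – it’s an application-level trick, not how NTFS or Exchange actually protect their metadata):

```python
# Sketch of crash-safe file replacement: write a complete new copy to a
# temp file in the same directory, make it durable, then atomically
# rename it over the old file. A crash at any point leaves either the
# old contents or the new contents intact -- never a torn write.

import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    # The temp file must live in the same directory as the target so
    # that the final rename is a same-volume (atomic) operation.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data is durable before we "commit"
        os.replace(tmp, path)     # the atomic commit step
    except BaseException:
        os.unlink(tmp)            # abandon the half-written copy
        raise
```

The design choice here mirrors the distinction in the paragraph above: the pattern treats the file’s *contents* as operating data worth protecting, at the cost of rewriting the whole file on every update – a trade-off a filesystem that only journals metadata has deliberately declined to make for user data.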

Comments (13)

  1. Anonymous says:

    "So the robustness criteria for Exchange (or any other email system) includes the users data, while the robustness criteria for NTFS doesn’t."

    At least, not at the moment. I recall that Longhorn (at least, as announced at the 2003 PDC) will feature Transactional NTFS. Is this still slated for Longhorn, and will it protect simple file writes automatically? Obviously atomic updates to multiple files will need additional coding, but what can we expect, if anything, for existing applications?

  2. Anonymous says:

    Mike, that may be true, I don’t know. I do know that if it’s enabled, it won’t be enabled by default – the performance cost of transacting application data writes is prohibitive (network file copies would be twice as slow, for example).

  3. Anonymous says:

    This is just the best blog. I’m always happy when there’s new Larry Osterman to read.

    Here’s my meager attempt at a "robust" definition:

    A program is robust if, by design, it protects critical data and critical access, provides fail-safe operation, and allows for system recovery when the unexpected does occur.

    Feel free to pick it apart. I’d be honored.

  4. Anonymous says:

    How about this: A program (or machine) is robust if it does not exhibit failure modes that are costly.

    For the ATM, screwing up the customer’s account is high cost; failure to do anything isn’t. For the telephone network, failure to connect is high cost.

    Also a quibble: In the early ’80s I was in the electronic security industry. At that time, it was common for ATMs to fail to account for 1 to 2% of the cash they dispensed. Even then, it was cheaper for banks to eat this 1-2% than to pay human tellers. However, every bank understood that it was very expensive to charge a customer’s account without dispensing the money — reconciliation typically cost tens of dollars, and frequently resulted in the loss of a customer, whose acquisition cost was often about $50.

  5. Anonymous says:

    This all sounds right and standard for a sensible definition of robustness.

    A quibble of my own: where on Earth do you come from that thinks this is a grammatically meaningful sentence:

    <blockquote>The team distributed mugs with "INDUSTRIAL STRENGTH" on them (to indicate that NT, unlike previous MS operating systems) needed to be robust enough to work in mission critical environments.</blockquote>

    Some mad parenthesis there 🙂

  6. Anonymous says:

    Interesting article. My understanding of "Robust" has always been substantially different than your definition. It seems to me that a program that does not cause problems "when operating in its expected use scenarios" is merely _correct_. I mean, if a program fails in its expected use scenario, it cannot be considered as a valid solution at all.

    To me, robustness is a virtue proved by "failing gracefully" (not killing anyone around etc.) even when the conditions have not been expected. The response to the problems that were unexpected is the key definition to me. (Well, obviously, you cannot create a system that would be able to handle anything, so the conditions have to be "expected" at least in some generic way, but nonetheless I feel they shouldn’t be named "expected scenarios".)

  7. Anonymous says:

    I think the real problem with your definition is that you seem to use "unexpected behavior" and "failure" interchangeably… I think they are 2 different things.

  8. Anonymous says:

    Rob: That’s because I don’t have an editor to catch my stupid mistakes.

    Petr: That’s why I have the second sentence (about recovering without lost data). You’re right that it’s "merely _correct_", and a correct program _should_ be robust.

    As I said in the beginning, everyone likes to use the word "Robust" (or "Reliable") but the basic problem here is that the word is meaningless. Is a phone network more robust (or reliable) than an ATM because it’s always available?

    Robustness is clearly a desirable feature in a system, but how do you know if your system _is_ robust?

  9. Anonymous says:

    Your definition is most illuminating given your (and your employer’s) position within the IT industry.

    The distinction of "operating in its expected use scenarios" is far too weak IMO, and experience oft reinforces this belief.

    Truly robust programs manage unexpected use cases as well as the (apparently) impossible ones.

    Had this ethic permeated the industry some time ago, I suspect the times would not be quite as exciting or depressing, depending (respectively) on whether you are observing or afflicted by the manic, exploit-riddled jungle we find ourselves in.

    Personally, I don’t think too many "real" programmers are willingly complicit in this, but the expediencies driven by marketers to ship now and fix later, and the ultra-rationalist vesting mentality, make it difficult for people to deliver on such principles.


  10. Anonymous says:

    A while back, I wrote about how I disliked the word "Robustness" because its meaning was so vague. …