What does "Robust" mean?

Back in the days of NT OS/2, one of the things that was absolutely drilled into the development team was robustness.  I even went so far as to write "Robustness" on my whiteboard in one-foot-high letters as a daily reminder.

The team distributed mugs with "INDUSTRIAL STRENGTH" on them, to indicate that NT (unlike previous MS operating systems) needed to be robust enough to work in mission-critical environments.

One of the problems with this, of course, is that "robustness", like "policy" and "session", is one of those words that really has no meaning.  Or rather, it has so many meanings that it has no effective meaning.

The problem with "robustness" is that its definition is situational - the very qualities that define the robustness of a system depend on how and where it's deployed.  Thus it is meaningless to consider robustness without first describing the scenario.

I first learned this lesson in college, in my software engineering class (15-413 at Carnegie Mellon).  When the instructor (whose name is escaping me currently) was discussing the concept of "robust engineering", he gave the following examples.

An ATM needs to be robust (ATMs were relatively new back then, so this was a 'trendy' example).  It would be VERY, VERY bad if an ATM were to drop a transaction - if you withdrew money and the machine crashed after updating your account ledger but before giving you the money, you would lose money.  Even worse, if the machine crashed after giving you the money but before updating your account ledger, then the bank would be out of money.  So it's critical that the ATM not crash in the middle of a given transaction.  On the other hand, it's not a big deal for an ATM to be down for days at a time - customers can simply find another ATM.
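To make the ATM failure window concrete, here is a toy sketch (mine, not the instructor's). The `Atm` class, its in-memory `journal`, and the `crash_after_debit` flag are all hypothetical names invented for illustration; a real machine would need durable storage and would also have to handle a crash on the other side of the dispense step.

```python
class SimulatedCrash(Exception):
    """Stands in for the machine dying partway through a withdrawal."""


class Atm:
    def __init__(self, balance):
        self.balance = balance   # the account ledger
        self.dispensed = 0       # cash actually handed to the customer
        self.journal = []        # pretend this list survives a crash

    def withdraw(self, amount, crash_after_debit=False):
        # Record the intent before touching the ledger.
        self.journal.append(("debit", amount))
        self.balance -= amount
        if crash_after_debit:
            # Ledger updated, no cash dispensed: the customer is out of money.
            raise SimulatedCrash()
        self.dispensed += amount
        self.journal.append(("dispensed", amount))

    def recover(self):
        # On restart, undo a debit that never reached the dispense step.
        if self.journal and self.journal[-1][0] == "debit":
            _, amount = self.journal.pop()
            self.balance += amount
```

The point is only that the transaction has to be all-or-nothing: after `recover()`, the ledger matches what the customer actually received.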

The phone network also needs to be robust (this was soon after the AT&T breakup, so once again, it was a 'trendy' example).  It's not a problem if the phone network drops a phone call, because the caller will simply reestablish the phone connection.  Btw, this doesn't hold true for some calls - for instance, the robustness requirements for 911 are different from normal calls due to their critical nature.  But it would be disastrous for the phone network to be down for more than a couple of minutes at a time.  That's also why the land line to my house is externally powered - even if the power grid goes down, I can still use the phone to call for help if I need it.

So which is more robust?  The answer is that they BOTH are, given their operating environments.  The robustness criteria for each of these is orthogonally different - the criteria that define robustness for an ATM are meaningless for a phone network.

I'm not even sure that there are any universal robustness principles - things like "don't crash" really are meaningless when you think about them.  The reality is that EVERYTHING crashes - my VCR crashes, my stereo crashes, all electronics crash given the right circumstances (most, but not all, of the electric appliances in my house don't work when the power to my house goes away).  Robustness in many ways is like the classic definition of pornography: "I shall not today attempt further to define the kinds of material... but I know it when I see it."

The last time I tossed out a dilemma like this one (Measuring testers by test metrics doesn't) I got a smidge of flak for not proposing an alternative mechanism for providing objective measurements of a tester's productivity, so I don't want to leave this discussion without providing a definition for robustness that I think works...

So, after some thought, I came up with this:

A program is robust if, when operating in its expected use scenarios, it does not cause unexpected behaviors.  If the program DOES fail, it will not corrupt its operating data.

I'm not totally happy with that definition, because it seems really wishy-washy, but I've not come up with anything better.  The caveat ("when operating in its expected use scenarios") is necessary to cover the ATM and phone network cases above - the ATM's expected use scenarios involve reliable transactions but not continual operation, while the phone network's are just the opposite: continual operation but not reliable transactions (phone calls).  Another problem with it is "unexpected behaviors" - that's almost TOO broad.  It covers things like UI glitches that might not properly be considered relevant from a robustness standpoint (though they might be - if the application is a rendering application, then rendering issues affect robustness).

The second sentence was added to cover the "don't force the user to run chkdsk (or fsck) on reboot" aspect - if you DO encounter a failure, your system will recover.  There are even weasel words in that clause - I'm saying it shouldn't corrupt "operating data" without defining operating data.  For example, NTFS considers its filesystem metadata to be operating data, but the user's file data isn't considered to be operating data.  Exchange, on the other hand, considers the users' email messages to be its operating data (as well as the metadata).  So the robustness criteria for Exchange (or any other email system) include the users' data, while the robustness criteria for NTFS don't.
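As one small illustration of the "don't corrupt your operating data" clause, here is a sketch of a common application-level pattern: write the new state to a temporary file, then swap it into place.  This is not how NTFS or Exchange do it (both use far more elaborate logging), and `save_state_atomically` is a hypothetical helper name of my own.  It assumes the temporary file and the target live on the same filesystem, so that `os.replace` can rename rather than copy.

```python
import json
import os
import tempfile


def save_state_atomically(path, state):
    """Replace the file at `path` with `state`; a failure leaves the old file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # swap the new file in over the old one
    except BaseException:
        # Something failed before the swap: discard the partial temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the write fails partway through, the reader of `path` still sees the previous, complete state - which is exactly the property the second sentence of the definition is asking for.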