Whole program vs. local analysis

A quick note for today. Maybe more tonight.

Internally at Microsoft, as you can imagine, we have a number of tools to assist in building our software.  Many are internal-only; contrary to what the black-helicopter gang out there believes, there is a lot of cost to shipping software, and it can divert your attention and resources from your primary goal of making your teams' code faster/better/whatever.

I won't divulge the details or identity of any of these tools; I'm sure you can find references to all of them if you follow the MSFT blogosphere.  Let's call one of them "mega-lint".

Some people have never seen the need for a tool like lint; that's what the compiler is for, right?  The problem is that as I have alluded to several times now, compilers are all about syntax and translation semantics; their goal up to now has been to let you express your ideas more fluidly, not necessarily with better intent.

In the previous example, where the void reverse_array() function was calling member functions/operators on a type, on one hand it was "obvious" that they would succeed, but on the other hand, if you look at the contract long and hard, it's actually not obvious that they will.  Consider, for example, a language/runtime environment where the use of operator[] was remoted (auto-remoted interfaces are problematic and will be a topic of several entries later on).  The "obviousness" that the only failure modes were covered, because we were working within the documented limits of the array, is now questionable.  Sorry to lapse into C++ for a bit, but consider if the signature of reverse_array() had been:

template <typename T> void reverse_array(T &array) { ... }

Sure, std::vector may give you some sort of guarantee, but given the requirements (existence of a size() member function and implementation of operator[]), you had better code your template to deal with exceptions being thrown (a/k/a errors being returned) in all of those cases.  I had to invoke C++ here because the C equivalent of a macro would be somewhat uncompelling, and I'm not familiar enough with Java and CLR generics to say whether the error contract for the members can be explicitly called out.
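To make the hidden assumption concrete, here's a sketch (my reconstruction, not the original code) of what such a template might look like. Note that nothing in it acknowledges that size() or operator[] could fail; if either throws partway through, the array is left half-reversed:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of a generic reverse that silently assumes its requirements
// (size() and operator[]) never fail.  Each call below is a place where
// an exception could escape and leave `array` in a half-reversed state.
template <typename T>
void reverse_array(T &array) {
    std::size_t n = array.size();              // is size() allowed to throw? the contract is silent
    for (std::size_t i = 0; i < n / 2; ++i)
        std::swap(array[i], array[n - 1 - i]); // operator[] might throw too
}
```

For std::vector this happens to be safe in practice, which is exactly why the missing contract goes unnoticed until T is something more exotic.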

Back to the point.  There's no way to declare that we don't expect operator[] to fail as long as we pass in legal bounds.  (Consider a sparse array implementation which may have to dynamically allocate the element on first access...)
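A hypothetical sparse array makes the point: even an access that is perfectly within the documented bounds can fail, because the element may not exist yet and materializing it requires allocation. This is an illustrative sketch, not any particular library's type:

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical sparse array: only slots that have been touched are stored,
// and operator[] materializes an element on first access.  That means even
// a "legal" index can fail with std::bad_alloc when the node is allocated.
class SparseArray {
    std::map<std::size_t, int> elems_; // backing store for touched slots only
    std::size_t size_;
public:
    explicit SparseArray(std::size_t n) : size_(n) {}
    std::size_t size() const { return size_; }
    int &operator[](std::size_t i) {
        return elems_[i]; // may allocate a map node -> may throw
    }
};
```

A SparseArray satisfies the template's syntactic requirements (size() and operator[]) perfectly well, yet it breaks the "in-bounds access can't fail" assumption.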

How do we even figure out that there is a problem there?  Should all code just always write the try/catch blocks?  Probably not!  The main claim to fame for exception handling is that you should write less code, not more.  But then someone with a try/catch higher up the stack is probably not expecting to catch the silly operator[] failure.

This kind of problem is impossible to diagnose without either massive tools which can do whole-program analysis and simulation or, and this is my preference, reducing the problem to a more local one about contract description and local analysis.

Local analysis is much easier since it's what every programmer does when reading code.  Now, we've been trained badly: either there are APIs whose failure status you ignore by convention (fclose, fprintf, etc.), or there are new patterns building up where you have a try/catch block around a function where you catch all errors and rethrow a different exception.  Both of these inhibit easy local analysis, and both patterns must be stopped if we're going to make progress in making software more reliable.
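The catch-all-and-rethrow pattern can be sketched as follows (the WidgetError type and function are hypothetical, invented for illustration). The problem for local analysis is that the caller can no longer distinguish an out-of-memory condition from a bounds error; everything arrives as the same wrapper type:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical wrapper exception used by the rethrow pattern.
struct WidgetError : std::runtime_error {
    explicit WidgetError(const std::string &what) : std::runtime_error(what) {}
};

void do_widget_work() {
    try {
        // ... anything in here can throw anything; simulate one failure ...
        throw std::out_of_range("index 7 out of range");
    } catch (const std::exception &e) {
        // The original exception's type -- and with it the caller's ability
        // to reason locally about what actually failed -- is discarded here.
        throw WidgetError(std::string("widget failed: ") + e.what());
    }
}
```

A reader at the call site sees only that do_widget_work() can throw WidgetError, which tells them almost nothing about which failures are recoverable.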

Usual caveat applies: People writing actual applications instead of reusable libraries can dial this setting pretty much wherever they want to.  Shared code authors should be very aware of this problem and the "v1" people out there (you know them, the ones who have a cool idea, get the awards and good bonuses and then move on to the next v1 project leaving their path of destruction a mile wide behind them...) need to get some discipline.