Defensive Programming - small introduction and one not-so-tall tale

Recently, I helped one of the devs (let’s call him Dev1) figure out a bug. The bug was introduced into the code by a different dev (let’s call him Dev2). The person who was ultimately responsible for the bug creeping into the code in the first place, however, was a third dev (that would be me).

A little bit about defensive programming

Almost every dev does at least some level of defensive programming in their day-to-day job – it’s kind of hard not to. The difference is in the quality, the quantity and in identifying which areas are more sensitive than others. Basically, you employ defensive programming if a part of your code defends itself from being misused (whether through actual code that checks that the right thing is being done, or through the way the system flows, which makes sure that Bad Things cannot happen). A great place where you can usually see a lot of defensive programming is public API entry points – those are usually hardened against misuse simply because they are going to be used by many people who don’t necessarily have the expertise or knowledge to use them properly the first time around. However, defensive programming is also of paramount importance for writing code that will be easier to maintain and extend as the product evolves. In these cases, the misuse you are protecting the code against is that of your fellow developers or of future developers. Even code that is only a few months old can be misused – people forget assumptions that were made, and even if the code is well documented, sometimes people skim instead of reading carefully.

It is important to note that writing defensive code needs to be balanced against other engineering practices. Too much defensive code can make the product less readable, less performant or too complex, or it may simply take too much time to write to be worth it. As such, you always need to weigh a defense mechanism you put in place against these other very important concerns to make sure you are not doing more harm than good. Also, just like any other piece of code, defensive code can break – you always need to be on your guard to make sure you did not break any of it.

There are many types of defense mechanisms (and many sub-types of those), some more obvious than others. I will try to go over a few of them, just to give a feel for how things work:

Compliance code

This is probably what comes to most people’s minds when they think about defensive programming. In this case, the developer adds actual code (debug-only or not) to the program that checks that a certain assumption actually holds. This can be anything from the valid range of values of a parameter, through a certain state of an object, to more complex things such as verifying that a data structure has not been corrupted.
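
To make that concrete, here is a minimal, hypothetical C++ sketch of what compliance code can look like – the class and names are made up for illustration, not taken from any real product:

    #include <cassert>
    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    // Hypothetical example: a buffer that actively checks the
    // assumptions callers are supposed to uphold.
    class BoundedBuffer
    {
    public:
        explicit BoundedBuffer(std::size_t capacity) : m_data(capacity) {}

        void Set(std::size_t index, int value)
        {
            // Compliance code: validate the parameter range up front
            // instead of trusting every caller to get it right.
            if (index >= m_data.size())
                throw std::out_of_range("BoundedBuffer::Set: index out of range");
            m_data[index] = value;

            // Debug-only check of an internal assumption.
            assert(!m_data.empty());
        }

    private:
        std::vector<int> m_data;
    };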

Social coding

You don’t see much of this in public APIs since it tends to look silly, but you can see it in some internal code, usually when devs are forced to make a not-so-obvious or dangerous decision. It usually comes into play in the actual names of functions, classes and variables. The idea behind it is that while devs can (and sometimes will) ignore comments embedded in code, they cannot ignore a warning that is part of the very name they have to type. Some examples (a concrete sketch follows the list):

  • CMyClass::GetIStreamNoAddref() or m_myParentNoAddref – in this example, the dev wanted to make sure people who use the code after her will know, w/o any need to read comments, that a certain field/method does not AddRef and as such does not need to be released – devs will think 20 times before doing m_myParentNoAddref->Release() – something they won’t do before doing m_myParent->Release().
  • SafeHandle.DangerousGetHandle – this is an actual example from the CLR where the devs thought it important enough to call out a function that can be dangerous if misused (and hopefully make the users read the docs to figure out why it’s dangerous).
  • m_dontUseMe – another example, where a class member was often misused as an ID (because it seemed to be unique while it wasn’t really) and that caused endless problems. In the end, the mere name change to m_dontUseMe reduced the misuse considerably…
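
To make the first bullet concrete, here is a hypothetical C++ sketch of naming-as-warning – the IStream below is a minimal stand-in so the snippet compiles anywhere, not the real COM interface:

    // Minimal stand-in for a COM-style ref-counted interface.
    struct IStream
    {
        virtual unsigned long AddRef() = 0;
        virtual unsigned long Release() = 0;
    };

    class CMyClass
    {
    public:
        // The name itself carries the warning: no AddRef is performed,
        // so the caller must NOT call Release() on the returned pointer.
        IStream* GetIStreamNoAddref() { return m_stream; }

    private:
        IStream* m_stream = nullptr;           // owned and released elsewhere
        IStream* m_myParentNoAddref = nullptr; // weak back-pointer; never Release()d
    };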

Restricted Flow

Restricted Flow is a somewhat less used defensive programming technique – there aren’t that many cases where it can be employed. However, it is probably one of the best techniques because, for the most part, it is non-intrusive and more elegant than the alternatives.

An example would be an object that wraps an operation. If you do not want certain other parts of the system to be available, you can make sure not to pass those in as parameters to the object, thus limiting what the devs who work on that object can reach. Of course, this approach is not infallible by any stretch; people can go to great lengths to add the information they need and then use it. However, if that info is not readily available, people will tend to at least do some due diligence to make sure they are not doing the wrong thing.
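
A hypothetical sketch of the idea (the workbook names are made up to echo the story later in this post): the operation object is constructed from only what it needs, so code inside it simply has no path to the user:

    #include <string>
    #include <utility>

    // Hypothetical: the full request context, including the user.
    struct UserContext    { std::string name; };
    struct RequestContext { UserContext user; std::string workbookPath; };

    // The load operation is deliberately handed the workbook path alone.
    // Code inside Run() has no user in scope, so misusing user state
    // requires going out of your way – that's the restricted flow.
    class WorkbookLoadOperation
    {
    public:
        explicit WorkbookLoadOperation(std::string workbookPath)
            : m_workbookPath(std::move(workbookPath)) {}

        void Run()
        {
            // ... load m_workbookPath; no user information available here ...
        }

    private:
        std::string m_workbookPath;
    };

    // The caller decides what crosses the boundary:
    //   WorkbookLoadOperation op(request.workbookPath); // the user stays behind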

Technical restrictions

This is a hybrid of Restricted Flow and Compliance code – it basically means that the dev uses some technical feature of the system (usually) to put a hard restriction on future devs – one that they will find much harder to work around. A good example of this is running code in a separate process (or a separate app-domain). This greatly restricts what that code can do, but it also makes sure it cannot muck with the calling system’s state.
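
As a hypothetical illustration (POSIX-flavored, since the mechanics vary by platform), here is roughly what pushing work into a separate process looks like – whatever the child does, it cannot touch the parent’s in-memory state:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    // Run an external tool in a child process and wait for it.
    int RunIsolated(const char* tool, const char* arg)
    {
        pid_t pid = fork();
        if (pid < 0)
            return -1;                               // fork failed
        if (pid == 0)
        {
            execlp(tool, tool, arg, (char*)nullptr); // child: replace image
            _exit(127);                              // exec failed
        }
        int status = 0;
        waitpid(pid, &status, 0);                    // parent: wait for result
        return status;
    }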

Code reviews

This can be considered a defensive programming practice of sorts. In Excel Server, for example, no product code gets checked in w/o having another developer pore over it looking for problems (code reviews could fill a whole series of posts on their own).

That wasn’t a little bit, you rotten liar

Yeah, that was a bit longer than I initially intended. Back to our story – Dev1 asked me for some help with the bug. After looking it over, we were able to figure out what the problem was. It seemed like we were doing something illegal in our code – something we did not intend to allow. The problem was in the piece of code responsible for loading workbooks. Since execution of that code doesn’t “belong” to any one user but, rather, is shared by many, we did not want user information to be used in that code. To achieve that, we made sure the code was properly isolated and did not include user information in any of its entry points. And indeed, for a whole year or so, that code was safe in that respect. So what happened?

Somebody (me) introduced a new infrastructure mechanism to the server. When developing multithreaded apps, it is sometimes useful to use thread local storage to store “context sensitive information” about various parts of the system. On servers, one such context-sensitive piece of information is the identity of the user who executed the request (as well as the request object itself).
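
A minimal sketch of the mechanism using C++ thread-local storage – the names are hypothetical, not our actual infrastructure:

    #include <string>

    // Hypothetical per-thread "ambient" context.
    struct RequestContext
    {
        std::string userName;
        int requestId = 0;
    };

    thread_local RequestContext* t_currentContext = nullptr;

    // Anywhere on this thread, code can ask "who is the current user?"
    // without the context being passed down through every call.
    const RequestContext* CurrentContext() { return t_currentContext; }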

Now, having been bitten by TLS a few times in the past 12 years or so, I was reluctant to add code to the system that used it. However, after hours of therapy and some kludgy code, I decided that the pros in this case outweighed the cons and went ahead and added a feature to the infrastructure that allows our devs to figure out the user/request w/o having to pass them as parameters over and over again through all sorts of parts of the system. Since our system is highly asynchronous, I also had to make sure that information properly “hopped” from one thread to another as various parts of our system executed.
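
Continuing the hypothetical sketch above, the “hopping” can be done by wrapping queued work so the context is captured on the posting thread and re-installed on whichever thread actually runs it:

    #include <functional>
    #include <utility>

    // Capture the current context when work is queued, and install it
    // on the worker thread for the duration of the work.
    std::function<void()> WrapWithContext(std::function<void()> work)
    {
        RequestContext* captured = t_currentContext;  // posting thread
        return [captured, work = std::move(work)]
        {
            RequestContext* previous = t_currentContext;
            t_currentContext = captured;              // worker thread
            work();
            t_currentContext = previous;              // restore (a real version
                                                      // would use RAII so an
                                                      // exception can't skip this)
        };
    }

    // Usage: post WrapWithContext(work) to the thread pool instead of
    // posting work directly, and the user/request follows the work.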

The code Dev2 added, and which broke us, used the new mechanism to grab some information that lived on the user. However, that’s something we did not want done there. The biggest problem with this was that it worked properly most of the time. It was only when the user timed out before the shared operation finished that we would get an error. And that’s the worst part – Dev2 would have been hard-pressed to find a scenario where this did not work, and so he thought everything was okay and checked in the code.

And this is where we conclude that this is entirely my fault. When I introduced the new code, I inadvertently allowed devs an easy way to defeat one of the defensive mechanisms we had in place. While the change was not necessarily major (all I did was supply information – I didn’t change any flow of code or anything else of that sort), I did open up a way for other people to do Bad Things.

The correct thing to do, when I introduced the new code, would have been to add a defensive mechanism that would immediately return an error (and assert!) if the mechanism is used in one of those sensitive areas. Adding such code is incredibly simple and would have taken a fraction of the time it took to find and fix the bug.
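
Continuing the earlier hypothetical sketch, the missing guard could have been as small as this:

    #include <cassert>
    #include <stdexcept>

    // Set while executing shared (user-less) code such as workbook loading.
    thread_local bool t_inSharedOperation = false;

    const RequestContext* CurrentUserContext()
    {
        // In a shared operation there is no single "current user",
        // so fail loudly instead of handing back whichever user
        // happened to kick the operation off.
        assert(!t_inSharedOperation && "user context requested in shared code");
        if (t_inSharedOperation)
            throw std::logic_error("no user context in shared operations");
        return t_currentContext;
    }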