Yesterday’s article was a bit of a trick question, but it was a real-world example. Our group encountered this last week in some code we were testing (pre-production code; it was not part of any product).
Somewhat surprisingly, it turns out that the code, as written, wasn’t incorrect. The hr = ERROR_NOT_SUPPORTED line was in fact intentional: a raw Win32 error number assigned to an HRESULT is a success code, and the code was using that success code as a signal to a higher level component. As many of you pointed out, it looked like an outright error, and I was bitten by it as well: I was making an unrelated modification to the routine and noticed the
hr = ERROR_NOT_SUPPORTED; expression and fixed it to be
hr = HRESULT_FROM_WIN32(ERROR_NOT_SUPPORTED);
But that turned out to be a HORRIBLE mistake. Even though the code had passed my unit tests, and several other developers had tested the code, one of our testers encountered a fairly reliable test failure on his machines. One of the developers in our group (after spending a long day debugging) finally chased the problem down to this change. It turns out that on certain audio cards, PerformAnOperation() was returning a failure (but only on those audio cards).
With my change, the routine propagated a failure code where it used to return a SUCCESS code carrying an informational error number. And, because I’d changed the SUCCESS code to a FAILED code, the caller of this routine didn’t handle the result correctly.
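To make the difference concrete, here is a minimal sketch of why the two assignments behave so differently. It uses portable stand-ins rather than the Windows SDK: HRESULT_t, Succeeded and HresultFromWin32 are hypothetical names that mirror HRESULT, SUCCEEDED() and HRESULT_FROM_WIN32() from winerror.h, and the constants carry the values of ERROR_NOT_SUPPORTED and FACILITY_WIN32.

```cpp
#include <cstdint>

// Portable stand-ins for the Windows definitions (assumptions for
// illustration; real code would include <windows.h> instead).
using HRESULT_t = std::int32_t;
constexpr std::uint32_t kErrorNotSupported = 50;  // value of ERROR_NOT_SUPPORTED
constexpr std::uint32_t kFacilityWin32 = 7;       // value of FACILITY_WIN32

// Mirrors SUCCEEDED(): success means the severity (high) bit is clear.
constexpr bool Succeeded(HRESULT_t hr) { return hr >= 0; }

// Mirrors HRESULT_FROM_WIN32(): values that are already zero or negative
// pass through unchanged; anything else gets FACILITY_WIN32 and the
// failure bit.
constexpr HRESULT_t HresultFromWin32(std::uint32_t code) {
    return static_cast<HRESULT_t>(code) <= 0
        ? static_cast<HRESULT_t>(code)
        : static_cast<HRESULT_t>((code & 0x0000FFFFu) |
                                 (kFacilityWin32 << 16) | 0x80000000u);
}

// The original line: a small positive number, so callers see success.
constexpr HRESULT_t kOriginal = static_cast<HRESULT_t>(kErrorNotSupported);

// The "fixed" line: the high bit is set, so callers see failure.
constexpr HRESULT_t kFixed = HresultFromWin32(kErrorNotSupported);
```

Under these definitions Succeeded(kOriginal) is true and Succeeded(kFixed) is false, which is exactly the change in behavior the caller was not prepared for.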
So what was wrong? The mistake was that the semantics associated with returning a success return code were not called out anywhere in the code. The code was correct (or rather, functioning as intended), but some of the assumptions associated with the code were not documented. So someone (me) came along later and “fixed” the code and thus introduced a bug. Correcting the function was as simple as documenting the assumption…
// Called when something happens
// Returns S_OK if successful. If an error occurs while performing the operation, returns a
// success code whose value is ERROR_NOT_SUPPORTED
<Do Some Stuff>
// Perform some operation...
hr = PerformAnOperation();
hr = ERROR_NOT_SUPPORTED; // Swallow the error from PerformAnOperation and return an indicator to the caller.
If that had been done, it would have saved us a great deal of pain.
And, as I mentioned above, this was NOT a problem in any production systems. We’re also reviewing our unit tests to see how we can improve them so that problems like this get caught even earlier.
Ok, Kudos and comments…
Michael Ruck spotted the root cause of the error: the return value’s behavior was not specified, and thus the next poor schlump (me) coming along into the code fixes what appears to be a simple bug and thus breaks something.
Some other comments: Skywing was, of course, the first person to notice the ERROR_NOT_SUPPORTED (he/she was also the first person to post). I sort-of used him/her as bait, because I intentionally didn’t mention that the hr = ERROR_NOT_SUPPORTED was not an error. Upon reflection, I realize that that was a mistake.
A number of people were concerned about the IF_FAILED_JUMP macro. That’s just a convention we use in my group; I should have edited it out, since it complicated the problem. That was my bad.
Another group of people were concerned about the goto Error/goto Exit paradigm. This is actually a convention I picked up several years ago – you move all the error handling to the end of the function (out-of-line) and then jump back to a common “cleanup” or “exit” label. That allows the “normal” case to be in-line and isolates the error handling logic to one location.
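For readers who haven't seen the convention, here is a rough sketch of the shape. It is not the code from yesterday's article: DoSomething, StepThatMayFail, the errorPathRuns counter and the HRESULT_t stand-ins are all hypothetical names made up for this illustration, and it uses plain if tests rather than our IF_FAILED_JUMP macro.

```cpp
#include <cstdint>
#include <cstdlib>

// Portable stand-ins (assumptions); real code would use HRESULT, S_OK,
// E_FAIL and E_OUTOFMEMORY from the Windows headers.
using HRESULT_t = std::int32_t;
constexpr HRESULT_t kOk = 0;
constexpr HRESULT_t kFail = static_cast<HRESULT_t>(0x80004005u);         // E_FAIL
constexpr HRESULT_t kOutOfMemory = static_cast<HRESULT_t>(0x8007000Eu);  // E_OUTOFMEMORY

// Hypothetical operation that fails on request.
HRESULT_t StepThatMayFail(bool fail) { return fail ? kFail : kOk; }

HRESULT_t DoSomething(bool failSecondStep, int* errorPathRuns) {
    HRESULT_t hr = kOk;
    void* buffer = nullptr;  // declared up front so no goto skips an initialization

    // The "normal" case reads straight down the function.
    buffer = std::malloc(64);
    if (buffer == nullptr) { hr = kOutOfMemory; goto Error; }

    hr = StepThatMayFail(failSecondStep);
    if (hr < 0) goto Error;

Exit:
    // Common cleanup, run on both the success and the failure path.
    std::free(buffer);  // free(nullptr) is a no-op
    return hr;

Error:
    // Out-of-line error handling (logging, rollback, ...), then back to cleanup.
    ++*errorPathRuns;
    goto Exit;
}
```

The point of the layout is that cleanup lives in exactly one place, so adding a new step to the function can't accidentally skip it.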
Skywing also pointed out that returning ERROR_NOT_SUPPORTED STILL isn’t correct: the facility should have been set to FACILITY_WIN32. He’s absolutely right.
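One reading of that suggestion is to build a success HRESULT that carries both the facility and the error number, which is what the MAKE_HRESULT macro in winerror.h is for. The sketch below mirrors that macro with portable stand-ins; MakeHresult, HresultFacility, HresultCode and the k-prefixed constants are hypothetical names for this illustration, not the real Windows definitions, and this is not necessarily what the production code ended up doing.

```cpp
#include <cstdint>

// Portable stand-ins for the Windows definitions (assumptions).
using HRESULT_t = std::int32_t;
constexpr std::uint32_t kSeveritySuccess = 0;     // value of SEVERITY_SUCCESS
constexpr std::uint32_t kFacilityWin32 = 7;       // value of FACILITY_WIN32
constexpr std::uint32_t kErrorNotSupported = 50;  // value of ERROR_NOT_SUPPORTED

// Mirrors MAKE_HRESULT(sev, fac, code).
constexpr HRESULT_t MakeHresult(std::uint32_t sev, std::uint32_t fac,
                                std::uint32_t code) {
    return static_cast<HRESULT_t>((sev << 31) | (fac << 16) | code);
}

// Mirror SUCCEEDED(), HRESULT_FACILITY() and HRESULT_CODE().
constexpr bool Succeeded(HRESULT_t hr) { return hr >= 0; }
constexpr std::uint32_t HresultFacility(HRESULT_t hr) {
    return (static_cast<std::uint32_t>(hr) >> 16) & 0x1FFFu;
}
constexpr std::uint32_t HresultCode(HRESULT_t hr) {
    return static_cast<std::uint32_t>(hr) & 0xFFFFu;
}

// A success code that still says "Win32: not supported".
constexpr HRESULT_t kNotSupportedButOk =
    MakeHresult(kSeveritySuccess, kFacilityWin32, kErrorNotSupported);
```

With this value the caller still sees success, but can recover both the facility and the original error number instead of a bare 50 with no facility set.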
A number of other people suggested that the routine should be changed to return either BOOL or bool, but (as was immediately pointed out) that throws away error code information.
And again, several people mentioned throwing exceptions. The code’s neutral w.r.t. exception handling (although I disagree with the whole “throw an exception to return an error” paradigm, that’s a different issue).