Error Codes, again...

One of the tech writers in my group just asked a question about documenting error codes.

I've written about my feelings regarding documenting error codes in the past, but I've never actually written about what it means to define error codes for your component.

The critical thing to recognize about error codes is that they are all about diagnosability. They're about providing enough information for someone to figure out the cause of a problem.  This is true whether you use error codes or exceptions, btw - they're all mechanisms for diagnosing failures.

Error codes serve two related purposes.  You need to be able to provide information to the developer of an application that allows that developer to diagnose the cause of a failure (or to let the developer of an application determine the appropriate corrective action to take in the event of a failure).  And you need to be able to provide information to the user of the application that hosts your control to allow them to diagnose the cause of a failure.

The second purpose above is why there are APIs like FormatMessage, which let you retrieve a string version of a system error.  Or waveOutGetErrorText, which does the same thing for the multimedia APIs (there's a similar midiOutGetErrorText, etc).  These APIs allow you to get a human readable error string for any system error.
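As a rough illustration of the "error number in, readable string out" idea, here is a minimal sketch using the portable C++ standard library rather than FormatMessage itself. The helper name describe_system_error is made up for this example, and the exact text you get back depends on the platform and locale.

```cpp
#include <string>
#include <system_error>

// Hypothetical helper (not a Windows API): turn a native OS error number
// into text that can be shown to a user. The number is whatever the
// platform reports: a GetLastError() value on Windows, errno on POSIX.
std::string describe_system_error(int code) {
    // std::system_category() interprets the value as a native OS error.
    std::error_code ec(code, std::system_category());
    return ec.message();
}
```

Calling describe_system_error(2) should give you the platform's wording for a "file not found" style failure, since 2 is ERROR_FILE_NOT_FOUND on Windows and ENOENT on typical POSIX systems.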

One of the basic requirements for any interface is that you define the errors that will be returned by that interface.  It's a fundamental part of the contract (and every interface defines a contract).

Now your definition of errors can be simple ("Returns an HRESULT which defines the failure") or it can be complex ("When the frobble can't be found, it returns E_FROBLE_NOT_FOUND").  But you need to define your error codes.

When you define your error codes, you essentially have three choices:

  1. You can choose to simply let the lower level error code bubble up to your caller.
  2. You can choose to define new error codes for your component.
  3. You can completely define the error codes that your component returns.

There are pros and cons to each of these choices.

The problem with the first choice is that the low level error code is often meaningless.  Or worse, it may be misleading.  A great example of this occurs if you mess up the AEDebug registry key for an application.  The loader will attempt to access this registry key, and if there is an error (like an entry not found), it will bubble the failure up to the caller.  This can result in your getting an ERROR_FILE_NOT_FOUND error when you try to launch your application, even though the application is there - the problem is that the AEDebug registry key pointed to a debugger that wasn't found.  But bubbling the failure up has killed diagnosability - the actual problem had to do with the parsing of a registry key, but the caller has no way of knowing that.  This is also yet another example of Joel's Law of Leaky Abstractions - the lower level information leaked to the higher level.

The problem with the second choice is that it hides the information from the lower level abstraction.  It's just the opposite of the first problem - sometimes you WANT the abstraction to leak, because otherwise useful information gets lost.  For instance, in the component on which I'm working, RPC_X_ENUM_VALUE_OUT_OF_RANGE, RPC_X_BYTE_COUNT_TOO_SMALL, and a couple of other RPC errors are mapped to E_INVALIDARG.  While E_INVALIDARG is reasonably accurate (these are all errors in the arguments), RPC returned specific information about the failure, and mapping the error masks it.  So there has been a loss of specificity about the error, which once again hinders diagnosability - it's harder to debug the problem from the error.  On the other hand, the errors that are returned are domain specific.
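To make the loss of specificity concrete, here's a small sketch of that kind of mapping. The enums and the translate function are hypothetical stand-ins (they are not the real RPC status codes or HRESULTs, and this is not my component's actual code), but the shape of the problem is the same: several distinct low level failures collapse into a single component level error.

```cpp
// Hypothetical low level status codes, standing in for RPC errors.
enum class RpcStatus {
    Ok,
    EnumValueOutOfRange,
    ByteCountTooSmall,
    ServerUnavailable
};

// Hypothetical component level error codes, standing in for HRESULTs.
enum class ComponentError {
    None,
    InvalidArgument,
    Unavailable
};

// Translate a low level status into the component's own error domain.
ComponentError translate(RpcStatus status) {
    switch (status) {
    case RpcStatus::Ok:
        return ComponentError::None;
    case RpcStatus::EnumValueOutOfRange:
    case RpcStatus::ByteCountTooSmall:
        // Two different causes become one code; the caller can no
        // longer tell which argument problem actually occurred.
        return ComponentError::InvalidArgument;
    case RpcStatus::ServerUnavailable:
        return ComponentError::Unavailable;
    }
    return ComponentError::Unavailable;  // defensive default for unknown values
}
```

A caller that gets ComponentError::InvalidArgument back has an error that makes sense in the component's domain, but has no way to recover which of the two underlying statuses produced it.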

The third choice (locking down the set of error codes returned) is what was done in my linked example.  The problem with this is that it locks you into those error codes forever.  You will NEVER have an opportunity to change them, even if something changes underneath.  So when the time comes to add offline storage to your file system, you can't add a "tape index not found" error to the CreateFile API because it wasn't one of the previously enumerated error codes.

The first is a recipe for confusion, especially when the lower level error codes apply to another domain - what do you do if CreateThread returns ERROR_PATH_NOT_FOUND?  The third option is simply an unmitigated nightmare for the long term viability of your system.

My personal choice is #2, even with the error hiding potential.  But you need to be very careful to ensure that your choice of error codes is appropriate - you need to ensure that you provide enough diagnostic information for a developer to determine the cause of the failure while retaining enough domain specific information to allow the user to understand the cause of the failure.

Interestingly enough, CLR exceptions handle the leaky abstraction issue neatly by defining the Exception.InnerException property, which allows you to retain the original cause of the error.  This allows a developer attempting to diagnose a failure to see the ACTUAL cause of the failure, while allowing the component to define a failure that's more germane to its problem domain.
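The same idea can be sketched outside the CLR.  The example below uses standard C++ nested exceptions (std::throw_with_nested and std::rethrow_if_nested) as an analogue of InnerException; it is not the CLR mechanism itself, and the function names and messages are invented purely for illustration.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical low level operation that fails with a transport-specific error.
void read_raw_setting() {
    throw std::runtime_error("byte count too small");
}

// Hypothetical component call: it reports a failure in its own domain,
// but keeps the original exception nested inside the new one.
void load_setting() {
    try {
        read_raw_setting();
    } catch (...) {
        std::throw_with_nested(std::invalid_argument("invalid setting"));
    }
}

// Walk the chain from the outermost exception inward, collecting messages.
std::string describe_chain(const std::exception& e) {
    std::string text = e.what();
    try {
        // Rethrows the nested (inner) exception if there is one.
        std::rethrow_if_nested(e);
    } catch (const std::exception& inner) {
        text += " <- " + describe_chain(inner);
    } catch (...) {
        text += " <- unknown error";
    }
    return text;
}
```

A caller that only cares about the component's domain catches std::invalid_argument and stops there, while a developer diagnosing the failure can pass the caught exception to describe_chain to see the original cause as well.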