I’ve started commenting on API design lessons we’ve learned from mistakes in ICorDebug. I previously blogged about dangers of doing complicated work in IUnknown::Release or C++ destructors. Another API design point is [update: include 2nd half of sentence]:
Clearly describe object lifespans in terms of the other methods in the API.
Initially, it may seem like an object has an “obvious” lifespan. However, after the API evolves a little, designers get replaced, and you get a few more fresh opinions, it may not seem obvious anymore. Things that were “obvious” 6 years ago when we first designed ICorDebug aren’t so obvious today.
Why are well-defined lifespans important?
1) If an object does not have a well-defined lifespan, then it’s unclear when certain operations become illegal. This can cause unspecified behavior or non-determinism in the API. Furthermore, the various methods on the object may become illegal at different times.
2) If the object holds external resources, it needs a well-defined lifespan to determine when it can clean up those resources. COM-classic objects may not be able to clean them up in the final call to Release, for reasons described here.
3) If the object’s lifespan is not explicitly defined, it will be implicitly defined by the current implementation, because clients will make assumptions based on what they observe.
4) Objects with poorly defined lifespans are subject to recycling bugs. This is the class of bugs where an object does not recognize that its underlying data has been destroyed, and then starts operating on whatever data now occupies that spot. For example, identifying a process by a PID can yield a recycling bug if the original process exits and a new process gets created with the same PID.
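To make the recycling bug in (4) concrete, here is a minimal sketch of one common defense: pairing each reusable id with a generation counter so that a stale handle is rejected instead of silently aliasing the new occupant. This is not how ICorDebug or the OS tracks processes; ProcessTable, Handle, and TryLookup are hypothetical names invented for the illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical handle: a reusable slot index plus the generation it was issued under.
struct Handle {
    std::uint32_t slot;
    std::uint32_t generation;
};

// Hypothetical table standing in for "things identified by a recyclable id".
class ProcessTable {
public:
    // Registers an entry, reusing a free slot if one exists (like PID reuse).
    Handle Add(const std::string& name) {
        for (std::uint32_t i = 0; i < slots_.size(); ++i) {
            if (!slots_[i].live) {
                slots_[i].live = true;
                slots_[i].name = name;
                return Handle{i, slots_[i].generation};
            }
        }
        slots_.push_back(Slot{name, 0, true});
        return Handle{static_cast<std::uint32_t>(slots_.size() - 1), 0};
    }

    // Ends the entry's life; bumping the generation invalidates old handles.
    void Remove(Handle h) {
        if (IsLive(h)) {
            slots_[h.slot].live = false;
            ++slots_[h.slot].generation;
        }
    }

    // Succeeds only if the handle still refers to the entry it was issued for.
    bool TryLookup(Handle h, std::string& out) const {
        if (!IsLive(h)) return false;
        out = slots_[h.slot].name;
        return true;
    }

private:
    struct Slot {
        std::string name;
        std::uint32_t generation;
        bool live;
    };

    bool IsLive(Handle h) const {
        return h.slot < slots_.size() && slots_[h.slot].live &&
               slots_[h.slot].generation == h.generation;
    }

    std::vector<Slot> slots_;
};

// Usage: the stale handle is detected even though its slot was recycled.
inline void DemoRecycling() {
    ProcessTable table;
    Handle first = table.Add("first");
    table.Remove(first);
    Handle second = table.Add("second");  // reuses the same slot
    std::string name;
    assert(second.slot == first.slot);
    assert(!table.TryLookup(first, name));
    assert(table.TryLookup(second, name));
}
```

A bare slot index alone would have returned "second" for the first handle, which is exactly the PID-style recycling bug.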
Here are some examples where this came up in ICorDebug.
1) Value inspection. An ICorDebugValue represents a variable in the debuggee (such as a local variable of type System.Object). Should the ICorDebugValue be valid across continues? It turns out it depends on a bunch of random factors such as what type of data the value is. This is confusing and bug-prone.
2) Stack frames. ICorDebugChain describes part of a stack trace. Should an ICorDebugChain object be valid even after the debuggee has resumed and the thread is no longer at that stack? We never clearly defined this in v1.1. It turned out some methods would succeed, some would fail, and some were undefined. In v2.0, we tried to dispose the entire object, and then found that clients needed some methods to still work even across continues.
3) Process exit. An ICorDebugProcess object represents the debuggee’s process. Should it be valid even after the debuggee has exited? It turns out some methods are valid (like getting the PID) but most aren’t.
4) Enumerators. ICorDebug allows enumeration of many things (like threads in a process and variables in a function) via ICorDebugEnum interfaces. When does the enumerator become invalid? What if an element dies in the middle of an enumeration?
How to solve this?
1) Clearly specify the lifespan in terms of the other methods in the API. Don’t just describe lifespans in terms of fuzzy concepts that can’t be aggressively checked and enforced. Identify a specific set of methods that are guaranteed to make the object dead (such as Dispose). If an object is only valid during the lexical scope of a callback, state that explicitly, lest a client cache the object globally and then use it outside the callback.
2) Embrace an IDisposable-style Dispose pattern. This will help get clients thinking about lifespans.
3) Enforce the lifespan. Clients will take a dependency on the actual lifespan, not necessarily the documented lifespan. There’s also a support cost here: if a client misuses the library and the library crashes, the library authors get hit with the initial investigation. Maybe all they get is a crash dump from Watson, and they may not even be able to trace it back to a client bug. A robust library insulates itself from client misbehavior and deflects these sorts of support costs.
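Here is a rough sketch of what points (1) and (3) could look like together: an object that is documented as valid only during a callback, and a library that enforces this by marking the object dead as soon as the callback returns. None of this is the real ICorDebug API; Frame, Status, Neuter, and DispatchBreakpoint are hypothetical names, and a real COM interface would return HRESULTs rather than this enum.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical status codes; a real COM API would return HRESULTs instead.
enum class Status { Ok, ObjectNeutered };

// Hypothetical object specified as valid only for the duration of the callback.
class Frame {
public:
    explicit Frame(int depth) : depth_(depth) {}

    // Every method checks the lifespan first, so misuse fails predictably.
    Status GetDepth(int* out) const {
        if (neutered_) return Status::ObjectNeutered;
        *out = depth_;
        return Status::Ok;
    }

    // The one operation that is specified to end the object's life.
    void Neuter() { neutered_ = true; }

private:
    int depth_;
    bool neutered_ = false;
};

// Library entry point: hands the frame to the client, then ends its lifespan.
inline void DispatchBreakpoint(
    const std::function<void(std::shared_ptr<Frame>)>& callback) {
    auto frame = std::make_shared<Frame>(3);
    callback(frame);
    frame->Neuter();  // enforced by the library, not just documented
}

// Usage: a client that caches the frame gets a clean error, not a crash.
inline void DemoEnforcedLifespan() {
    std::shared_ptr<Frame> cached;
    int depth = 0;
    DispatchBreakpoint([&](std::shared_ptr<Frame> f) {
        assert(f->GetDepth(&depth) == Status::Ok);
        cached = f;  // client bug: keeps the object past the callback
    });
    // The memory is still alive (shared ownership), but the object is dead.
    assert(cached->GetDepth(&depth) == Status::ObjectNeutered);
}
```

The reference count only keeps the memory around; the lifespan is ended by a specific call, so every later use produces the same, diagnosable failure.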