Don’t do “complicated” work in Release().

BradA has been talking about API design guidelines, which reminds me of a bunch of subtle API design lessons we’ve learned from mistakes made in ICorDebug. I’ve started a list and it’s depressingly long. As I blog about these, I’ll pull samples from ICorDebug, but you don’t need to know ICorDebug to appreciate these issues.

On of the more subtle yet significant issues regards calling IUnknown::Release():

 

Don’t have your implementations of IUnknown::Release (or destructors) for external objects do “complicated” work.

 

This is not just an implementation detail - avoiding this may require an API change.

 

([update]: fixed this code snippet after reading Junfeng's comment here)

An object’s destructor has the same caveats if Release() is often implemented like:

if (!InterlockedDecrement(&m_ref)) { delete this; }

 

 

This was sadly a common issue in V1.0 ICorDebug because our inprocess debugging model means ICorDebug objects in the debugger may often hold resources in the debuggee. For example, ICorDebugValue and ICorDebugEval held resources in the debuggee process that were only freed in the dtor.  Furthermore, these resources could only be freed when the debuggee was in a certain state (specifically: stopped at a managed debug event).

 

In a free-threaded API (such as ICorDebug),  Release / dtors have the following issues:

1) Clients can call Release on any thread. For example, MDbg’s managed wrappers may invoke Release() on the finalizer thread at a random time of the GC’s choosing.

2) Clients can call Release at any time, including “bad” times such as when the library is not in a good state to do complicated work or in the middle of a race window.  If Release() is doing complicate work, it may race against other calls into the library. This is compounded by design patterns that may frequently toggle reference counts, such as smart (auto) pointers.

3) Bugs in the Clients app may mean Release is not called at all. Although this is technically a bug in the client, you may never be able to prove that and in the meantime, it will manifest as a bug in your library.

4) Clients may call Release on a set of objects in any order. If a client has references to objects A,B, and C, it can release them in any order. It may do A,B,C; or it may do C,B,A. Furthermore, if different threads are calling Release(), the order may be non-deterministic.  It is very easy for library developers to make innocent, but illegal, assumptions about shutdown order that will lead to crashes when violated. For example, what if C has a weak reference to A? In v1.0, many ICorDebug objects had a weak reference to their containing ICorDebugProcess object.  Another issue may be if the dtors need to take unordered locks, innocently calling Release() in a random order may potentially deadlock. In both cases, these are bugs in the library; but they are still easy mistakes to make and potentially difficult issues to uncover in testing.

5) Releases may cascade. If a dtor calls the final Release() on objects it owns, that may invoke other dtors, which may invoke others and cascade into a long chain of operations coming out of an seemingly simple Release(). This cascading effect compounds the above problems. Cascading releases which do invoke complicated dtors that are triggered by a random thread occuring at random times can also cause great deals of non-determinism in a library.

 

Given these issues, I’ll let you judge exactly what “complicated” means. Though it certainly includes anything involving managing an external resource or doing any cross-thread causality.

 

Here are some steps we’ve taken in v2.0 ICorDebug to mitigate these problems:

1) Provide something like an IDisposable pattern which lets you have an explicit lifetime on an object rather than just implicitly assume an object dies on the final release. The only resource a disposed object holds is the raw underlying memory of the object. It does not have any references to other objects and certainly no external resources. This addresses several issues:

a. You can then place limitations on when Dispose() can be called. For example, in ICorDebug, we only let you dispose objects when the debuggee is stopped.

b. This allows you to do the “complicated” work in the dispose() instead of the dtor.

c. This also protects you against a client not calling Release().  In such cases, the client will only leak the raw memory of the object, but not more precious resources (particularly resources that may not be reclaimed when the client exits).

d. It hoists the cascading effect to well spots where Dispose() is explicitly called.

If exposing this externally is a breaking change, you could still implement it internally. The library could maintain an extra reference to avoid the dtor being called at a bad time, and then dispose() at a good time and release that extra reference. This is what ICorDebug does in v2.0. Disposed objects will now return CORDBG_E_OBJECT_NEUTERED on most methods. In many cases, such error paths resulted in access violations (AV) in v1.1.

2) Don’t make promises about what happens in a call to Release() , because this may tie your hands. Unless you’re absolutely sure you can free the resource X held by the object immediately from any thread and at any time, you can’t say things like “Resource X will be freed immediately upon the final Release of the object.”. Rather, say “Resource X will be freed at some unspecified time after the final Release. Call Dispose() to free Resource X immediately.” This gives you the flexibility to do things like delay executing cleanup code until a safe time by internally queuing a cleanup worker. For example, ICorDebugHandleValue owns a GCHandle in the debuggee process which implements this pattern. Calling ICorDebugHandleValue::Release() will not immediately free the gchandle; it will queue it to be freed at the next opportune time.

3) Don’t do anything “complicated” in the dtor.  For example, don’t have a dtor do cross-process calls to free remote resources. By avoiding any “complicated” work in the dtor, you also protect yourself against calling Release() on a set of objects in any order. If all the dtors are trivial, then it won’t matter what order they’re done in. To reduce dtor complexity, consider:

a. using an IDisposable pattern to Offload complicated work.

b. having the dtor queue a cleanup function to be executed at a safe time.