This post describes a problem we encountered during the development of Visual Studio 2010, when we rewrote some components in managed code, and what we did to solve it. This is a rather lengthy technical discussion and I apologize in advance for its dryness. It’s not essential to understand every detail, but the lesson we learned here may be valuable for others working with large “legacy” code bases.
As I mentioned in the Background, we rewrote several components for Visual Studio 2010: specifically, the window manager, the command bars and the text editor. In previous versions of Visual Studio, these were native components written in C++. In Visual Studio 2010, each was rewritten in managed code using C#.
Extensions which plug into Visual Studio communicate with these components through COM interfaces. Moving to managed code doesn’t change how these extensions communicate with platform components. Indeed, that’s the promise of interface-based programming – i.e. you don’t need to know implementation details in order to communicate with a component via its interface. In the case of the new editor, it so happens that we introduced a new, managed programming model for new extensions, but even so, we had to keep the existing COM interfaces for older extensions.
Managed code and COM are brought together through the magic of COM Interop. Briefly, this allows two things to happen:
- Managed code can make calls through a COM interface just as if they were regular method calls.
- Managed classes can implement COM interfaces so that they may be called as COM objects from either native or managed code.
Let’s take each of these in turn. (If you already know how COM Interop works, you can skip the following two sections.)
Calling COM objects from Managed code
The Common Language Runtime, CLR or just the “Runtime”, can make a COM object look just like a regular managed object. This is a special kind of object called a “Runtime Callable Wrapper” or RCW. RCWs bridge the managed, garbage-collected world with the native, ref-counted world. An RCW is created when “an IUnknown enters the runtime” (IUnknown is the minimum interface that all COM objects must implement). When does that happen? Usually, as the result of an interop call to a native method which hands back a COM interface. In fact, typically, it’s the result of a method call on an existing COM object. Since that sounds a little bit like a “chicken and egg problem”, let me give a concrete example. At the heart of the Visual Studio platform lies the Global Service Provider. This service provider keeps track of services offered up (“proffered”) by components in the system. Other components can request a service by calling the IServiceProvider.QueryService method on the Global Service Provider object. If successful, the service returned to the caller will be another COM object, identified by a pointer to its IUnknown interface. If the component making the QueryService call is managed then, at the point where that pointer enters the Runtime, an RCW is created for the service. Of course, this still raises the question: “How did the managed component get hold of the Global Service Provider?” The answer is that the Global Service Provider was passed to the managed component by the platform when that component was first initialized.
Implementing COM interfaces on managed objects
The tools and the Runtime make this very easy. To implement a COM interface on a managed object, you first need to locate or create an interop assembly containing the managed equivalent of that COM interface. By referencing that interop assembly from managed code and writing classes which implement those interfaces, you create COM-compatible managed classes. There are a few other requirements (e.g. your classes must also be marked as COM-visible, either at the assembly level or on a per-class basis), but otherwise it’s straightforward. When an instance of one of these classes is passed through the interop layer to native code, the CLR creates a COM Callable Wrapper or CCW. The CCW, among other things, preserves all the COM rules about identity and the lifetime of the wrapped object. For example, for as long as at least one native component holds a reference on the CCW, the underlying managed object cannot be claimed by the Garbage Collector, even if there are no other managed roots. As far as the native code is concerned, it deals with an IUnknown, unaware that the object is really a managed object.
Marshal.ReleaseComObject – a problem disguised as a solution
With that rather lengthy (my apologies) recap of COM Interop out of the way, let me describe the problem. Imagine, for the sake of a simplified example, that you have a component called the “Text Manager”. The Text Manager, as you might guess, handles requests about textual things in an editor. Other components communicate with the Text Manager via the ITextManager interface, with methods such as “GetLines” or “HighlightWord”. ITextManager is a COM interface. Now imagine that there’s a second component that implements a “Search” facility for finding words in a document. The Search component is written in managed code. Obviously, this Search component will need access to the Text Manager to get its job done, and I’m going to lead you through the scenario of performing a “Find” – once when the Text Manager is implemented in native code, and a second time when the Text Manager is managed.
The ‘find’ operation begins with the Search component asking for the Text Manager service via the Global Service Provider. This succeeds and the Search component gets back a valid instance of ITextManager. Since, in this first walkthrough, the Text Manager is a native COM object, the IUnknown returned is wrapped by the runtime in an RCW. As far as the Search component is concerned, though, it sees ITextManager. It doesn’t know or care (yet) whether the actual implementation is native or managed. The ‘find’ operation continues with the Search component making various calls through ITextManager to complete its task. When the task is done, the ‘find’ operation exits and life is good. Well… almost. The ITextManager is an RCW and, as such, it has the same kind of lifetime semantics as any other managed object – i.e. it will be cleaned up as and when the Garbage Collector runs. If there’s not much memory pressure in the system, then the Garbage Collector may not run for a long time – if at all – and here is where the native and managed memory models clash to create a problem. You see, as far as the Search component is concerned, it’s finished with the Text Manager – at least until the next ‘find’ operation is requested. If there were no other components needing the Text Manager, now would be a great time for the Text Manager to be cleaned up. Indeed, if the Search component were written in native code, at the point of exiting the ‘find’ routine, it would call “Release” on the ITextManager to indicate that it no longer needs the reference. Without that final “Release”, it looks like a reference counting leak of the Text Manager – at least until the next garbage collection. This is a special, though not unusual, case of non-deterministic finalization.
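The difference between the two lifetime models can be sketched with a toy simulation. This is not CLR or COM code: it is a small Python model, and every name in it (NativeComObject, ToyRuntime, native_find, managed_find) is hypothetical and invented for illustration. It assumes a "runtime" whose wrappers give back their native reference only when a collection happens, and it simplifies by treating every wrapper as unreachable at collection time.

```python
class NativeComObject:
    """Toy stand-in for a ref-counted native COM object (not real COM)."""

    def __init__(self, name):
        self.name = name
        self.ref_count = 0
        self.destroyed = False

    def add_ref(self):
        self.ref_count += 1
        return self.ref_count

    def release(self):
        self.ref_count -= 1
        if self.ref_count == 0:
            self.destroyed = True  # a real COM object would free itself here
        return self.ref_count


class ToyRuntime:
    """Toy stand-in for the CLR: wrappers release only when collect() runs."""

    def __init__(self):
        self._wrapped = []

    def wrap(self, native):
        # The wrapper takes one reference on the native object.
        native.add_ref()
        self._wrapped.append(native)
        return native

    def collect(self):
        # Simplification: assume every wrapper is unreachable by now.
        for native in self._wrapped:
            native.release()
        self._wrapped.clear()


def native_find(text_manager):
    """A native caller releases its reference on the way out."""
    text_manager.add_ref()
    try:
        pass  # ... calls through ITextManager would go here ...
    finally:
        text_manager.release()


def managed_find(runtime, text_manager):
    """A managed caller just drops the wrapper; nothing is released yet."""
    wrapper = runtime.wrap(text_manager)
    # ... calls through ITextManager would go here ...
    return None


if __name__ == "__main__":
    tm = NativeComObject("TextManager")
    rt = ToyRuntime()
    managed_find(rt, tm)
    print("after managed find:", tm.ref_count, tm.destroyed)
    rt.collect()
    print("after collection:", tm.ref_count, tm.destroyed)
```

In this model, the native caller leaves the reference count balanced as soon as it returns, while the managed caller leaves one reference outstanding until collect() is called. That outstanding reference is what shows up as an apparent leak in the scenario described above.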
This is just an example, but situations just like it really happened many times during Visual Studio 2005 and 2008 development. The bug reports would say that ‘expensive’ components were being reported as leaked objects, usually at shutdown. The “solution”, as a few people discovered, was to insert a call to “Marshal.ReleaseComObject” at the point where the expensive component (the Text Manager in our example) was no longer needed. The RCW is released, causing its internal reference count to drop by one and, typically, releasing the underlying COM object. No more leaked references and problem solved! Well, at least for now, as we’ll see. Regretfully, once this “solution” appeared in the source code of a few components, it spread rapidly as the ‘quick fix’ for leaked components and that’s how we shipped. The trouble started when we began migrating some components from native code to managed code in VS 2010.
To explain, I’ll return to the ‘find’ scenario, this time with the Text Manager written in managed code. The Search component, as before, requests the Text Manager service via the Global Service Provider. Again, an ITextManager instance is returned and it’s an RCW. However, this RCW is now a wrapper over a COM object which is implemented in managed code – a CCW. This double wrapping (an RCW around a CCW) is not a problem for the CLR and, indeed, it should be transparent to the Search component. Once the ‘find’ operation is complete, control leaves the Search component and life is good. Except that, on the way out, the Search component still calls “Marshal.ReleaseComObject” on the ITextManager’s RCW and, “oops!”, we get an ArgumentException with the message “The object’s type must be __ComObject or derived from __ComObject.” You see, the CLR is able to see through the double-wrapping to the underlying component and figure out that it is really a managed object.
There’s really no workaround for this except to find all the places where “ReleaseComObject” was called and remove them. Some have suggested that, before calling ReleaseComObject, we should first check whether it’s going to succeed by calling “Marshal.IsComObject” but, as we’ll see in the next section, there is another, more insidious problem still lurking.
Marshal.ReleaseComObject – the silent assassin
For this second problem, we’ll return to our original example, with the Text Manager implemented in native code. Even with the ‘safeguard’ of Marshal.IsComObject, the Search component calls ReleaseComObject and goes on its way. However, the RCW has now been poisoned. As far as the CLR is concerned, by calling ReleaseComObject, the program has declared that the RCW is no longer needed. However, it’s still a valid object, and that means it may be reachable from other managed code. If it is reachable, then the next time ITextManager is accessed from managed code through that RCW, the CLR will throw an InvalidComObjectException with a message of “COM object that has been separated from its underlying RCW cannot be used”.
How can that happen? There are several ways – some common and some subtle. The most common case of attempting to re-use an RCW is when the services are cached on the managed side. When services are cached, instead of returning to the Global Service Provider each time the Text Manager (for example) is requested, the code first checks in its cache of previously requested services, helpfully trying to eliminate a (potentially costly) call across the COM interop boundary. If the service is found in the cache, then the cached object (an RCW) is returned to the caller. If two components request the same service, then they will both get the same RCW. Note that this ‘cache’ doesn’t have to be particularly complicated or obvious – it can be as subtle as storing the service in a field (member variable) for later use.
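The caching scenario can be made concrete with another toy simulation. Again, this is a Python model rather than CLR code, and all the names in it (ToyRcw, ServiceCache, search_find, outline_refresh, InvalidComObjectError) are hypothetical. It assumes a wrapper that becomes unusable after a single release call, which simplifies the real behavior, where the RCW keeps its own count and is detached when that count reaches zero.

```python
class InvalidComObjectError(Exception):
    """Toy analog of the CLR's InvalidComObjectException."""


class ToyRcw:
    """Toy wrapper that refuses all calls once it has been released."""

    def __init__(self, name):
        self.name = name
        self._released = False

    def get_lines(self):
        if self._released:
            raise InvalidComObjectError(
                "COM object that has been separated from its "
                "underlying RCW cannot be used"
            )
        return ["line 1", "line 2"]


def release_com_object(rcw):
    """Toy analog of Marshal.ReleaseComObject (single-count simplification)."""
    rcw._released = True


class ServiceCache:
    """Managed-side cache: every caller gets the same wrapper instance."""

    def __init__(self):
        self._services = {}

    def get_service(self, name):
        if name not in self._services:
            self._services[name] = ToyRcw(name)
        return self._services[name]


def search_find(cache):
    """The 'assassin': uses the service, then releases the shared wrapper."""
    text_manager = cache.get_service("TextManager")
    text_manager.get_lines()
    release_com_object(text_manager)


def outline_refresh(cache):
    """An innocent second component that trusts the cached wrapper."""
    text_manager = cache.get_service("TextManager")
    return text_manager.get_lines()


if __name__ == "__main__":
    cache = ServiceCache()
    outline_refresh(cache)   # works
    search_find(cache)       # poisons the cached wrapper
    try:
        outline_refresh(cache)
    except InvalidComObjectError as error:
        print("second component failed:", error)
```

The failure surfaces in outline_refresh, which contains no release call at all. The call that caused it sits in search_find and has already returned by the time anything goes wrong.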
I’ve called this use of Marshal.ReleaseComObject the “silent assassin” because, while the problem occurs at the point of the call to ReleaseComObject, it is not detected until later when another component innocently tries to access the poisoned RCW. At first glance, it appears that the second component has a bug, but it does not – the component that called ReleaseComObject is the assassin and ‘he has left the room’.
The lesson here is: If you’re tempted to call “Marshal.ReleaseComObject”, can you be 100% certain that no other managed code still has access to the RCW? If the answer is ‘no’, then don’t call it. The safest (and sanest) advice is to avoid Marshal.ReleaseComObject entirely in a system where components can be re-used and versioned over time. While you may be 100% certain of the way the components work today and believe that a ‘poisoned’ RCW could never be accessed, that belief may be shattered in the future when some of those components’ implementations change.
Fixing the mistake in Visual Studio 2010
In VS 2010, we scrubbed our code for instances of Marshal.ReleaseComObject and asked component authors to either remove or justify each occurrence. In our own code we found many instances, including in common library code used by managed packages. We were so concerned about these components failing at run time that we actually created patched versions of our Managed Package Framework for VS 2005 and VS 2008 so that, when loaded in VS 2010, they would not have ReleaseComObject problems. You’ll see these patched versions appear as binding redirects in “devenv.exe.config” for Microsoft.VisualStudio.Shell and Microsoft.VisualStudio.Shell.9.0.
What’s Old is New Again
Microsoft Distinguished Engineer, Chris Brumme, offered some sage advice about Marshal.ReleaseComObject back in 2003. It’s worth a read because it shows that we were thinking about this problem way back then. In case it isn’t obvious, Visual Studio is in category #2 on Chris’ list at the end of the post.
Mason Bendixen’s Blog also has a nice collection of notes on COM interop and, in particular, this one on RCWs is germane because it talks about the per-AppDomain mapping of IUnknowns to RCWs.
Paul Harrington – Principal Developer, Visual Studio Platform Team
Biography: Paul has worked on every version of Visual Studio .Net to date. Prior to joining the Visual Studio team in 2000, Paul spent six years working on mapping and trip planning software for what is today known as Bing Maps. For Visual Studio 2010, Paul designed and helped write the code that enabled the Visual Studio Shell team to move from a native, Windows 32-based implementation to a modern, fully managed presentation layer based on the Windows Presentation Foundation (WPF). Paul holds a master’s degree from the University of Cambridge, England and lives with his wife and two cats in Seattle, Washington.