Using AppDomain Isolation to Detect Add-In Failures [Jesse Kaplan]

One of the nice aspects of isolating add-ins in a different AppDomain is that it makes it easier to prevent failures in the add-in from impacting the host, and at the same time improves your ability to detect when a problem has occurred and which piece of code was the culprit. Developers of extensible applications often quickly discover that a major source of support calls is actually bugs in the add-ins rather than in the application itself; being able to identify the failing add-in is therefore an important feature of many extensible applications.

The actions an application takes once it has detected a failing add-in are up to the individual developer, but the pattern for detecting these failures will largely remain the same. There are three major ways an add-in can cause problems for a host: machine state corruption, unhandled exceptions, and resource exhaustion. Machine state corruption can be addressed by sandboxing add-ins with a limited permission set and thus isolating them from the machine. We’ll cover resource exhaustion at a later point, but today’s focus is unhandled exceptions.

There are actually two types of exceptions that a host has to be wary of: those thrown by the add-in during a call into it by the host, and those thrown by add-in code on threads originating from the add-in itself. The first class is easier, as all a host has to do is put a catch block around calls into the add-in and then decide how to deal with them. Unhandled exceptions on add-in originated threads are harder because the host isn’t on the stack and can’t catch them. Starting with CLR v2.0, an unhandled exception on a child thread causes the entire process to be torn down, so it is impossible for a host to completely recover from one. With a little work, though, the host can detect which AppDomain caused the problem and, assuming it gives each add-in its own domain, which add-in was responsible; it can then log the failure before exiting and even restart.
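For the first, easier class of failure, the catch block around a call into the add-in might look roughly like the sketch below. The ICalculatorAddIn interface, the SafeInvoker helper, and the FaultyAddIn class are hypothetical names invented for illustration; they are not taken from the attached sample, and a real host would substitute its own add-in view type.

```csharp
using System;

// Hypothetical add-in view; the sample's real calculator interface may differ.
public interface ICalculatorAddIn
{
    double Operate(string operation, double a, double b);
}

// Hypothetical misbehaving add-in, used only to exercise the failure path.
public sealed class FaultyAddIn : ICalculatorAddIn
{
    public double Operate(string operation, double a, double b)
    {
        throw new InvalidOperationException("add-in bug");
    }
}

public static class SafeInvoker
{
    // Calls into the add-in on the host's thread and hands any exception
    // back to the caller instead of letting it propagate through the host.
    public static bool TryOperate(ICalculatorAddIn addIn, string operation,
                                  double a, double b,
                                  out double result, out Exception failure)
    {
        try
        {
            result = addIn.Operate(operation, a, b);
            failure = null;
            return true;
        }
        catch (Exception ex)
        {
            // The host decides what to do next: log, disable the add-in, etc.
            result = 0;
            failure = ex;
            return false;
        }
    }
}
```

This only covers exceptions thrown while the host is on the stack. An exception thrown on a thread the add-in created itself never reaches this catch block, which is why the rest of the post relies on the UnhandledException event.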

In the attached sample we use the AppDomain.UnhandledException event and the AddInController class to add failure detection logic to our existing calculator sample, logging failures to disk and tagging add-ins that have previously crashed as potentially unreliable. Most of the interesting work is done in the UnhandledExceptionHelper class inside the host’s project. Each time the host activates an add-in, it calls into this class, passing in the instance it receives back from AddInToken.Activate. The UnhandledExceptionHelper class then uses the AddInController class to associate that add-in with the AppDomain it was created in, and it detects and logs failures in that domain as being caused by the add-in activated in it.
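The bookkeeping described above can be sketched as follows. This is not the sample’s UnhandledExceptionHelper; AddInFailureTracker, Register, and Describe are hypothetical names, and the log format is an assumption. To keep the block self-contained it takes the AppDomain and add-in name directly; in a System.AddIn host those values would presumably come from AddInController.GetAddInController(addIn), via its AppDomain and Token properties.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: maps each add-in's AppDomain to the add-in's name and
// logs unhandled exceptions against that name.
public sealed class AddInFailureTracker
{
    private readonly Dictionary<int, string> _addInByDomainId =
        new Dictionary<int, string>();
    private readonly object _gate = new object();
    private readonly string _logPath;

    public AddInFailureTracker(string logPath)
    {
        _logPath = logPath;
        // Assumes the host runs in the default AppDomain, as the post requires.
        AppDomain.CurrentDomain.UnhandledException += OnUnhandledException;
    }

    // Call once per activated add-in, e.g. with the controller's AppDomain
    // and the add-in token's Name in a System.AddIn host.
    public void Register(AppDomain domain, string addInName)
    {
        lock (_gate) { _addInByDomainId[domain.Id] = addInName; }
    }

    // Builds one log line: timestamp, domain name, add-in name, exception.
    public string Describe(AppDomain domain, object exceptionObject)
    {
        string name;
        lock (_gate)
        {
            if (!_addInByDomainId.TryGetValue(domain.Id, out name))
                name = "(host or unknown)";
        }
        return string.Format("{0:o}\t{1}\t{2}\t{3}",
            DateTime.UtcNow, domain.FriendlyName, name, exceptionObject);
    }

    private void OnUnhandledException(object sender, UnhandledExceptionEventArgs e)
    {
        // Per the post, sender identifies the originating domain; fall back
        // to the current domain if the cast does not succeed.
        AppDomain domain = sender as AppDomain ?? AppDomain.CurrentDomain;
        File.AppendAllText(_logPath,
            Describe(domain, e.ExceptionObject) + Environment.NewLine);
    }
}
```

Keying the map by AppDomain.Id rather than by the AppDomain object is a design choice made here for simplicity; it relies on each add-in getting its own domain, as the post assumes. On the next launch, a host could read the log back to flag add-ins that crashed previously.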

The most interesting part of this sample is the fact that it only works for hosts running in the default AppDomain. This is fine for exe-based hosts but isn’t so great for others, and I have to admit it is not the way I originally intended to write this. The semantics of the UnhandledException event are actually a little odd and pushed me down this path. If you subscribe to this event on the default AppDomain, it will fire whenever any exception in the process goes unhandled, and you need to cast the “sender” parameter to AppDomain in order to discover the originating domain. If you subscribe to the event on other domains, it will fire only for exceptions that go unhandled on threads originating in that domain. This is almost what you want in the case of add-ins, except that the delegate fires in that AppDomain and requires that the subscribing type be loaded in that domain. This of course is generally frowned upon in the context of a host, as it means loading host code in the add-in’s domain (there are still valid ways to do this, but you have to be very careful). I’ll post a sample showing how to subscribe to this event from non-default domains down the line, but for now most hosts will be able to use the patterns in this sample to detect failing, AppDomain-isolated add-ins and take appropriate actions.

 

Note: The attached sample was built for a pre-RTM version of .NetFX 3.5 and will not work on the RTM build. For an updated sample, please see our CodePlex site: https://www.codeplex.com/Release/ProjectReleases.aspx?ProjectName=clraddins&ReleaseId=9455

ReliableExtensibleCalculator.zip