Using AppDomain Isolation to Detect Add-In Failures [Jesse Kaplan]


One of the nice aspects of isolating add-ins in a separate AppDomain is that it makes it easier to prevent failures in an add-in from impacting the host, and at the same time improves your ability to detect when a problem has occurred and which piece of code was the culprit. Developers of extensible applications often quickly discover that a major source of support calls is actually bugs in the add-ins rather than in the application itself; being able to identify the failing add-in is therefore an important part of many extensible applications.


What an application does once it has detected a failing add-in is up to the individual developer, but the pattern for detecting these failures will largely remain the same. There are three major ways an add-in can cause problems for a host: machine state corruption, unhandled exceptions, and resource exhaustion. Machine state corruption can be addressed by sandboxing add-ins with a limited permission set and thus isolating them from the machine. We’ll cover resource exhaustion at a later point, but today’s focus will be unhandled exceptions.


There are actually two types of exceptions that a host has to be wary of: those thrown by the add-in during a call into it by the host, and those thrown by add-in code on threads originating from the add-in itself. The first class is easier: all a host has to do is put a catch block around calls into the add-in and then decide how to deal with the exception. Unhandled exceptions on add-in originated threads are harder because the host isn’t on the stack and can’t catch them. Starting with CLR v2.0, an unhandled exception on a child thread causes the entire process to be torn down, so it is impossible for a host to completely recover. With a little work, though, the host can detect which AppDomain caused the problem (and, assuming it gives each add-in its own domain, which add-in), log the failure before exiting, and even restart.
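For the first class of exceptions, the catch block can be centralized so every call into an add-in goes through the same guard. The following is a minimal sketch, not code from the attached sample: the GuardedCall class, the TryInvoke method, and the onFailure callback are hypothetical names chosen for illustration.

```csharp
using System;

// Hypothetical helper (not part of the attached sample): wraps a single call
// into an add-in so that an exception thrown during the call is reported
// to the host instead of propagating up the host's stack.
public static class GuardedCall
{
    // addInName: label the host uses to identify the add-in in its logs.
    // call:      the call into the add-in, e.g. () => calculator.Operate(...).
    // onFailure: invoked with the add-in name and the exception if the call throws.
    // Returns true when the call completed; false when the add-in threw.
    public static bool TryInvoke<T>(string addInName, Func<T> call,
                                    Action<string, Exception> onFailure, out T result)
    {
        try
        {
            result = call();
            return true;
        }
        catch (Exception ex)
        {
            // The host is on the stack here, so it knows which add-in failed.
            result = default(T);
            onFailure(addInName, ex);
            return false;
        }
    }
}
```

A host would pass a lambda that performs the actual call into the add-in and decide in onFailure whether to log, disable, or unload it. This guard does nothing for exceptions on threads the add-in starts itself, because the host's try block is not on those stacks.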


In the attached sample we use the AppDomain.UnhandledException event and the AddInController class to add failure detection logic to our existing calculator sample by logging failures to disk and tagging add-ins that have previously crashed as potentially unreliable. In this sample most of the interesting work is done in the UnhandledExceptionHelper class inside the host’s project. Each time the host activates an add-in it calls into this class, passing in the instance it receives back from AddInToken.Activate. The UnhandledExceptionHelper class then uses the AddInController class to associate that add-in with the AppDomain it was created in, and it detects and logs failures in that domain as being caused by the add-in activated in it.


The most interesting part of this sample is the fact that it only works for hosts running in the default AppDomain. This is fine for exe-based hosts but isn’t so great for others, and I have to admit it is not the way I originally intended to write this. The semantics of the UnhandledException event are actually a little odd and pushed me down this path. If you subscribe to this event on the default AppDomain it will fire whenever any exception in the process goes unhandled, and you need to cast the “sender” parameter to AppDomain in order to discover the originating domain. If you subscribe to the event on other domains it will fire only for exceptions that went unhandled on threads originating in that domain. This is almost what you want in the case of add-ins, except that the delegate fires in that AppDomain and requires that the subscribing type be loaded in that domain. This of course is generally frowned upon in the context of a host, as it means loading host code in the add-in’s domain (there are still valid ways to do this but you have to be very careful). I’ll post a sample showing how to subscribe to this event from non-default domains down the line, but for now most hosts will be able to use the patterns in this sample to detect failing, AppDomain-isolated add-ins and take appropriate actions.
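The pattern described above can be sketched roughly as follows. This is not the UnhandledExceptionHelper class from the attached sample: the AddInFailureLogger name, its members, and the tab-separated log format are assumptions made for illustration. It relies on the behavior described in this post, namely that a handler registered on the default AppDomain receives the originating domain as its sender.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch: maps each add-in's AppDomain to a name and logs any
// unhandled exception against the add-in whose domain it came from.
public sealed class AddInFailureLogger
{
    private readonly object _sync = new object();
    private readonly Dictionary<int, string> _addInByDomainId = new Dictionary<int, string>();
    private readonly string _logPath;

    public AddInFailureLogger(string logPath)
    {
        // The process-wide notification only applies to the default domain.
        if (!AppDomain.CurrentDomain.IsDefaultAppDomain())
            throw new InvalidOperationException("Create this helper in the default AppDomain.");

        _logPath = logPath;
        AppDomain.CurrentDomain.UnhandledException += OnUnhandledException;
    }

    // Call once per activation with the domain the add-in was created in.
    public void Track(AppDomain addInDomain, string addInName)
    {
        lock (_sync) { _addInByDomainId[addInDomain.Id] = addInName; }
    }

    // Returns the tracked add-in name for a domain, or null if none is tracked.
    public string FindAddIn(AppDomain domain)
    {
        lock (_sync)
        {
            string name;
            return _addInByDomainId.TryGetValue(domain.Id, out name) ? name : null;
        }
    }

    private void OnUnhandledException(object sender, UnhandledExceptionEventArgs e)
    {
        // Cast sender to discover which domain let the exception escape.
        AppDomain failing = sender as AppDomain;
        string addIn = failing == null ? null : FindAddIn(failing);

        // Log before the process is torn down; the host cannot prevent the exit.
        string line = string.Format("{0:u}\t{1}\t{2}{3}",
            DateTime.UtcNow, addIn ?? "(host or untracked domain)",
            e.ExceptionObject, Environment.NewLine);
        File.AppendAllText(_logPath, line);
    }
}
```

A host using System.AddIn could call Track right after AddInToken.Activate returns, passing AddInController.GetAddInController(addIn).AppDomain and the token's Name. On the next start, it could read the log to decide which add-ins to flag as potentially unreliable. The dictionary is keyed by AppDomain.Id rather than by the AppDomain reference so that lookups do not depend on how cross-domain references compare.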


 


Note: The attached sample was built for a pre-RTM version of .NetFX 3.5 and will not work on the RTM build. For an updated sample please see our codeplex site here: https://www.codeplex.com/Release/ProjectReleases.aspx?ProjectName=clraddins&ReleaseId=9455

ReliableExtensibleCalculator.zip

Comments (5)

  1. Udi Dahan says:

    When you say that other hosts don’t support this model, are you including IIS? If so, which version? 6? 7? What about WAS? What about the new Workflow Foundation runtime?

    We use AppDomains for the cases when we need the ability to unload assemblies at runtime. But one of the issues we’re dealing with is communication from that AppDomain to data that belongs to the primary AppDomain. The threading issue you mentioned above is extremely important to get right in this scenario.

    Any advice or patterns you can point out would be most helpful.

  2. The sample above only works when the host code is running in the default AppDomain: this excludes cases where the host is running in places like IIS or WAS.

    There are ways to get the same functionality in cases where the host isn’t running in the default domain. Please stay tuned…

  3. Last time we discussed some issues to be aware of when trying to build hosts that are resilient to failures

  4. Michael Vainer says:

    Jesse, if I understand you correctly, there is no way to completely recover from an error on an add-in child thread, only to log the exception? Will this issue be handled in .Net 3.5? It is very important to keep the system running even though one of the add-ins is malfunctioning.

  5. Alan Parker says:

    We have a Windows service that calls on add-in assemblies to do work. These assemblies are loaded in their own AppDomain and can start their own threads, which could throw, but I am able to catch those unhandled exceptions in both the child AppDomain’s and the default AppDomain’s UnhandledExceptionEventHandler.

    However, I need the service to recover from this. I was hoping to just unload the child domain and start it up again. You suggest that there is a way to restart the application. How would one do that? Any suggestions for dealing with this scenario so we can keep the service running for a long time? How does the IIS WP keep itself alive in a similar situation (or does it also unload)?