Add-In Performance: What can you expect as you cross an isolation boundary and how to make it better [Jesse Kaplan]

02/22/2008

Aren’t AppDomains too Slow?

“Aren’t AppDomains too slow?” This is one of the most common questions we get when we recommend AppDomain or Process isolation to someone building an extensible application. For some applications the answer is yes, which is why we make it easy to activate without isolation as well, but for many more applications you’ll find that the performance hit you take by moving to AppDomain or Process isolation is well within the acceptable range for your requirements. In this article we’re going to talk about the level of performance you can expect in different scenarios and what you can do to your application, object model, and pipeline to speed things up where necessary. This article is all about optimizing communication once everything is up and running; the next one will discuss the performance of getting everything up and running using different levels of isolation.

About the Numbers

All of the numbers I’m providing here are only designed to give you a rough idea of what to expect. They were recorded in an unscientific manner (read: on my main box with lots of other applications running) and I typically saw a range of ±10% across test runs. The application was compiled in VS in retail mode and was run without a debugger. Nothing was NGEN’d and only the contract assembly was installed into the GAC.

Baseline for Performance Expectations

Before we get into details on the guidance we should establish the baseline of what type of performance you can expect with different types of calls across an AppDomain boundary without any interesting optimizations. For simple calls that pass and return primitive types you should see between 2000 and 7000 calls per second across an AppDomain boundary. For calls that involve passing or returning a type derived from IContract across the boundary you should see about 1000 calls per second. These numbers decrease by about 30% when you switch from an AppDomain to a Process boundary.

This means that even without any interesting optimizations you should be pretty comfortable with crossing the boundary many times in response to any sort of user input and shouldn’t have to worry about the user noticing slow-down caused by the isolation boundary.

General Guidance on Performance across Isolation Boundaries

There are two basic strategies you should utilize in reducing the impact the isolation boundary has on your application:

1. Maximize the speed of crossing your particular boundary

2. Reduce the number of times you cross the boundary

Maximize the Speed of Crossing the Boundary

The key insight you need to understand this bullet is that when using System.AddIn to cross isolation boundaries there are three different remoting paths that you might be taking across those boundaries:

1. Cross-Process

2. Cross-Domain

3. Cross-Domain FastPath

If you are going across process boundaries then you don’t have much choice of path, and your main goal, when you need to increase performance at the boundary, is to minimize the number of times you need to cross it. On the other hand, if you are only using AppDomain isolation then you have more options. There is a good chance that this is the first time you have heard of the “Cross-Domain FastPath”: the FastPath isn’t exposed as an API or really documented extensively. The FastPath boils down to a highly optimized shortcut across AppDomains which bypasses as much of the remoting stack and bookkeeping as possible. The runtime uses this shortcut whenever it can quickly prove that a given call does not require any of those remoting facilities.

There are a few things you need to do to enable the FastPath for your application, but once you enable it the runtime will choose between it and the standard cross-domain path on a per-transition basis depending on the type of call you are making. Therefore your goal, if you need to increase your performance crossing an AppDomain boundary, is to first enable the FastPath for your app and then to try to stay on it for as many of the performance-critical transitions as makes sense.

To enable the fast path you need to make sure that the runtime shares your Contract assembly between your host and add-in AppDomains. To do this you need to mark the Main method of your exe with [LoaderOptimization(LoaderOptimization.MultiDomainHost)] and install your Contract assembly in the GAC. The “MultiDomainHost” loader optimization tells the runtime to share all assemblies that are installed in the GAC between AppDomains. If you have a native executable and are using the native hosting APIs to start up the runtime you can still enable “MultiDomainHost”: see these docs for more info on exactly how to do it.
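As a minimal sketch of the host side of that setup, the attribute goes directly on the entry point. The class name HostProgram is hypothetical, and the reflection check in the body is only there to show that the attribute is attached; installing the Contract assembly in the GAC is a separate step that this snippet does not perform.

```csharp
using System;
using System.Reflection;

// Hypothetical host entry point; only the attribute usage comes from the article.
public static class HostProgram
{
    // Ask the runtime to share GAC-installed assemblies between AppDomains.
    [LoaderOptimization(LoaderOptimization.MultiDomainHost)]
    public static void Main()
    {
        // Read the attribute back via reflection to confirm it is on Main.
        MethodInfo main = typeof(HostProgram).GetMethod("Main");
        LoaderOptimizationAttribute attr = (LoaderOptimizationAttribute)
            Attribute.GetCustomAttribute(main, typeof(LoaderOptimizationAttribute));

        Console.WriteLine("Loader optimization on Main: " + attr.Value);

        // Add-in discovery and activation would follow here in a real host.
    }
}
```

The attribute only takes effect on the entry point of the process, so placing it on any other method will not change how assemblies are shared.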

Once the fast path is enabled you need to consider which of the calls you make will actually use it. Basically you will only stray from the fast path on calls that involve either passing or returning an IContract: IContracts take you off the FastPath because they rely on MarshalByRefObject, which actually does require a hefty chunk of the remoting bookkeeping. Nearly all calls involving only serializable types will stay on the fast path. The main restriction for keeping serializable objects on the FastPath is that they are marked serializable using the [Serializable] attribute. If you instead implement ISerializable you will end up using the normal path. Serializable structs will also stay on the fast path and thus are a good option for passing large amounts of data across the boundary when by-value semantics are appropriate.
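The rules in the paragraph above can be sketched as two type shapes and a small check. Everything named here (SamplePoint, CustomPayload, FastPathHints.LooksFastPathFriendly) is hypothetical and exists only for illustration; the check simply encodes the article's rule of thumb and is not an API the runtime exposes, nor a guarantee of which path a call takes.

```csharp
using System;
using System.Runtime.Serialization;

// Attribute-only serialization: per the rule above, this shape can stay on the fast path.
[Serializable]
public struct SamplePoint
{
    public int X;
    public int Y;
}

// Custom serialization via ISerializable: per the rule above, this uses the normal path.
[Serializable]
public sealed class CustomPayload : ISerializable
{
    public int Value;

    public CustomPayload() { }

    // Deserialization constructor required by the ISerializable pattern.
    private CustomPayload(SerializationInfo info, StreamingContext context)
    {
        Value = info.GetInt32("Value");
    }

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("Value", Value);
    }
}

public static class FastPathHints
{
    // Heuristic only: marked [Serializable], no ISerializable, not MarshalByRefObject-based.
    public static bool LooksFastPathFriendly(Type type)
    {
        return type.IsSerializable
            && !typeof(ISerializable).IsAssignableFrom(type)
            && !typeof(MarshalByRefObject).IsAssignableFrom(type);
    }
}
```

With these definitions, FastPathHints.LooksFastPathFriendly(typeof(SamplePoint)) returns true and the same call with typeof(CustomPayload) returns false, which mirrors the distinction the paragraph draws.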

Reducing the Number of Times you Cross the Boundary

It is important to realize that the cost of crossing an AppDomain boundary does not grow substantially as the amount of data you pass across increases. For example, assuming no other optimizations, you can pass a byte[100000] in about twice the time it takes to pass a byte[10]; in this case a 10,000-fold increase in data only halves the number of calls you can make per second. If you want to look at a MB per second measurement, passing bytes across in byte[10] chunks leaves you with a rate of about 0.029 MB/sec and passing them in byte[10000000] chunks leaves you with a rate of about 406 MB/sec. If you take advantage of the fast path those numbers change to 0.75 MB/sec with byte[10] and about 560 MB/sec with byte[10000000].
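To relate the MB/sec figures above back to call rates, a rough conversion helps. The sketch below assumes 1 MB = 1,000,000 bytes (the measurements do not say which definition was used), and BoundaryMath is a hypothetical helper that is not part of the sample.

```csharp
using System;

// Hypothetical helper for converting between call rate and data throughput.
public static class BoundaryMath
{
    // Assumption: 1 MB = 1,000,000 bytes.
    private const double BytesPerMegabyte = 1000000.0;

    // Throughput achieved when each call carries bytesPerCall bytes.
    public static double MegabytesPerSecond(double callsPerSecond, long bytesPerCall)
    {
        return callsPerSecond * bytesPerCall / BytesPerMegabyte;
    }

    // Call rate implied by a measured throughput and a fixed payload size.
    public static double CallsPerSecond(double megabytesPerSecond, long bytesPerCall)
    {
        return megabytesPerSecond * BytesPerMegabyte / bytesPerCall;
    }
}
```

Under that assumption, BoundaryMath.CallsPerSecond(0.029, 10) works out to roughly 2,900 calls per second, which sits inside the baseline range quoted earlier for simple calls, while BoundaryMath.CallsPerSecond(406, 10000000) is only about 40 calls per second. The throughput gain comes almost entirely from making fewer, larger calls.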

Now, there are some cases where passing data in byte[] form is very natural (image processing comes to mind) but in many cases you’ll want something that is much easier to consume on the other side. In these cases passing data across in the form of structs can be very useful. Serializable structs can use the fast path across AppDomains. This, combined with the fact that they can contain large amounts of data and can even be passed across efficiently in arrays, means they are very useful for passing large amounts of strongly typed data (as opposed to byte arrays) across boundaries with a minimal impact on performance. The biggest downside to structs when compared to byte arrays is that if you are unable to enable the fast path then the performance of passing large numbers of structs degrades significantly: it is still much faster than using standard IContract calls but it may not be appropriate for passing around very large amounts of data. In our sample you’ll see that we achieve 2.1 MB/sec for a struct[10] and 134 MB/sec for a struct[10000000] if the fast path is enabled. Falling off the fast path leaves us at 0.085 MB/sec for struct[10], 0.297 MB/sec for struct[100000], and a DNF (did not finish before I gave up on it) for struct[10000000].

Strings were the last method I used for passing data across the boundary and led to some of the more interesting results: with very large strings my test bed was reporting “infinity bytes per second.” Increasing the size of the string did not increase the time it took to pass it across the AppDomain boundary regardless of whether I enabled the fast path or not. Think for a second about why that may be. As it turns out, strings are special cased by the runtime and shared across all AppDomains in the process. Thus to pass one across the boundary all the runtime needs to do is perform a pointer copy, whose speed doesn’t depend on the size of the string being pointed to. This means that if you have a system where you need to pass large amounts of text around then you’ll typically not have to worry about the size of that text when considering the performance of your AppDomain isolation boundaries.

The Data

The data below was gathered using the sample. To run the tests for yourself you’ll just need to download the sample, build it, install the contract assembly into the GAC (gacutil -if [contract path]), and run the app outside of a debugger. There are comments in Program.cs as to how to modify it to test the different scenarios below. In addition to the tests below there are additional methods in the sample that you can experiment with.

You can find the sample on our codeplex site here.

Bringing it all Together

We’ve just gone into a lot of detail and discussed a lot of different options for increasing performance and now that that’s out of the way I think we can distil it into a simple cheat sheet:

DO: enable sharing of assemblies using the LoaderOptimization attribute.
This is the first step to enabling the fast path. Even if you do not install your contract assembly into the GAC (and thus fully enable the fast path) you will see a decrease in the activation time of your add-ins because this allows the Framework assemblies to be shared between domains.

DO: install the Contract assembly in the GAC when using AppDomain isolation.
This step along with the LoaderOptimization attribute will enable the fast path for your application and will provide dramatic performance increases for many of your calls across the AppDomain boundary. If you are exclusively using Process isolation this is not important.
Note: this step means that your installation program will require admin privileges.

Consider: passing data across boundaries using by-value semantics (structs, byte[], strings, etc.).
If you are passing large amounts of data between your host and add-in you should consider passing it across in one chunk using a by-value data type rather than in pieces using a deep IContract-based hierarchy. This approach is very natural in cases where an add-in needs to process and possibly modify the data: just pass it out to the add-in and have the add-in return the modified set. It can be more awkward if the data is dynamic and can change continuously, but using a system of events to notify the data consumers of changes can help mitigate this.
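The difference between the two shapes can be sketched by counting how many calls would cross the boundary in each. This is an in-process simulation under stated assumptions: no AppDomain or System.AddIn pipeline is involved, and CrossingCounter, ChattyData, and AddInWork are hypothetical names standing in for a contract and an add-in.

```csharp
using System;

// Tallies calls that would cross the isolation boundary in a real pipeline.
public sealed class CrossingCounter
{
    public int Crossings;
}

// Chatty shape: every property read, element read, and element write is one crossing.
public sealed class ChattyData
{
    private readonly double[] _values;
    private readonly CrossingCounter _counter;

    public ChattyData(double[] values, CrossingCounter counter)
    {
        _values = values;
        _counter = counter;
    }

    public int Count { get { _counter.Crossings++; return _values.Length; } }
    public double Get(int index) { _counter.Crossings++; return _values[index]; }
    public void Set(int index, double value) { _counter.Crossings++; _values[index] = value; }
}

public static class AddInWork
{
    // Chatty add-in: pulls and pushes each element individually.
    public static void DoubleInPlace(ChattyData data)
    {
        int count = data.Count;
        for (int i = 0; i < count; i++)
        {
            data.Set(i, data.Get(i) * 2);
        }
    }

    // Chunky add-in: receives the whole set by value and returns the modified set.
    public static double[] DoubleAll(double[] values, CrossingCounter counter)
    {
        counter.Crossings++; // the single call into the add-in
        double[] result = new double[values.Length];
        for (int i = 0; i < values.Length; i++)
        {
            result[i] = values[i] * 2;
        }
        return result;
    }
}
```

For an array of 1,000 values the chatty version records 2,001 crossings (one Count read plus a Get and a Set per element) while the chunky version records one, which is the trade the guidance above is pointing at.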

DO NOT: dismiss an isolation level without first giving it a try in the scenarios you care about.
Many times you’ll find that in a real application the performance hit you take by moving to an isolation boundary isn’t as large as you expect and is perfectly acceptable for your situation. In the cases where it is not, there are a few simple things you can do to make the experience better.

One of the things I hope you get from this article is that for many extensible applications an AppDomain boundary, or even a process boundary, is just fine without any performance tweaking, and that for those applications that need (or want) better performance there is a range of options for providing dramatic performance increases depending on your needs.
