Enterprise Level Deployment of VSTO Solutions: Deployment Design

It's been a while. I'm not going to go into the details of why at the moment, so let's dig right into the topic.

Generally I focus on the technical aspects of what you can do, without talking much about the other considerations that should go into making your deployment design decisions.  This post is going to be a little different.  I won't have any nuggets of deployment code or advice on how to achieve specific deployment actions; rather, I'm going to cover some of the details you might want to include in your evaluation before you decide "how" you will deploy your solution.  Essentially, this post is about how to go about designing your deployment: what factors you may have to consider and how they will affect your deployment implementation.

First, some terminology:  I have been known to throw around the following terms without really defining them in any consistent way, so I'm going to do so now.

Customer level solution (customer solutions, customer deployments):  Solutions in this class are products released to the general public.  As far as deployment goes, customer solutions are generally deployed through some kind of opt-in mechanism.  Customer solutions have constraints that stem from not always having control of the system the solution is installed on.

Small Business level solution (SMB solutions, SMB deployments):  Solutions in this class target user bases of 10 to 200 clients.  Deployment and installation state may be controlled by a central organization or individual.  SMB solutions generally have moderate to low-level infrastructure, such as an intranet and file servers reachable through UNC paths.

Enterprise level solution (enterprise solutions, enterprise deployments):  Solutions in this class target a set user base of 200 to 100,000 clients.  Heavy infrastructure is available, and both deployment and installation state are controlled by a central organization (the IT department) with dedicated resources for management and support.  Constraints at this level are often defined by issues such as infrastructure limits, machine state preservation and maintenance impact.

I'm going to focus on enterprise solutions and some of the specific details that an IT department may impose on deployment configuration and maintenance decisions.  For enterprise solutions, there are four phases of a solution's lifetime that should be considered before making deployment design decisions: initial rollout, solution maintenance (updating), emergency response and uninstall.

Initial Rollout:

When considering initial rollout, there are a number of factors to weigh.  One of them is the bandwidth-to-client metric (versus solution size) and the window of opportunity for rollout.  Simply put: how long is it going to take for every client to download the solution the first time?  Here are some numbers and math.

First, assumptions and caveats:

All numbers here are idealized; actual networking is never this clean.  The following tables assume that all of the data is hosted on a single machine and that the network bandwidth is the idealized outbound bandwidth of the publish server.  VSTO solutions may range in size from 100 KB to 50 MB or more, depending on any number of factors.  As a reasonable middle point I'm choosing 5 MB, on the assumption that most solutions will fall at or below this size.

Next, the formula:

Rollout Time = Amount of Data / Speed of Data

Rollout Time (minutes)  = ([number of installs] x [Installation Size]) / ([Bandwidth] x 0.125  x 60)

Bandwidth is generally stated in megabits per second (Mb/s), but we need to calculate in megabytes (MB) and minutes, hence the multiplication by 1/8 (0.125) and by 60.
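
To sanity-check the formula, here is a minimal Python sketch of it. The function name is my own, and rounding up to the next whole minute is an assumption about how the "approximate" figures in the tables were produced; with that rounding it reproduces them.

```python
import math

def rollout_minutes(installs, install_size_mb, bandwidth_mbps):
    """Idealized rollout time, in minutes, from a single publish server.

    bandwidth_mbps is in megabits per second: multiplying by 0.125 gives
    megabytes per second, and by 60 gives megabytes per minute.
    """
    total_mb = installs * install_size_mb
    mb_per_minute = bandwidth_mbps * 0.125 * 60
    return total_mb / mb_per_minute


# Solution-only install (5 MB) on a 100 Mb/s network, rounded up to whole minutes.
for clients in (200, 500, 1000, 5000, 10000):
    print(clients, math.ceil(rollout_minutes(clients, 5, 100)))
```

Changing install_size_mb to 15 (solution plus runtime and PIAs) or bandwidth_mbps to 1000 produces the other rows the same way.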

Finally, Data:

Solution only installation (no prerequisites)

100 megabit network

Client Machines | Network Bandwidth (Mb/s) | Installation Size (MB) | Approximate Rollout Time (minutes)
200 | 100 | 5 | 2
500 | 100 | 5 | 4
1000 | 100 | 5 | 7
5000 | 100 | 5 | 34
10000 | 100 | 5 | 67

1 gigabit network

Client Machines | Network Bandwidth (Mb/s) | Installation Size (MB) | Approximate Rollout Time (minutes)
5000 | 1000 | 5 | 4
10000 | 1000 | 5 | 7
50000 | 1000 | 5 | 34
100000 | 1000 | 5 | 67

However, these are "best case" rollout scenarios, since your clients may not yet have all of the prerequisites.  The following are common prerequisites that you may also have to install:

Office 2007 PIA redist:  6.82 MB

VSTOR 3.0 SP1 redist: 3.27 MB

Solution, Runtime and PIAs

100 megabit network

Client Machines | Network Bandwidth (Mb/s) | Installation Size (MB) | Approximate Rollout Time (minutes)
200 | 100 | 15 | 4
500 | 100 | 15 | 10
1000 | 100 | 15 | 20
5000 | 100 | 15 | 100
10000 | 100 | 15 | 200

1 gigabit network

Client Machines | Network Bandwidth (Mb/s) | Installation Size (MB) | Approximate Rollout Time (minutes)
5000 | 1000 | 15 | 10
10000 | 1000 | 15 | 20
50000 | 1000 | 15 | 100
100000 | 1000 | 15 | 200

What can we draw from these numbers?  Nothing specific: these values are based on idealized networking and don't take into account many factors that may have a greater impact on your deployment design.  What they do give us is a basic framework for further discussion.

Specifically, the bigger your organization is, or the less infrastructure it has, the more value there may be in breaking up the initial rollout into waves or separate deployment sites.  If you have an organization of 10k or more employees, there is a pretty high chance you can use a geographic division to save a lot of pain and handle the initial rollout in a controlled manner.

Let's move on and talk about some of the other factors that should influence the rollout.  One thing that comes to mind in particular is the prerequisites.  By default, VSTO publishes a Setup.exe file that is meant to pull prerequisites down to client machines that don't have them installed.  If the prerequisites are not present, you may want to roll them out first, and possibly even roll out a "dummy" solution that lets you track the number of successful prerequisite setups.  When you then push out the "real" solution, you only need to worry about the solution itself, not about problems caused by failures in the prerequisites.

Work habits are another factor you have to consider for the initial rollout.  You may consider pre-registering an add-in as a possible rollout mechanism, but if you do, you should be aware of the impact work habits can have.  Specifically, if your clients all connect at the same time, rather than staggered over a period of time, the peak bandwidth cost may be much higher.  The example graph below compares relative network usage against the time of initial connection of clients (the data is made up, but it demonstrates the basic idea).

In this example, the bandwidth needed to install the solution on all clients exceeds the capacity of the network infrastructure.  In both scenarios the bandwidth and the data served are the same, but the final results (and the resulting impact on users) differ because of usage patterns.

In Scenario A, client connections follow a natural bell curve between 8:00 and 8:30.  Only a couple of clients start the installation process at 8:00, more connect at 8:05, the highest number of clients (about 50% of the total) connect at 8:10, and then fewer clients connect as time progresses.

In Scenario B, all clients connect at the same moment (8:00).

[Figure: relative network usage versus time of initial client connection, Scenario A and Scenario B]

The key takeaway from this example is the effect that spreading out the install step can have.  In both cases, peak network usage causes a delay, but in one case the delay is at least 15 minutes long for some clients.  If your clients have common peak times (specifically around startup), you will want to reduce the potential impact your rollout will have.
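
To put rough numbers on the difference between the two scenarios, here is a minimal sketch that treats a single publish server as a first-in, first-out queue of data. This fluid-queue model is a simplification chosen for illustration (it ignores per-connection overhead and real TCP behavior), and the arrival counts are made up, like the data behind the graph. Capacity uses the 100 Mb/s figure from the rollout formula (750 MB per minute) and a 5 MB solution.

```python
def worst_case_wait_minutes(arrivals_per_minute, size_mb, capacity_mb_per_min):
    """Fluid first-in, first-out model of one publish server.

    arrivals_per_minute[i] is the number of clients that start downloading
    during minute i. Returns the longest time, in minutes, that any client
    waits from starting its download until the download completes.
    """
    backlog_mb = 0.0
    worst = 0.0
    for clients in arrivals_per_minute:
        backlog_mb += clients * size_mb
        if clients:
            # The last client to arrive this minute finishes when the backlog drains.
            worst = max(worst, backlog_mb / capacity_mb_per_min)
        # Serve one minute's worth of data (never below zero).
        backlog_mb = max(0.0, backlog_mb - capacity_mb_per_min)
    return worst


CAPACITY_MB_PER_MIN = 100 * 0.125 * 60  # 100 Mb/s, as in the rollout formula
SIZE_MB = 5

# Scenario B: all 1,000 clients connect at 8:00.
all_at_once = [1000]

# Scenario A (made-up numbers): connections spread from 8:00 to 8:30,
# with half of them at 8:10.
bell_curve = [0] * 31
for minute, clients in {0: 20, 5: 130, 10: 500, 15: 200, 20: 100, 25: 40, 30: 10}.items():
    bell_curve[minute] = clients

print(round(worst_case_wait_minutes(all_at_once, SIZE_MB, CAPACITY_MB_PER_MIN), 1))  # 6.7
print(round(worst_case_wait_minutes(bell_curve, SIZE_MB, CAPACITY_MB_PER_MIN), 1))   # 3.3
```

In this idealized model, spreading the same 1,000 installs over half an hour halves the worst wait, and clients arriving outside the 8:10 peak wait far less than that.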

Now that we've established some of the factors, let's talk about specific design choices you might make based on how these factors affect your deployment story.  If you're in an enterprise that is just beginning to roll out .NET 3.5 (or 3.5 SP1) and the VSTO prerequisites, you probably want to roll out the prerequisites ahead of time, so that when you finally roll out your solution code there are fewer kinks to work out in both steps.  Using a "test" add-in (specifically for the application you plan to customize) may help flush out any issues related to the initial prerequisite installations.

There are 2 basic paths you might take to propagate the solution out to your clients. 

The ClickOnce method may be used by forcibly installing the prerequisites (which are machine level and require administrative access) and then pushing a ClickOnce registration of the add-in to all user accounts (for documents, simply making the documents available is the equivalent of registering the add-in).  The first time your solution's users log on to a client machine and start the customized application (or open a customized document), ClickOnce pulls down the solution from the location you specified in the registration and stores it in the cache on that machine.  Depending on your users' habits, this can result in really bad installation peaks.  If you find that you might have this kind of peaking behavior, there are some specific options you can employ to reduce the rollout impact.  One option is to stagger the registration step: by introducing the registration in waves, you can reduce the "height" of the peaks to something that doesn't significantly impact your user base.  Another option, for particularly large rollouts, is to host the solution on different servers.  Doing this breaks up the load, but keep in mind that VSTO ClickOnce doesn't really have a good mechanism for load balancing these solutions.  Once an installation points to a specific deployment (publish) server, it will always point to that server, and an uninstall and reinstall on each client would be necessary to migrate client installations.
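
One way to stagger the registration step is to assign each user to a wave deterministically, so the same user always lands in the same wave no matter when or where the assignment script runs. The helper below is hypothetical (it is not part of VSTO or ClickOnce), and the account names are made up; how you actually push the registration for each wave (Group Policy, logon scripts, a software distribution tool) is left to your own infrastructure.

```python
import hashlib
from collections import Counter

def registration_wave(user_name, wave_count):
    """Map a user name to a wave number from 0 to wave_count - 1.

    SHA-256 of the normalized name keeps the assignment stable across runs
    and machines (Python's built-in hash() is randomized per process).
    """
    normalized = user_name.strip().lower().encode("utf-8")
    digest = hashlib.sha256(normalized).digest()
    return int.from_bytes(digest[:4], "big") % wave_count


# Hypothetical account names; in practice these would come from your directory.
users = [f"user{n:04d}" for n in range(1000)]
wave_sizes = Counter(registration_wave(user, 4) for user in users)
for wave in sorted(wave_sizes):
    print(f"wave {wave}: {wave_sizes[wave]} users")
```

Hashing spreads users roughly evenly across waves without keeping a separate assignment list; if you would rather split by site or department (for example, the geographic division mentioned earlier), replace the hash with that lookup.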

The vstolocal method is the alternative.  Similar to (or possibly using) an MSI installation, you copy the customization contents to a specific location on the machine and then push down a registration with the |vstolocal tag.  This method lets you be more specific about "when" the solution is installed.  With vstolocal, the "hard" work is all in getting the client computer properly configured; once you've achieved that, the subsequent rollout should just be a matter of "when".
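
For illustration, here is a sketch that generates a .reg file for a per-user vstolocal registration. The key layout (HKEY_CURRENT_USER\Software\Microsoft\Office\<application>\Addins\<add-in name> with Description, FriendlyName, LoadBehavior and Manifest values) is the standard VSTO add-in registration. The file:/// prefix on the manifest path follows the documentation for Windows Installer deployments, but treat the exact Manifest format as an assumption and verify it against the documentation for your VSTO version. The add-in name and install path are hypothetical.

```python
def vstolocal_reg_file(office_app, addin_name, manifest_path, friendly_name, description):
    """Return the text of a .reg file that registers a VSTO add-in for the
    current user, loading it from manifest_path via the |vstolocal suffix."""

    def reg_string(value):
        # .reg string values must escape backslashes and double quotes.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    key = "\\".join(["HKEY_CURRENT_USER", "Software", "Microsoft", "Office",
                     office_app, "Addins", addin_name])
    manifest = "file:///" + manifest_path + "|vstolocal"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"Description"={reg_string(description)}',
        f'"FriendlyName"={reg_string(friendly_name)}',
        '"LoadBehavior"=dword:00000003',  # 3 = load at startup
        f'"Manifest"={reg_string(manifest)}',
    ]
    return "\r\n".join(lines) + "\r\n"


# Hypothetical add-in name and install location.
print(vstolocal_reg_file("Excel", "ContosoAddin",
                         r"C:\Program Files\Contoso\ContosoAddin.vsto",
                         "Contoso Add-in", "Contoso order tracking"))
```

Because these are HKEY_CURRENT_USER entries, the file has to be applied in each user's context, for example with `reg import` from a logon script. Regedit normally writes version 5.00 files as UTF-16, so saving the text with that encoding is the safest choice.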

On trusting the solution:

In both cases, you may want to use certificate trust, since it allows you to trust the solution for all users on a machine.  If you do not use a trusted publisher certificate, each user will see a trust prompt the first time they run the solution on each client.  For example, Joe, who uses the customized application on Client A and Client B, would see the prompt on both machines, while Bill, who only uses it on Client B, would see the prompt only on Client B (so both of them see it on Client B).

Solution Maintenance:

Depending on the nature of your solution you may be planning regular updates, infrequent updates or no updates at all.  When considering the design of your deployment story, you should consider how updating is going to impact your usage scenarios.

When using vstolocal, updates are entirely based on a push model, where you have to push a new version of the manifests and executable files to each client machine.  Every update in this model will have pretty much the same impact as a solution-only initial rollout.

If you use ClickOnce, there are several options for managing updates.  ClickOnce itself lets you determine the frequency of update checks, but there are scenarios you may want to consider.  Generally the cost of checking shouldn't be a significant issue: when a check is made, only the deployment manifest is downloaded to the client.  Most manifests should be in the 6 KB to 10 KB range, which means that even on a 100 Mb/s connection, a 7 second delay would only occur if around 9,000 clients attempted to check at the same moment.  It's not the update check that is likely to cause problems; it's the times when an update actually exists to be downloaded (the amount of data, and therefore the bandwidth, is orders of magnitude larger).  The same network metrics and usage patterns discussed for the initial rollout can have a similar impact on your users' experience, so it's a factor you should consider in your design.
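
The manifest arithmetic is easy to check. This sketch assumes 1 MB = 1,024 KB and, like the rollout tables, idealized bandwidth from a single server; the function name is hypothetical.

```python
def burst_seconds(clients, size_kb, bandwidth_mbps):
    """Seconds to serve size_kb to every client at the same moment from one
    server with idealized bandwidth (bandwidth_mbps in megabits per second)."""
    total_mb = clients * size_kb / 1024     # assumes 1 MB = 1,024 KB
    mb_per_second = bandwidth_mbps * 0.125  # megabits -> megabytes
    return total_mb / mb_per_second


print(round(burst_seconds(9000, 10, 100), 1))     # 10 KB manifest: 7.0
print(round(burst_seconds(9000, 6, 100), 1))      # 6 KB manifest: 4.2
print(round(burst_seconds(9000, 5 * 1024, 100)))  # 5 MB update: 3600
```

The last line is the point of the paragraph: the same 9,000 clients pulling a 5 MB update would need about an hour of idealized bandwidth, compared with a few seconds for the manifests alone.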

One possible way to reduce spikes on "update days" is to stagger when update checks occur.  This prevents spiking, but it also delays the point at which all clients have the update, which can be problematic if older clients cause problems for newer clients when they share data sources (for example, your solution accesses a database and an update includes a database schema change that causes issues when older clients attempt to manipulate data).

Another potential flaw with staggering updates relates to a natural tendency toward "bunching".  This is best explained with an example.  Suppose profiling indicated that having every client download updates on the same day would over-tax the available resources.  To mitigate the problem, the initial rollout occurred over the course of a work week (5 days) and each client was set to check for updates every 7 days.  So each day of the week, 20% of your clients check for updates first thing in the morning (this assumes every client opens the application at least once every workday).

The problem appears whenever vacations or holidays occur.  On each day that users are not logged into client machines, the checks scheduled for that day are pushed to the next day the users are back.  This ends up causing bunching (most likely around the middle of the week).  Here is a little example to demonstrate what this might look like:

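[Figure: share of clients checking for updates on each weekday over three weeks, with holidays in the second and third weeks]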

Initially, in the first week, all clients check in a consistent pattern and no spike in usage occurs.  However, in the second week, Monday and Friday are both holidays.  On Tuesday 40% of the clients check for updates, since the 20% from Monday were unable to check on their "designated" day.  Then in the third week, because Monday and the previous Friday were holidays, 60% of the users check for updates on Tuesday (and 0% check on Friday and Monday).
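
The bunching effect can be reproduced with a small simulation. It is a sketch under the same assumptions as the example: five groups of 20% that each start on a different weekday, a 7-day interval measured from the day a group actually checked, and a missed check happening on the next workday. Days are numbered from Monday of week 1 (day 0), and treating the Monday of week 3 (day 14) as a holiday is my reading of the third-week case; the function names are hypothetical.

```python
from collections import defaultdict

def is_workday(day, holidays):
    """Day 0 is Monday of week 1; days 5 and 6 of each week are the weekend."""
    return day % 7 < 5 and day not in holidays

def simulate_update_checks(groups, interval_days, holidays, horizon_days):
    """groups maps a name to (first day it checks, percent of clients).

    A group checks on the first workday on or after its due day, and its
    next check is due interval_days after the day it actually checked.
    Returns {day: percent of clients checking that day}.
    """
    due = {name: first_day for name, (first_day, _) in groups.items()}
    load = defaultdict(int)
    for day in range(horizon_days):
        if not is_workday(day, holidays):
            continue  # missed checks roll forward to the next workday
        for name, (_, percent) in groups.items():
            if due[name] <= day:
                load[day] += percent
                due[name] = day + interval_days
    return dict(load)


groups = {"Mon": (0, 20), "Tue": (1, 20), "Wed": (2, 20), "Thu": (3, 20), "Fri": (4, 20)}
holidays = {7, 11, 14}  # Mon and Fri of week 2, Mon of week 3
load = simulate_update_checks(groups, interval_days=7, holidays=holidays, horizon_days=21)
print(load[8], load[15], load.get(18, 0))  # Tue wk 2, Tue wk 3, Fri wk 3 -> 40 60 0
```

Feeding in your own holiday calendar and check interval is a quick way to see where bunching would land for your organization.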

One mitigation for this is to use a longer update interval (three weeks, for example), providing a bigger buffer between each "wave" of updates.  There are other ways to work around this behavior, such as using the ClickOnce API instead of automatic checking to control when update checks occur (similar to an earlier blog post of mine about creating a "check for updates" button).

Emergency Response:

While no one wants to ship a bug or have cases when a solution has been compromised by a virus or hacker, it is important to be aware of what you can do to mitigate the impact when these things occur.  By having a clearly designed response ahead of time, it may be possible to reduce the severity of the impact on your business operations by reducing the response time and window of opportunity for damage.

If you use vstolocal, there are two approaches you can take to mitigate a bad customization.  The first is to disable the solution, which can be done in two ways.  One way is to simply push the solution certificate into the Untrusted Certificates store.  One of the subtle behaviors of vstolocal is that trust of the solution is evaluated at every startup, so this works from an administrative (machine) level and doesn't require pushing any action out to specific users.

The other way is to disable the customization through the registry (set LoadBehavior to 2).  This stops the solution from starting up automatically and causing further harm, but it doesn't prevent users from re-enabling the solution via the COM Add-ins dialog box in Office.  Additionally, disabling the add-in this way requires pushing the change to every user on every client.

The second approach for vstolocal solutions is to simply force an uninstall of the solution.  This prevents the solution from being re-enabled by the user, but again requires a user-level action.  It also depends on good authoring of your deployment mechanism (I'm assuming this to be an MSI).

If you deployed using ClickOnce, you can't use the Untrusted Certificates store to block or disable existing installs (of the bad update), since security for ClickOnce solutions is only evaluated when the solution contents are copied into the cache.  However, you can push out commands to VSTOInstaller to force an uninstall, or alter the user-level registry to disable the solution (as described for vstolocal).

If your solution checks for updates on every run, though, you can push out a rollback by simply replacing the "bad" version with the version published before it.  Any users who had not updated will not see any effect (they will not have pulled down the "bad" update), and any users who installed the bad version will be rolled back to a working version.

The last option is to write some code in the initialization of the customization that checks for a "halt" registry key in the HKLM registry hive.  This allows you to push out a single key to every client machine, disabling the solution for every user on that machine... but it does require designing this functionality into the solution from the start.
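
The halt check itself is only a few lines of logic. A VSTO customization would implement it in .NET, so the Python below is purely a sketch of the decision logic: the key path and value name are hypothetical, and the registry reader is passed in so the logic can be tested without a registry. A missing key means "run normally", so machines without the key are unaffected.

```python
import sys

# Hypothetical key and value; choose your own under HKLM\SOFTWARE.
HALT_KEY_PATH = r"SOFTWARE\Contoso\ContosoAddin"
HALT_VALUE_NAME = "Halt"

def should_halt(read_machine_value):
    """Return True if the machine-wide halt flag is set to 1.

    read_machine_value(key_path, value_name) returns the stored value or
    raises OSError when the key or value does not exist.
    """
    try:
        value = read_machine_value(HALT_KEY_PATH, HALT_VALUE_NAME)
    except OSError:
        return False  # no halt key: start normally
    return value == 1

def read_hklm_value(key_path, value_name):
    """Read a value from HKEY_LOCAL_MACHINE (Windows only)."""
    import winreg
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, value_name)
    return value


if sys.platform == "win32" and should_halt(read_hklm_value):
    print("Halt flag set; skipping add-in startup.")
```

On 64-bit Windows, keep in mind that 32-bit processes (such as Office 2007) see a redirected view of HKLM\SOFTWARE, so push the key to the view your solution actually reads.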

Uninstall:

Preparing for the end of life of your solution is part of proper deployment design.  When you are looking at enterprise level solutions, you need to determine how responsive you want the uninstall process to be.  If you use vstolocal deployment, uninstall is achieved by removing the solution files and then forcing the deletion of the add-in registration for each and every user.

With a ClickOnce installation, there isn't a built-in mechanism for pushing an uninstall out to every user; however, VSTOInstaller does support an uninstall from the command line.  The command would be something of the form:

"%commonprogramfiles%\microsoft shared\VSTO\9.0\VSTOInstaller.exe" /Silent /U "{publishlocation...Addin.vsto}"

You probably want to use the silent flag specifically to prevent your users from blocking the uninstall process.
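
If your distribution tooling can run scripts, the command can also be launched programmatically. This is a sketch, not a tested deployment script: the publish URL is hypothetical, and on 64-bit Windows the runtime may live under the CommonProgramFiles(x86) location instead, so check where VSTOR is installed on your clients. Like the command above, it has to run in each user's context.

```python
import ntpath
import os
import subprocess
import sys

def vstoinstaller_uninstall_args(manifest_url, common_files=None):
    """Build the argument list for a silent VSTOInstaller uninstall.

    Passing a list (rather than one command string) means the space in
    "microsoft shared" needs no manual quoting.
    """
    if common_files is None:
        common_files = os.environ.get("CommonProgramFiles", r"C:\Program Files\Common Files")
    installer = ntpath.join(common_files, "microsoft shared", "VSTO", "9.0", "VSTOInstaller.exe")
    return [installer, "/Silent", "/U", manifest_url]


# Hypothetical publish location; use the one your add-in was installed from.
args = vstoinstaller_uninstall_args("https://deploy.example.com/ContosoAddin/ContosoAddin.vsto")
if sys.platform == "win32":
    completed = subprocess.run(args)
    print("VSTOInstaller exit code:", completed.returncode)
```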

Similar to handling an "emergency response" situation, you may simply want to bake in a registry check that causes the solution to uninstall itself.  The problem with this method, though, is that the registry key is left behind (assuming it is in the HKLM hive).

Another possible way to handle the final uninstall: publish an update that calls VSTOInstaller to uninstall the solution when the add-in shuts down.  This should work because Fusion copies the solution assemblies into a shadow cache during execution (although you may want to ensure that no references to data files are held open at the moment you call uninstall).

Thanks for Reading (and being patient enough to stay with me over a long break)

Kris