Managing Custom Solutions for Disaster Recovery SharePoint Farms

Having two SharePoint farms that need to be synchronised perfectly does raise the question of how to keep both farms up to date with any custom solutions that need to be applied to the farm/application. This guide is about how to maintain custom solutions across two or more farms and what can happen if it’s not done correctly. Disaster Recovery farms (DR farms) are the ultimate high-availability guarantee because they are completely independent by design, but they do need work to function correctly, especially if WSPs (custom solutions) are involved.

The short answer to managing custom solutions across a production + disaster recovery (DR) farm is that you need to deploy all solutions to both farms, albeit with features only activated on one. The rest of this guide is an experiment to see how the DR farm fails when deployments are either not done, or not done correctly.

For this experiment we’re building on the infamous Pet Store application built in Visual Studio, as demonstrated in an earlier article (linked in the next section). It has a custom master-page & default page and some list definitions. We’re going to add some more code-based items: some custom code on our default.aspx page for simplicity (instead of a visual web-part) and a list event receiver. Then we’re going to see how we should be deploying updates to these items, and finally what happens to our code & site when we don’t sync.

If you want to skip the part of this guide related to actually building the solution in Visual Studio, skip to “Setup Log-Shipping to Disaster Recovery Farm & Test” further below.

Building Example Custom-Code Elements

For the first stage we’re going to build two code items for our site for this demo: a page-embedded button to tell us the time, and an event-receiver that’ll attach to our lists to manage the data going into those lists. It’s simple and will demonstrate why some code works without a full deployment. As mentioned before, I’ll start where we left off in this article - https://blogs.msdn.com/b/sambetts/archive/2013/10/17/creating-a-clean-visual-studio-solution-from-a-sharepoint-2013-site-template.aspx.

We’re going to add code that’ll be deployed into the Global Assembly Cache of the SharePoint server(s) in question so we can test code at all levels – site ASPX, and GAC bound functionality.

Add Event Receiver

First let’s add our event-receiver as that’s probably the simplest to get started with. Add a new item to the Visual Studio solution – select “event receiver”.

image

Give it an appropriate name. Next we need to choose which events we’re going to receive – make sure you select “an item was added” (note the past tense there).

image

We’re just going to code an event that overrides the title of the item that’s been added. The code will be:

/// <summary>
/// An item was added
/// </summary>
public override void ItemAdded(SPItemEventProperties properties)
{
    properties.ListItem["Title"] = "PetStoreSalesReceiver Override";
    properties.ListItem.Update();
}

It’s quick & dirty and in fact completely useless as actual functionality, but it proves the code is firing whenever we add an item, whatever name we give it. That’s to say, if we add an item to a custom-list with title “test” and it appears with the title “test” (as you’d normally expect), then the event-receiver isn’t working.
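For anything less throwaway, it's usually worth stopping our own Update() call from setting off other receivers on the same list. The following is a minimal sketch of that, not the code used in the rest of this guide. The class name PetStoreSalesReceiver is an assumption (Visual Studio names the class after the item you added), and the code only compiles inside a SharePoint project that references Microsoft.SharePoint.dll.

```csharp
using Microsoft.SharePoint;

// Hypothetical class name; use whatever Visual Studio generated for your receiver.
public class PetStoreSalesReceiver : SPItemEventReceiver
{
    /// <summary>
    /// An item was added
    /// </summary>
    public override void ItemAdded(SPItemEventProperties properties)
    {
        base.ItemAdded(properties);

        // Suspend event firing so our own Update() doesn't trigger
        // ItemUpdating/ItemUpdated receivers registered on the same list.
        bool previous = EventFiringEnabled;
        EventFiringEnabled = false;
        try
        {
            SPListItem item = properties.ListItem;
            item["Title"] = "PetStoreSalesReceiver Override";
            item.Update();
        }
        finally
        {
            // Always restore the previous setting, even if Update() throws.
            EventFiringEnabled = previous;
        }
    }
}
```

The behaviour we're testing for is unchanged: a new item still ends up with the overridden title.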

Add Dynamic Code to Default.aspx

We’re going for the simplest test ever – just enough to see that our custom-code executed as expected so just a snippet to return the current time for this demo. In our default.aspx code we want these lines:

<h1>Code test</h1>
<asp:Button ID="btnUpdateTime" runat="server" OnClick="btnUpdateTime_Click" Text="What time is it?" />
<br />
<asp:Label ID="lblTime" runat="server" Text="Good question..."></asp:Label>

<script runat="server">
    protected void btnUpdateTime_Click(object sender, EventArgs e) {
        lblTime.Text = string.Format("It's {0}", DateTime.Now.ToLongTimeString());
    }
</script>

That should be it – page should be ready to go!

At this point you should have a solution that looks something like the following:

image

Modify Custom Template

We need to make sure our new feature activates on site creation by modifying the ONet.xml. As our code is site-collection level, we have to add the feature under “SiteFeatures” of our site-definition like so (in my example):

<!-- Code Features-->
<Feature ID="{GUID}" Name="Pet Store - Code" />

You’ll have to replace GUID with the ID of your new feature of course – find it in the properties of the feature:

image

Test Deploy the Solution

Let’s make sure we’re “ready for production” then…

It’s important to note that, for reasons too complicated to get into here, the event-receiver at least may not work unless its associated feature is activated for a new site-collection. So create a new site-collection and, assuming your ONet.xml change was valid, the new feature should’ve activated automatically.

Bear in mind that Visual Studio is used to working with existing sites, which often isn’t ideal: re-activating some features may give errors if a previous deployment wasn’t cleaned up properly (files that need to be copied may already exist, for example). There are of course ways around this, but for now you might want to just stop Visual Studio from trying to re-activate each feature on deploy: in the Visual Studio project properties, under the SharePoint tab, select “no activation” as the active deployment configuration.

Anyway, create a new site-collection from Central Administration again of type “Pet Store”. Assuming it all worked you should have a new site with the horrendous green-background again, but now with a nice button:

image

Pressing the button should execute our page-embedded code and tell us the time. Try it.

Setup Log-Shipping to Disaster Recovery Farm & Test

Now we know our solution works let’s now ship the content database to the DR farm and see what happens when we try and use the site without any solutions deployed there first. It won’t work of course so we’ll figure out why when it does die on opening the site.

The Website Died – File Not Found (HTTP 404)

Somewhat unsurprisingly loading the DR site didn’t work.

image

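This is because default.aspx doesn’t exist anywhere on the 2nd farm. It isn’t in the content database we copied over to the DR farm, because we’ve never changed the page on the site and so it has never been “ghosted” (copied) into the database, and it isn’t on the file-system because we haven’t installed the feature there. So really, the page isn’t found. (A note on terminology: SharePoint’s own documentation calls a customised page stored in the content database “unghosted” and a page served from the file-system “ghosted”; this guide uses “ghosted” loosely to mean “copied into the content database”.)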

Ghosting Site Pages

This is a handy moment to demonstrate how page ghosting works because what we can do here on the original farm is make edits to the pages in the WSP solution so that the copies are stored in the content database. Why? Well because those pages will be replicated to our DR farm and then we’ll have one less dependency to setup in the DR farm.

On the 1st farm, edit default.aspx and customTemplate.master in SharePoint Designer. Make any change you want just as long as it saves – you’ll need to edit the files in “advanced mode” which is normally available via the right-click menu (although I’ve had some difficulty always seeing that option for reasons that weren’t clear).

image

This message is basically saying “I’m going to have to copy this file to the content-DB as the file contents are now different from the default file – continue?” This is exactly what we want; this will ghost, or copy, the page to the content database, and it will therefore magically make its way to the DR farm on the next sync.

Now load the page.

The event handler ‘OnClick’ is not allowed on this page

You might notice, if you’ve been following the pet-shop guide in general, that after editing the page in SharePoint Designer, loading it will give you the above error message. This is because we’ve converted the page from a static file to a database-saved version that’s parsed & compiled slightly differently, even if to the user (and even the developer) the file behaves the same. The difference comes down to where the files come from: file-system static files can only be put there by farm administrators, whereas almost anyone “could” write an ASPX page into a content database. Both effectively inject “user” code (not Microsoft code) into the w3wp process, which has obvious security and reliability concerns, but only the file-system route is restricted to trusted administrators. Because of these concerns DB-based pages have stricter security defaults, which is why you might be seeing this error even though the page has barely changed.

No worries though – it’s just a case of configuring SharePoint to allow compilation of in-line code with this web.config modification - https://msdn.microsoft.com/en-us/library/bb862025(v=office.12).aspx

Note, there are known potential performance issues with allowing this and it’s not particularly good practice in general, as it allows in-line code to run server-side with full trust (which is why it’s off by default), but for the purposes of this demo, to show which code makes it over log-shipping and which doesn’t, we’ll enable it. More info @ https://support.microsoft.com/kb/2659203

Add a page-parser entry to the <PageParserPaths> element of the web application’s web.config (under <SharePoint>/<SafeMode>) to allow compilation at all levels (a genuinely bad idea normally):

<PageParserPath VirtualPath="/*" CompilationMode="Always" AllowServerSideScript="true" IncludeSubFolders="true"/>

Pages Ghosted to Content DB are Synced to DR Farm

So now we’ve ghosted the files we wanted, the site should load on the DR farm, right? Wrong – you’ll see the same error about “OnClick” on the DR farm because we didn’t apply the same web.config change there too. I deliberately omitted this step above to demonstrate that accidentally forgetting to apply web.config changes to both farms can kill the 2nd farm. This is a nice moment to mention that web.config differences really can undo all the hard work involved in setting up a DR farm.
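This kind of mismatch is easy to check for mechanically. Below is a minimal sketch, assuming you’ve copied the web.config contents from a server in each farm into two strings; the class and method names are made up for this example. It uses only the standard .NET XML classes and reports which PageParserPath entries the primary has that the DR farm doesn’t.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; not part of SharePoint.
public static class WebConfigCheck
{
    // Returns the PageParserPath entries (as "VirtualPath|CompilationMode|AllowServerSideScript")
    // found in the primary web.config but absent from the DR web.config.
    public static List<string> MissingParserPaths(string primaryXml, string drXml)
    {
        var drEntries = new HashSet<string>(ParserPaths(drXml), StringComparer.OrdinalIgnoreCase);
        return ParserPaths(primaryXml).Where(p => !drEntries.Contains(p)).ToList();
    }

    // Flattens every PageParserPath element in the file into a comparable string.
    private static IEnumerable<string> ParserPaths(string webConfigXml)
    {
        return XDocument.Parse(webConfigXml)
            .Descendants("PageParserPath")
            .Select(e => string.Join("|",
                (string)e.Attribute("VirtualPath") ?? "",
                (string)e.Attribute("CompilationMode") ?? "",
                (string)e.Attribute("AllowServerSideScript") ?? ""));
    }
}
```

Calling WebConfigCheck.MissingParserPaths(primaryXml, drXml) with the two file contents returns an empty list when the DR farm has every entry the primary has. The same idea could be extended to other web.config sections you care about.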

The Disaster Recovery Website Lives!

Once we’ve sorted out the web.config of the DR farm too you’ll see the page load now, although in the usual read-only mode. That’s despite not having deployed the solution WSP yet too; because the master-page and default page were ghosted in the content-database and that change has made it to the DR farm, we don’t need the original file anymore to render the page. If either file is missing of course then the page won’t render but both were changed (therefore ghosted) and have been shipped over to the DR farm. You’ll note, the button even works.

image

However Not All is Working – Event Receivers Aren’t Firing

Now at this point you might notice that not everything really is working. Specifically the event-handler we programmed; given the fact that our DLL hasn’t been deployed to the secondary farm then there’s no way it could work. Let’s test just to be sure.

To test our list event-handler we need read/write access to the database. That means breaking the log-shipping and enabling read/write access on the DR copy, so let’s do that and reload. In short, find the source database on the primary SQL Server & disable log-shipping. This will clean up the agent jobs and stop further updates. Now on the secondary, enable read/write mode (set single-user mode; restore with recovery; set read_write mode; set multi-user).

Now go to a list in the site – ‘inventory’ for example – and add a record. Notice how the new record isn’t changed to ‘PetStoreSalesReceiver Override’ as per our custom solution:

image

So that’s bad; we have a half-baked disaster-recovery farm where some code works and other code doesn’t, and it’s not obvious that anything’s wrong on the face of it. In reality of course, not having the code could mean the whole application fails to load as we’ll see in a minute. Notice by the way, the pink background is the change we made in SharePoint Designer so it’s obvious it’s been updated.

Add Realism: Add Solution Code to the Site Pages

In reality the above situation is fairly unrealistic as far as customisations go; it’s rare to have code in pages as it’s actually pretty terrible practice for many reasons. I did it only to highlight how some code and pages may work even without the dependent solutions installed on the secondary farm. Most SPDev projects I’ve seen have at least one or two ASCX resources, which adds a layer of complexity, so we should do the same.

So let’s go back to our 1st farm and add a custom-built user-control to our default page. We’re not trying for any code-awards here so we’ll add a user-control that tells us what time the page was loaded. Add a user-control to the project called ‘PageLoadedTime’ and code it as follows:

image

Notice how Visual Studio adds the control to the “Control Templates” folder. This means the file will never make it into the content-database and will be referenced from the file-system of the SharePoint server instead.

Now add it to the default.aspx page so we now have the following snippets – header:

<%@ Register Src="~/_controltemplates/15/PetStoreSolution/PageLoadedTime.ascx"
    TagPrefix="msdnDemo"
    TagName="PageLoadedTime" %>

…and then this as the body:

<p>This text is coming from my own default.aspx</p>
<h1>Embedded Code test</h1>
<asp:Button ID="btnUpdateTime" runat="server" OnClick="btnUpdateTime_Click" Text="What time is it?" />
<br />
<asp:Label ID="lblTime" runat="server" Text="Good question..."></asp:Label>
<script runat="server">
    protected void btnUpdateTime_Click(object sender, EventArgs e)
    {
        lblTime.Text = string.Format("It's {0}", DateTime.Now.ToLongTimeString());
    }
</script>
<br />
<!-- User control -->
<msdnDemo:PageLoadedTime runat="server" id="PageLoadedTime" />
<!-- END: User control -->

So here we have a page that could be coming from a content-database referencing a control that could never come from a content-database. This is completely normal, but it could also cause fatal load errors if the target control isn’t found.

On the primary farm it works a treat:

image

Re-Sync the Content Databases (Still Without a Deployment)

So let’s go back in time slightly, to where log-shipping was enabled & the DR farm was in warm-standby/read-only mode. With the updated content database synced across, we now want to load the default page on the DR farm, which previously worked fine, and see what happens.

The Website Died – File Not Found (Again)

No surprises: the disaster recovery site fell over again, giving us a slightly misleading file-not-found error, which is somewhat odd given that you’d see both the default.aspx page and the master-page in the site if you looked. The problem isn’t with the pages though.

image

Looking into why tells the story: the page won’t load because ASP.Net cannot find all of the page’s dependencies, in this case our recently added user-control, so the page-load dies.

Now Deploy the WSP to the Disaster Recovery Farm

To fix this situation let’s actually deploy the WSP to the DR farm. All it takes is to add the solution and then deploy it to the web-application in question.

image

So far so good – everything is working on the DR farm.
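If you’d rather script the “add, then deploy” steps than click through them, something like the following could work. Treat it as a sketch under assumptions: the class and method names are mine, wspPath and webAppUrl are placeholders for your own package path and web-application URL, and the code must run on a server in the DR farm with a reference to Microsoft.SharePoint.dll.

```csharp
using System;
using System.Collections.ObjectModel;
using Microsoft.SharePoint.Administration;

// Hypothetical helper; run it on a server in the DR farm.
public static class DrDeployment
{
    // wspPath and webAppUrl are placeholders supplied by the caller.
    public static void AddAndDeploy(string wspPath, string webAppUrl)
    {
        // Step 1: add the package to the farm's solution store.
        SPSolution solution = SPFarm.Local.Solutions.Add(wspPath);

        if (solution.ContainsWebApplicationResource)
        {
            var targets = new Collection<SPWebApplication>
            {
                SPWebApplication.Lookup(new Uri(webAppUrl))
            };

            // Step 2: schedule deployment to the web application,
            // allowing assemblies to be installed in the GAC.
            solution.Deploy(DateTime.Now, true, targets, false);
        }
        else
        {
            // Solutions without web-application-scoped resources deploy farm-wide.
            solution.Deploy(DateTime.Now, true, false);
        }
    }
}
```

Deploy only schedules a timer job, so the files won’t be on disk the instant the call returns. Note there’s no feature activation anywhere in the sketch.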

The Perils of Not Updating WSPs on Both Farms

In case it wasn’t obvious I’ll just touch on what could happen if for some reason WSPs and custom-solutions weren’t synced manually on-top of the content databases. It’s simple; if a page needs a resource or assembly that’s local on the file-system or global assembly cache then that page will die. Ghosted pages are synced via log-shipping; WSPs aren’t.

Also of course, as you’ve seen above it’s possible that the page will load but crucial event-receivers won’t fire because of a half-baked deployment. Depending on what that code does it could be critical too.

No Activating Features

Notice how we didn’t have to activate anything on the 2nd farm in order to get anything to work. In short, that’s because we activated the features already on the 1st farm and just copied over the content-database with its already activated site-collection.

Notice too that the user-control doesn’t need activation to work; it’s installed on the file-system of the SharePoint servers rather than in the content-database. That means just deploying the WSP is sufficient: the (content-database hosted) page references our user-control via the /_controlTemplates virtual path, which is file-system hosted, so each machine hosting the IIS sites needs a copy.

Disaster Recovery Farm Solution Syncing - Failure Points

We’ve seen basically what happens when a SharePoint farm doesn’t have all its dependencies because of an incomplete sync of some kind. So in summary, to avoid unexpected failures our farms need identical:

  • Content databases. SQL Server log-shipping takes care of that, and it’s worth monitoring for sync failures too, although that’s a bit too DBA-heavy for this article. Failures include:
    • Log-shipping unsuccessful for too long – old TRN files get deleted, breaking the log chain and making it impossible to restore the transaction-logs without a new backup/restore.
    • Pages left in an old state despite custom-solutions being updated – the mismatch in versions may cause a fatal load error.
  • Web.config changes. These can be kept in sync across the machines of a farm with the SPWebApplication.WebConfigModifications collection, but that collection is stored per farm, so it won’t carry over to the DR farm and both farms will need to be checked manually every so often to make sure there’ll be no web.config errors like the OnClick one above. Failures include:
    • Web.config mismatch – dependent custom code needs something in the web.config file that isn’t there, so it crashes. Whether it’s missing application settings or invalid assembly references, one small mismatch can be the difference between the DR farm falling over completely or not.
  • Custom solutions. You can reference static controls/web-services/etc in content-db pages; if those two worlds don’t match up because the page got updated but the static control didn’t, then that page won’t load. Failures include:
    • Same as a content-database sync failure – a mismatch of static/db-based code is likely to cause a fatal load failure.
    • Pages that haven’t been ghosted by a change won’t exist on the DR farm and will give an HTTP 404 when accessed.
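To illustrate the WebConfigModifications route mentioned above, here’s roughly how the page-parser entry from earlier could be registered through the object model instead of by hand-editing web.config. It’s a sketch rather than a recommendation: the class name and the Owner value are made up, webAppUrl is a placeholder, and because the collection lives in each farm’s own configuration you’d have to run the same code once on the primary farm and once on the DR farm.

```csharp
using System;
using Microsoft.SharePoint.Administration;

// Hypothetical helper; requires a reference to Microsoft.SharePoint.dll.
public static class ParserPathModification
{
    // webAppUrl is a placeholder for the web application to modify.
    public static void Apply(string webAppUrl)
    {
        SPWebApplication webApp = SPWebApplication.Lookup(new Uri(webAppUrl));

        var modification = new SPWebConfigModification
        {
            // Parent element the entry is added under.
            Path = "configuration/SharePoint/SafeMode/PageParserPaths",
            // XPath (relative to Path) that identifies the node uniquely.
            Name = "PageParserPath[@VirtualPath='/*']",
            // Owner is a free-text tag used to find our modifications later (made-up value).
            Owner = "PetStoreDemo",
            Sequence = 0,
            Type = SPWebConfigModification.SPWebConfigModificationType.EnsureChildNode,
            Value = "<PageParserPath VirtualPath=\"/*\" CompilationMode=\"Always\" " +
                    "AllowServerSideScript=\"true\" IncludeSubFolders=\"true\" />"
        };

        webApp.WebConfigModifications.Add(modification);
        webApp.Update();

        // Push the registered modifications out to the web.config files in this farm.
        webApp.Farm.Services.GetValue<SPWebService>().ApplyWebConfigModifications();
    }
}
```

The advantage over hand-editing is that every web server in the farm gets the same entry. The catch, as above, is that “the farm” means one farm only.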

Ultimately disaster recovery farms are not the most automatic things in the world and require a certain amount of careful management if they are to do their job correctly. All of this is necessary to ensure that when we need the disaster recovery farm to go live it can take over within minutes or even seconds of a failure on the 1st farm, and that instead of being presented with just a different error from a different farm, users barely realise anything’s changed.

Maintaining Disaster Recovery Farms – Tips

Easy: check regularly that you can failover without anyone noticing. There’s no reason why you couldn’t just hop-on to the DR farm even if the primary farm is operating just fine, just to double-check everything loads.

On-top of that, when running deployments to production always make sure the same scripts are run on the DR farm once the primary has been confirmed as OK.
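One way to double-check that both farms really did get the same deployments is to compare an inventory from each. The sketch below assumes you’ve already gathered, for each farm, a map of solution name to some version label you track yourself (a build number or a file hash, say); gathering that data is left out, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; not part of SharePoint.
public static class SolutionInventory
{
    // Compares two inventories of "solution name -> version label" and returns
    // one readable line per difference; an empty list means the farms match.
    public static List<string> Compare(
        IDictionary<string, string> primary, IDictionary<string, string> dr)
    {
        var problems = new List<string>();

        // Solutions the primary has: check they exist on DR with the same label.
        foreach (var entry in primary.OrderBy(e => e.Key, StringComparer.OrdinalIgnoreCase))
        {
            string drVersion;
            if (!dr.TryGetValue(entry.Key, out drVersion))
                problems.Add(entry.Key + ": missing on DR farm");
            else if (!string.Equals(entry.Value, drVersion, StringComparison.Ordinal))
                problems.Add(entry.Key + ": primary has " + entry.Value + ", DR has " + drVersion);
        }

        // Solutions that only exist on DR are suspicious too.
        foreach (var name in dr.Keys.Where(k => !primary.ContainsKey(k))
                                    .OrderBy(k => k, StringComparer.OrdinalIgnoreCase))
            problems.Add(name + ": on DR farm only");

        return problems;
    }
}
```

Running a check like this after each deployment would catch a forgotten DR deployment long before a failover does.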

Disaster recovery sites/farms are the ultimate tool in the SharePoint high-availability toolbox if done right. They offer the ultimate assurance that you’ll be able to provide service as close to 100% of the time as humanly possible, which ultimately is our job.

Cheers,

// Sam Betts