Site resilience with both DPM secondaries AND SCR targets...?

04/16/2008

Let's say I have two data centres and a requirement that, if the primary data centre is lost, I can restore the messaging service with minimal downtime and minimal data loss. The decision is made to use SCR. However, I have also deployed DPM for short-term, disk-based protection of the CCR replica databases in the primary data centre. The loss of the primary data centre could be the result of something like a flood, hurricane or bomb, and would therefore be pretty catastrophic. That leaves a risk, however minor, that when I activate my numerous SCR targets one or more of them does not mount successfully. At this point I'm in trouble, because my backup is in the primary data centre, which isn't available... Again, this assumes that I have no additional tape-based offsite backups, something many companies are trying to reduce their reliance upon.

So, as I see it, there are a couple of options to consider for protecting ourselves against this risk:

Protect the Primary DPM Servers in data centre 1 with Secondary DPM Servers deployed in the second data centre.
Back up the SCR targets.

So what are the pros and cons of these options? The main positive of using DPM secondaries is that you have a fully supported and documented approach to protecting your data, plus a managed replica in the second data centre of all the data held on the primary DPM servers. You have satisfied your main requirement. OK, so the data in the second site might be up to 6 hours out of date, but you haven't lost everything, and as mitigation against both a data centre failure and an SCR failure, being up to 6 hours behind is probably acceptable for the majority. Another advantage is some built-in resilience to the failure of a single DPM server, because the secondary DPM server can temporarily take over from the primary.

One main disadvantage is that in effect you are replicating all of your mailbox data twice: through both SCR and DPM replication. In large-scale deployments this could be a significant factor in your decision making. The other point that needs some consideration is what happens in the event of site failure. The secondary DPM servers hold all of the data from the primary DPM servers, meaning I have data that can be restored to recovery servers but no spare capacity to begin protecting the newly activated SCR targets. (This definitely needs testing. Can the DPM servers in the secondary data centre continue to protect the SCR targets and take changes only, i.e. can the data held on DPM be 'synchronised' with the new primary databases?) It is not currently documented that this is possible or, perhaps more critically, 'supported' (if it is, I can't find it!).

So, to reduce replication traffic, the other obvious solution is to back up the SCR targets. This would not necessarily replace protection of the CCR replica, because that mitigates different things such as logical database corruption, but it could replace the replication traffic associated with the primary-to-secondary DPM approach.
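To put the traffic argument in rough numbers, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that SCR log shipping and DPM primary-to-secondary replication each send about one copy of the daily changed data across the inter-site link; the function name and the 50 GB figure are hypothetical and not measured from any deployment.

```python
def cross_site_traffic_gb(daily_change_gb: float, dpm_secondary: bool) -> float:
    """Estimate daily inter-site replication traffic in GB.

    Assumption (illustrative only): SCR and DPM secondary replication each
    ship roughly one copy of the daily changed data between data centres.
    """
    scr_traffic = daily_change_gb                               # SCR log shipping
    dpm_traffic = daily_change_gb if dpm_secondary else 0.0     # DPM primary -> secondary
    return scr_traffic + dpm_traffic


if __name__ == "__main__":
    daily_change_gb = 50.0  # hypothetical daily change volume
    print("Option 1 (DPM secondaries):", cross_site_traffic_gb(daily_change_gb, True), "GB/day")
    print("Option 2 (back up SCR targets):", cross_site_traffic_gb(daily_change_gb, False), "GB/day")
```

Under that assumption the second option halves the inter-site traffic, because the backup of the SCR target stays local to the second data centre. Real DPM and SCR traffic will differ, so treat this only as a way of framing the comparison.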

So let's start by saying that it is not currently supported to protect the SCR targets via an 'online' VSS snapshot, as it is against a CCR replica for example. Therefore the mechanism would be to suspend transaction log replication to the SCR target, take an 'offline' backup using VSS and DPM, checksum the backup to verify its integrity, and then resume replication once the backup job is complete. There is an article, 'How to Verify a Standby Continuous Replication Copy', which walks you through the steps to use the command-line version of the VSS tool (VSSAdmin.exe) and 'Eseutil.exe /k' to perform a physical consistency check against a shadow copy of the SCR targets. The difference is that we would be taking a standard file-synchronisation-type backup using DPM, retained for however long we set our retention periods, rather than simply taking a temporary shadow copy for the purposes of a consistency check.
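The ordering of those steps matters, so here is a minimal sketch of how a wrapper script might sequence them. It is not a tested or supported procedure: the four callables are hypothetical placeholders for whatever actually suspends replication, runs the DPM/VSS backup, runs the checksum (for example a wrapper around 'Eseutil.exe /k') and resumes replication. The one design point it illustrates is that replication is resumed even if the backup or the checksum fails.

```python
from typing import Callable


def offline_backup_of_scr_target(
    suspend: Callable[[], None],      # hypothetical: suspend log replication to the SCR target
    backup: Callable[[], str],        # hypothetical: take the offline backup, return its location
    verify: Callable[[str], bool],    # hypothetical: checksum the backup, True if it is clean
    resume: Callable[[], None],       # hypothetical: resume log replication
) -> str:
    """Run suspend -> backup -> verify, and always resume replication afterwards."""
    suspend()
    try:
        backup_location = backup()
        if not verify(backup_location):
            raise RuntimeError(f"checksum verification failed for {backup_location}")
        return backup_location
    finally:
        # Resume even on failure so the SCR target does not fall further behind.
        resume()


if __name__ == "__main__":
    steps = []
    location = offline_backup_of_scr_target(
        suspend=lambda: steps.append("suspend"),
        backup=lambda: steps.append("backup") or "backup-001",
        verify=lambda loc: steps.append("verify") is None,
        resume=lambda: steps.append("resume"),
    )
    print(location, steps)
```

The stubs in the demo only record the order of the calls; in a lab they would be replaced by the real commands once the support position is clear.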

On the face of it, the second option does appear to be the more appealing. However, there are a number of major considerations that I would want to cover off first. The first is Microsoft's support stance, which I would want to clarify before anything else. The second is the process by which we restore a flat-file copy of an SCR target in the event of site failure and a subsequent SCR failure. I would want a bulletproof process, tested regularly, because the steps for restoring an offline copy of the last copy of a database and its corresponding transaction logs (offline logs plus possibly more current logs) are not as straightforward as is often thought and have caused numerous issues for support staff in the past. Nonetheless, given some more testing and clarification of these issues, I believe it is definitely an option worth pursuing...

With any luck I'll get some time to get some of this into a lab to test...
