Continuous Replication, Site Resilience & decision making processes...

The ability to continue providing a full service to your user community in the unlikely event of the loss of a datacentre is an increasingly common requirement. Continuous Replication (CCR and SCR) in Exchange 2007 Service Pack 1 is an obvious choice for providing data availability and site resilience. One advantage of this approach is that it removes the reliance on expensive storage replication solutions. In addition, a disaster recovery scenario can be managed by one team rather than several: in most cases the messaging team can restore service without intervention from the storage team or from a remote third-party hardware vendor, for example. Using Exchange Server data replication rather than storage replication also gives us more scope to use PowerShell scripts to help administrators simplify and control service and data recovery. Finally, application-based replication is generally better than hardware or storage-based replication at assessing the health of the data being replicated.

An example Exchange 2007 design using continuous replication is as follows:

[Figure: E2K7 Architecture Example]

With any design it is important to understand the processes and decision making that might be involved when certain scenarios present themselves. If we are designing for high availability, administrators need to understand what decisions might need to be made and which processes would be required should a particular set of circumstances occur. For example, what should the recovery strategy be in the event of the loss of a single mailbox database? Should the Exchange cluster group be moved to the passive node at this stage? If so, this would mean a temporary loss of service to all users on the server for the sake of those on one mailbox store.
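The trade-off in the single-database question can be made concrete with a little arithmetic. The Python sketch below is purely illustrative and is not part of the design or the flowcharts: the `MailboxServer` type, the function names, the mailbox counts and the `max_extra_ratio` threshold are all hypothetical assumptions. It simply compares the users already affected by the failed database with the users who would be additionally interrupted by moving the whole cluster group.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class MailboxServer:
    """Hypothetical model of a clustered mailbox server (illustration only)."""
    name: str
    users_per_database: Dict[str, int]  # database name -> mailbox count


def failover_impact(server: MailboxServer, failed_database: str) -> Tuple[int, int]:
    """Return (users already without service, users a cluster move would also interrupt)."""
    already_down = server.users_per_database[failed_database]
    total_users = sum(server.users_per_database.values())
    return already_down, total_users - already_down


def recommend_action(server: MailboxServer, failed_database: str,
                     max_extra_ratio: float = 1.0) -> str:
    """Suggest a path; max_extra_ratio is an assumed policy knob, not an Exchange setting."""
    already_down, extra = failover_impact(server, failed_database)
    # Only move the cluster group if the extra disruption is small relative
    # to the number of users who have already lost service.
    if extra <= already_down * max_extra_ratio:
        return "move cluster group to passive node"
    return "recover the single database in place"


# Example with made-up numbers: four databases of 500 mailboxes, one has failed.
server = MailboxServer("CMS1", {"DB1": 500, "DB2": 500, "DB3": 500, "DB4": 500})
print(failover_impact(server, "DB1"))
print(recommend_action(server, "DB1"))
```

In practice the decision would also depend on factors this sketch ignores, such as how long each recovery path takes and the health of the passive copy, so the threshold should be read as a prompt for discussion rather than a rule.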

The following flowcharts show the likely processes and decision making flow that might be involved in certain disaster recovery situations based on the above Exchange 2007 design.

Total Site Failure - Likely steps & decision making process in recovering from total physical site failure

[Figure: E2K7 Flowcharts - Site Failure]

Single Server Failure - Likely steps & decision making process in recovering from single server failure

[Figure: E2K7 Flowcharts - Server Failure]

Single Database Failure - Likely steps & decision making process in recovering from single active database failure

[Figure: E2K7 Flowcharts - Database Failure]

These decision matrices do not provide a definitive answer, and there are often numerous possible recovery paths in any given disaster recovery scenario. However, they do highlight the decisions that are likely to be made and the importance of understanding the processes an administrator might have to follow to recover service and data for their user community.