Recovery Scenarios for E2K7…..I

This is the first of a few blogs about designing for availability and resilience…  WHAT FAILURES MIGHT OCCUR AND HOW DO WE CHOOSE THE RIGHT DESIGN TO PROTECT US?

In the very early stages of a messaging design, particularly once discussions about availability and resilience begin, it is very useful to understand the types of issue that support teams are likely to face and how your proposed design stacks up against them.

EXAMPLE DESIGN

So first I need an example design. For the purposes of this blog I am using a pretty standard Exchange 2007 design based on CCR\SCR (Cluster Continuous Replication and Standby Continuous Replication) across 2 data centres. The design is best described on TechNet under ‘Site Resilience Configurations’ (see the section ‘Production (Non-Dedicated) with One Active Directory Site’: “This solution deploys redundant servers in a single Active Directory site that spans both datacenters.”).

[Figure: site resilience configuration diagram, from TechNet]

I’m also using DPM (Data Protection Manager) for VSS-based backups to disk, with long-term backups to tape media, and there is a requirement to journal all messages to satisfy compliance regulations.

WHAT MIGHT GO WRONG?

The scenarios I’m going to base this on are as follows:

Data Centre Failure: The loss of an entire data centre
Server Hardware Failure: Component failure e.g. motherboard
Storage Failure: Loss of access to all or part of a volume\LUN (not including single disk failure)
Mailbox Database Corruption (Physical): Most likely as a result of hardware failure
Mailbox Database Corruption (Logical): Data corruption, possibly as a result of a faulting application or a virus
Mailbox Deletion within Deleted Mailbox Retention period (<30 days): A result of an administrative or procedural error
Mailbox Deletion beyond Deleted Mailbox Retention period (>30 days): A result of an administrative or procedural error, or a returning employee
Email or Item Deletion (<14 days): User mistakenly deleted an item; administrator intervention required only if the item was hard deleted
Email or Item Deletion (>14 days): User mistakenly deleted an item; administrator intervention required
Identify if and when a particular email was sent\received (<30 days): Only message route required
Identify if and when a particular email was sent\received (>30 days): Only message route required
Identify if and when a particular email was sent\received (<14 days): Entire message required
Identify if and when a particular email was sent\received (>14 days): Entire message required

HOW DOES MY PROPOSED DESIGN PROTECT ME?

The following table takes the above scenarios and identifies which part of the design protects against each one. This first pass should help us to understand what might fail, what protection the design provides, the likelihood of the scenario occurring and the impact of that event.

| Scenario | Mitigation | Impact (Worst case) | Estimated Recovery Time | Likelihood |
|---|---|---|---|---|
| Data Centre Failure | SCR (& redirection of network traffic\email) | Temporary loss of service to all users during presentation of SCR targets; minimal data loss | <2 hours* | Very low |
| Server Hardware Failure | CCR | Temporary loss of service to all users on a single Exchange Server during cluster failover | <15 minutes | Moderate |
| Storage Failure | CCR (single disk failure mitigated by RAID) | Temporary loss of service to all users on a single Exchange Server during cluster failover | <15 minutes | Moderate |
| Mailbox Database Corruption (Physical) | CCR** | Temporary loss of service to all users on a single Exchange Server | <15 minutes | Low |
| Mailbox Database Corruption (Logical) | DPM restore from disk | Temporary loss of service to all users on a single mailbox database | <2 hours | Low |
| Mailbox Deletion within Deleted Mailbox Retention period (<30 days) | Deleted Mailbox Retention*** | Temporary loss of service to a single user\temporary loss of all data | <15 minutes | High |
| Mailbox Deletion beyond Deleted Mailbox Retention period (>30 days) | DPM restore of database from tape | n\a | <8 hours | High |
| Email or Item Deletion (<14 days) | Deleted Item Retention**** | Loss of single\multiple items for a single user | <15 minutes | High |
| Email or Item Deletion (>14 days) | DPM restore of database from tape | n\a | <8 hours | Moderate |
| Identify if and when a particular email was sent\received (<30 days), message route only | Message Tracking***** | n\a | <15 minutes | Low-Moderate |
| Identify if and when a particular email was sent\received (>30 days), message route only | DPM restore of single\multiple databases from tape | n\a | <2 days | Low-Moderate |
| Identify if and when a particular email was sent\received (<14 days), entire message | Message Journaling****** | n\a | <1 hour | Low-Moderate |
| Identify if and when a particular email was sent\received (>14 days), entire message | Message Journaling | n\a | <1 hour | Low-Moderate |

* Whilst it is estimated that invoking the SCR targets might take less than 2 hours, the loss of an entire data centre affects the complete service (including the redirection of Outlook clients, the internet connection, and the recovery of all ancillary services, such as an archive solution), so resumption of full service may take more than 2 hours.
** The alternative to failing over the entire server to the CCR replica is to restore a single database from disk using DPM. This increases the impact for the users with mailboxes on the affected database but avoids any loss of service for users on the rest of the server.
*** The default Deleted Mailbox Retention period is 30 days, which is configurable.
**** The default Deleted Item Retention period is 14 days, which is configurable.
***** Message Tracking Logs are kept for 30 days by default. This is a configurable setting.
****** Currently it is assumed that all email is journaled and archived and retained for a period according to compliance requirements.
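
The age thresholds in the table and footnotes can be read as a simple decision rule. The Python sketch below is only an illustration of that rule, not anything Exchange or DPM provides: the function name, the request keys and the handling of a request that falls exactly on a boundary (treated here as already outside the retention window) are all assumptions.

```python
# Default windows quoted in the footnotes above; all three are configurable.
DELETED_MAILBOX_RETENTION_DAYS = 30
DELETED_ITEM_RETENTION_DAYS = 14
MESSAGE_TRACKING_LOG_DAYS = 30


def mitigation_for(request: str, age_days: int) -> str:
    """Return the mitigation the table assigns to a request of a given age.

    `request` uses hypothetical keys invented for this sketch. `age_days` is
    how long ago the deletion or the message delivery took place. A request
    landing exactly on a boundary is treated as outside the window
    (an assumption; the table only covers < and >).
    """
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if request == "mailbox_deleted":
        if age_days < DELETED_MAILBOX_RETENTION_DAYS:
            return "Deleted Mailbox Retention"
        return "DPM restore of database from tape"
    if request == "item_deleted":
        if age_days < DELETED_ITEM_RETENTION_DAYS:
            return "Deleted Item Retention"
        return "DPM restore of database from tape"
    if request == "message_route":
        if age_days < MESSAGE_TRACKING_LOG_DAYS:
            return "Message Tracking"
        return "DPM restore of single/multiple databases from tape"
    if request == "entire_message":
        # Journaling is assumed to retain everything for the compliance period.
        return "Message Journaling"
    raise ValueError(f"unknown request type: {request!r}")


if __name__ == "__main__":
    for request, age in [("item_deleted", 3), ("item_deleted", 20),
                         ("message_route", 45), ("entire_message", 45)]:
        print(f"{request} ({age} days ago): {mitigation_for(request, age)}")
```

If the retention periods in your environment differ from the defaults, changing the three constants shows immediately which requests move from a quick fix to a tape restore.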

To use an example from the table above: if an administrator is asked by a customer to identify an email that was sent or received over 30 days ago (not to provide the message itself, only to identify when it was sent and received), they would have to identify the databases where the sender and recipient mailboxes were located at the time of delivery, restore them from tape and try to find that message. This is a long and laborious task which might take up to 2 days. In my example I have assumed that the likelihood of this occurring is low-moderate. This exercise should highlight the areas where your proposed design doesn’t provide the protection that your specific company requires of it.
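
One rough way to make those weak areas stand out is to rank the rows of the table by likelihood multiplied by estimated recovery time. The sketch below does that in Python. The numeric weights for likelihood, the conversion of the recovery estimates to hours, the shortened scenario names and the scoring formula itself are all assumptions made for illustration; substitute whatever scale your organisation uses.

```python
from dataclasses import dataclass

# Hypothetical weights: the table only gives the labels, not numbers.
LIKELIHOOD_WEIGHT = {
    "Very low": 1, "Low": 2, "Low-Moderate": 3, "Moderate": 4, "High": 5,
}


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: str        # label taken from the table
    recovery_hours: float  # upper bound from the table, converted to hours

    @property
    def exposure(self) -> float:
        # Assumed score: likelihood weight x worst-case recovery time.
        return LIKELIHOOD_WEIGHT[self.likelihood] * self.recovery_hours


SCENARIOS = [
    Scenario("Data centre failure", "Very low", 2),
    Scenario("Server hardware failure", "Moderate", 0.25),
    Scenario("Storage failure", "Moderate", 0.25),
    Scenario("Database corruption (physical)", "Low", 0.25),
    Scenario("Database corruption (logical)", "Low", 2),
    Scenario("Mailbox deletion (<30 days)", "High", 0.25),
    Scenario("Mailbox deletion (>30 days)", "High", 8),
    Scenario("Item deletion (<14 days)", "High", 0.25),
    Scenario("Item deletion (>14 days)", "Moderate", 8),
    Scenario("Message route lookup (<30 days)", "Low-Moderate", 0.25),
    Scenario("Message route lookup (>30 days)", "Low-Moderate", 48),
    Scenario("Entire message lookup (<14 days)", "Low-Moderate", 1),
    Scenario("Entire message lookup (>14 days)", "Low-Moderate", 1),
]


def ranked(scenarios):
    """Return the scenarios ordered from highest to lowest exposure."""
    return sorted(scenarios, key=lambda s: s.exposure, reverse=True)


if __name__ == "__main__":
    for s in ranked(SCENARIOS):
        print(f"{s.exposure:7.2f}  {s.name}")
```

With these assumed weights, the scenarios that depend on a tape restore rise to the top of the list, which is consistent with the example above. A different weighting could reorder them, so treat the output as a prompt for discussion and not as a measurement.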

The next blog in this series is called ‘Recovery Scenarios for E2K7…..II’ and looks at each component of the design to determine which of them brings the most value at the smallest cost so that we can make a more informed decision as to which to choose to deploy…