Understanding RPO and RTO considerations of Azure Solutions

Traditional on-premises disaster recovery (DR) discussions almost always include requirements around RPO and RTO.  I was recently asked whether RPO and RTO are still relevant, given the growing number of hybrid (on-prem to cloud) and cloud-to-cloud DR deployments.

 

In short, RPO and RTO are still extremely relevant design considerations.

 

To make sure we are using these terms consistently:

  • RPO stands for recovery POINT objective, i.e., how much data you are prepared and willing to lose in the worst case
  • RTO stands for recovery TIME objective, i.e., if/when the ‘bad thing’ happens, how long it takes to be back up and running again

 

Many clients' first reaction is that they want an RTO and RPO of zero (i.e., no data loss and no downtime).

 

While this is technically possible, an RPO of zero requires synchronous replication.  Synchronous replication, by design, requires every write, update, and delete to be committed in multiple locations before an ACK is returned to the application.  These additional round trips to multiple locations may degrade performance to an unacceptable degree, typically because of network distance and the associated latency (think speed-of-light overhead).
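
To put a rough number on the speed-of-light overhead, here is a minimal back-of-the-envelope sketch.  It assumes light travels through optical fiber at roughly 200,000 km/s (about two thirds of its speed in a vacuum) and ignores switching, queuing, and storage commit time, so real latencies will be higher.  The function names are illustrative only and are not part of any Azure API.

```python
# Rough lower bound on the latency a synchronous write pays for distance alone.
# Assumption: light in optical fiber travels at roughly 200,000 km/s.
FIBER_KM_PER_SECOND = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round trip (write out, ACK back) in milliseconds."""
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


def max_serial_writes_per_second(distance_km: float) -> float:
    """Upper bound on strictly sequential synchronous writes per second."""
    return 1000 / min_round_trip_ms(distance_km)


if __name__ == "__main__":
    # Example distances only; substitute the fiber distance between your sites.
    for km in (50, 500, 4000):
        print(f"{km} km: at least {min_round_trip_ms(km):.1f} ms per write, "
              f"at most {max_serial_writes_per_second(km):.0f} serial writes/s")
```

Even in this best case, every synchronous write to a site 1,000 km away waits about 10 ms for its ACK, which is why synchronous replication is usually limited to sites that are relatively close together.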

 

More traditional IaaS business continuity and disaster recovery solutions on Azure, such as Azure Backup and Azure Site Recovery (ASR), as well as many of our Azure Marketplace partner protection solutions, are generally asynchronous by design and therefore provide RPOs > 0.

 

From a design perspective, it is nearly impossible to guarantee specific RPOs and RTOs for these types of solutions because many variables are outside of your control.  However, here are some general guidelines.

 

The RPO of a backup solution depends mostly on the backup policy.  For example, if someone sets up a daily backup policy, then the worst-case RPO is close to a day.
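
The relationship between backup policy and RPO can be sketched as simple date arithmetic.  This is an illustration under stated assumptions, not how Azure Backup reports anything: it assumes each backup captures data as of the moment it starts and only becomes usable once it completes, and both function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Data written in this window is lost when restoring the last good backup."""
    if failure < last_good_backup:
        raise ValueError("failure time precedes the last good backup")
    return failure - last_good_backup


def worst_case_rpo(backup_interval: timedelta,
                   backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst case: the failure hits just before the next backup completes.

    Assumption: a backup captures data as of its start time and is only
    usable once it has finished.
    """
    return backup_interval + backup_duration


if __name__ == "__main__":
    # A daily policy whose backup job takes two hours to complete.
    print(worst_case_rpo(timedelta(hours=24), timedelta(hours=2)))
    # Backup taken at 01:00, failure at 23:00 on the same day.
    print(data_loss_window(datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 23, 0)))
```

If the resulting window is larger than the business can tolerate, the lever to pull is the policy itself: a shorter interval between backups.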

 

The RPO of a replication solution often depends mostly on the distance separating the two sites.  For example, when someone configures ASR to replicate across two regions, the RPO is more likely to be in the range of seconds to many seconds.

 

When designing for RTO, it is important to understand the variables that are not always in your control.  For example, if someone initiates a restore, the time it takes to be back up and running depends on variables like the size of the restore, the available network bandwidth, the speed of the disk drives/VMs, etc.
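
As a rough illustration of how restore size and bandwidth interact, the sketch below estimates only the data transfer portion of a restore.  The `efficiency` factor is an assumed fudge factor for protocol overhead and contention, not a measured value, and disk and VM speed are deliberately left out, so treat the result as a lower bound.

```python
def restore_transfer_hours(restore_size_gb: float,
                           bandwidth_mbps: float,
                           efficiency: float = 0.7) -> float:
    """Estimate the hours needed just to move the restore data over the network.

    restore_size_gb: size of the restore in decimal gigabytes.
    bandwidth_mbps:  nominal link speed in megabits per second.
    efficiency:      assumed fraction of the link actually usable (0 to 1].
    """
    if restore_size_gb < 0:
        raise ValueError("restore_size_gb must not be negative")
    if bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    size_megabits = restore_size_gb * 1000 * 8
    seconds = size_megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Example figures only: a 1 TB restore over a 1 Gbps link.
    print(f"{restore_transfer_hours(1000, 1000):.1f} hours")
```

Even with a fully usable 1 Gbps link, moving 1 TB takes more than two hours, before any time spent rehydrating disks or starting VMs.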

 

In a more traditional DR failover scenario, whether on-prem to cloud or cloud to cloud, it is common to use a service like Azure Site Recovery.  Since the data has already been replicated, the RTO in this case depends on factors such as how long it takes to provision the DR infrastructure on the ‘other side’, the speed of the disk drives/VMs, the time to run the recovery plan, and the time to propagate the appropriate DNS changes to point to the ‘other’ side.  The result is generally in the range of minutes to many minutes.
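
One way to reason about a failover RTO is to add up the stages listed above and compare the total against the target.  The sketch below assumes the stages run strictly one after another, and every duration in it is a placeholder to be replaced with numbers measured in your own DR drills; none of them come from Azure Site Recovery.

```python
from datetime import timedelta
from typing import Mapping


def estimate_rto(stages: Mapping[str, timedelta]) -> timedelta:
    """Total failover time, assuming the stages run sequentially."""
    return sum(stages.values(), timedelta(0))


def meets_target(stages: Mapping[str, timedelta], rto_target: timedelta) -> bool:
    """True if the estimated failover time fits within the RTO target."""
    return estimate_rto(stages) <= rto_target


if __name__ == "__main__":
    # Placeholder durations for illustration only.
    failover_stages = {
        "provision DR infrastructure": timedelta(minutes=10),
        "boot VMs and attach disks": timedelta(minutes=5),
        "run recovery plan": timedelta(minutes=8),
        "propagate DNS changes": timedelta(minutes=5),
    }
    print("Estimated RTO:", estimate_rto(failover_stages))
    print("Meets 30-minute target:", meets_target(failover_stages, timedelta(minutes=30)))
```

Breaking the estimate down by stage also shows where the time goes, which is usually more useful than the total when deciding what to optimize.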

 

In summary, it is difficult to guarantee RPO/RTO targets because many dependencies are not necessarily in your control, but it is still critically important to understand your RPO and RTO targets when gathering requirements.  Knowing whether you truly require an RPO and/or RTO of zero, a minute or two, a few hours, a day, etc. can help you design the most appropriate Azure-based solution.