GRS/RA-GRS in Azure and how to plan for DR

I was having a discussion yesterday and realized I had a gap in my knowledge of how to recover from a disaster using Azure Storage geo-replication. I did some research, and I want to share it with everyone. Plus, it has been ages since I've shared anything new here.

This is the best source I’ve seen on the details of our Storage story for DR, and from there, this link goes into even more detail.

The articles don’t explain as much as I would like on their own, but the details come through in the comments on the first link and in the body of the second.

Comment from the first article:

It is important to note that GRS does not fail over to the secondary location unless there is TOTAL data center outage. If a few clusters are down and you are affected you may find yourself dead in the water without a paddle until those clusters are fixed. We just found ourselves in this situation and don't believe it is clearly stated as it should be. We basically spent the $1000 professional direct support only to be told that fail over was not an option due to only a couple of clusters ( including ours) being affected. If you want true Geo Redundancy in the event of a cluster issue you have to implement it yourself.

So, based on the above, the customer does not have the ability to make this switch today; Microsoft makes the determination to fail over. This is discussed in the second article:

Recovery Time Objective (RTO): The other metric to know about is RTO. This is a measure of how long it takes us to do the failover, and get the storage account back online if we had to do a failover. The time to do the failover includes the following:

  • The time it takes us to investigate and determine whether we can recover the data at the primary location or if we should do the failover
  • Failover the account by changing the DNS entries

We take the responsibility of preserving your data very seriously, so if there is any chance of recovering the data, we will hold off on doing the failover and focus on recovering the data in the primary location. In the future, we plan to provide an API to allow customers to trigger a failover at an account level, which would then allow customers to control the RTO themselves, but this is not available yet.
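
Since the switch is done by changing DNS, one quick way to see which location your account is being served from is to resolve both the primary and the -secondary host names yourself. Below is a minimal sketch in plain Python; the account name is a placeholder and the script only does a DNS lookup, nothing Azure-specific:

    import socket

    # Placeholder - replace with your own storage account name.
    account = "mystorageaccount"

    for host in (f"{account}.blob.core.windows.net",
                 f"{account}-secondary.blob.core.windows.net"):
        try:
            # Each host name resolves to the stamp currently serving it; after a
            # Microsoft-initiated failover the primary name would start resolving
            # to the former secondary location.
            print(host, "->", socket.gethostbyname(host))
        except socket.gaierror as err:
            print(host, "-> could not resolve:", err)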

You can have more control over this if you use RA-GRS, because its secondary endpoint is readable at any time.
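
For example, with RA-GRS the read-only secondary endpoint can be queried at any time with the normal storage SDKs. Here is a minimal sketch using the azure-storage-blob Python package; the account name, key, container, and blob are all placeholders:

    from azure.storage.blob import BlobServiceClient

    # Placeholders - substitute your own account, key, container, and blob.
    ACCOUNT = "mystorageaccount"
    ACCOUNT_KEY = "<account-key>"

    # RA-GRS exposes a read-only mirror of the account at the -secondary host.
    secondary = BlobServiceClient(
        account_url=f"https://{ACCOUNT}-secondary.blob.core.windows.net",
        credential=ACCOUNT_KEY,
    )

    blob = secondary.get_blob_client(container="vhds", blob="myvm-osdisk.vhd")
    props = blob.get_blob_properties()   # reads work against the secondary
    print("Blob size at the secondary:", props.size, "bytes")
    # Writes against this endpoint would fail - the secondary is read-only.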

Comment from the first article:

The VHD is replicated to the secondary site. You can not use the VHD to create a VM, because the secondary is read-only. You would have to copy it to a storage account in a primary data center to create a VM from it. Of course, as discussed in your previous question, if something catastrophic happened and the secondary became the primary, you could create a VM from the VHD file. This redundancy feature is about storage, and only storage, not about recreating your VMs and everything that uses something in storage. You might want to check out Azure Backup for something like that.

So how do I move my VM to a different data center and start using my replicated storage myself in the case of an outage?

  1. Use RA-GRS
  2. Find the IP address and location of your <storage account>-secondary endpoint
  3. Create a new storage account on that same stamp/IP if possible, or on a different stamp in the same data center where the secondary is located
  4. Use the secondary endpoint on the RA-GRS account to copy the blob to the new storage account created above, making it a writable copy (see the copy sketch right after this list)
  5. Deploy your VM to that new data center (preferably through script) and bring it back up using the new storage.
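
Here is a rough sketch of step 4 with the azure-storage-blob Python package: a server-side copy from the RA-GRS secondary endpoint into the new, writable storage account. Every account name, key, container, and blob below is a placeholder, and I'm assuming the source blob is addressed with a read SAS token. Because the copy is asynchronous (and cross-stamp copies take time, see point 1 below), the sketch polls the copy status until it finishes:

    import time
    from azure.core.exceptions import ResourceExistsError
    from azure.storage.blob import BlobServiceClient

    # Placeholders - substitute your own accounts, keys, container, and blob.
    SOURCE_ACCOUNT = "mystorageaccount"        # the RA-GRS account
    SOURCE_SAS = "<read-sas-token>"            # SAS granting read on the source blob
    DEST_ACCOUNT = "mydraccount"               # new account created in step 3
    DEST_KEY = "<destination-account-key>"
    CONTAINER, BLOB = "vhds", "myvm-osdisk.vhd"

    # The source URL points at the read-only -secondary endpoint of the RA-GRS account.
    source_url = (f"https://{SOURCE_ACCOUNT}-secondary.blob.core.windows.net/"
                  f"{CONTAINER}/{BLOB}?{SOURCE_SAS}")

    dest_service = BlobServiceClient(
        account_url=f"https://{DEST_ACCOUNT}.blob.core.windows.net",
        credential=DEST_KEY,
    )
    dest_container = dest_service.get_container_client(CONTAINER)
    try:
        dest_container.create_container()
    except ResourceExistsError:
        pass                                   # container already exists

    # Kick off a server-side (asynchronous) copy into the writable account.
    dest_blob = dest_container.get_blob_client(BLOB)
    dest_blob.start_copy_from_url(source_url)

    # Poll until the copy completes; cross-stamp copies can take a while.
    while True:
        copy = dest_blob.get_blob_properties().copy
        if copy.status != "pending":
            print("Copy finished with status:", copy.status)
            break
        time.sleep(30)

Once the copy succeeds, step 5 is the usual VM provisioning from that VHD in the new data center, which is outside the scope of this sketch.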

Note that there are still a few other things to consider.

  1. If you can’t get on the same stamp the RA-GRS uses, then copies between stamps in the same data center take time. See this article - there is a chart there that details the time to copy between stamps and data centers.
  2. The Recovery Point Objective (RPO) is typically under 15 minutes for GRS/RA-GRS, but there is no SLA on geo-replication, so there may be some data loss since geo-replication is asynchronous (see the sketch after this list for one way to check the Last Sync Time). From another blog post:
    1. Recovery Point Objective (RPO): In GRS and RA-GRS the storage service asynchronously geo-replicates the data from the primary to the secondary location. If there was a major regional disaster and a failover had to be performed, then recent delta changes that had not been geo-replicated could be lost. The number of minutes of potential data lost is referred to as RPO (i.e., the point in time to which data can be recovered to). We typically have a RPO less than 15 minutes, though there is currently no SLA on how long geo-replication takes.
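
If you want to see how far behind the secondary is at any moment, the service stats reported by the -secondary endpoint include the Last Sync Time, i.e. the point up to which writes are known to have been geo-replicated. A minimal sketch, again with the azure-storage-blob Python package and placeholder account details (this only works for RA-GRS accounts):

    from azure.storage.blob import BlobServiceClient

    # Placeholders - substitute your own account and key.
    ACCOUNT = "mystorageaccount"
    ACCOUNT_KEY = "<account-key>"

    # Geo-replication stats are served from the read-only -secondary endpoint.
    secondary = BlobServiceClient(
        account_url=f"https://{ACCOUNT}-secondary.blob.core.windows.net",
        credential=ACCOUNT_KEY,
    )

    stats = secondary.get_service_stats()
    geo = stats["geo_replication"]
    print("Geo-replication status:", geo["status"])    # e.g. "live"
    print("Last sync time:", geo["last_sync_time"])
    # Anything written after last_sync_time could be lost in a failover (the RPO).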

Most of this information comes from blog posts, and the actual functionality is a moving target, so keeping up to date is an issue. Please comment as things change.