High(er) Availability Solution for Single-Instance Roles

Some I.T. system architectures include single-instance roles (a single VM in the role), whether because the software running in that tier won't support multiple instances, because running more than one is too expensive, or for some other reason. VMs sometimes go down in the public cloud, so these single-instance roles really should be resilient to failure even if they can't be truly highly available. This blog sets out one way to accomplish this with Microsoft Azure platform services. Please note this is not a "high availability" solution because the downtime (Recovery Time Objective or RTO) of the role is measured in minutes, not seconds. I call it "higher availability" because recovery is fully automated.

[Figure: SingleInstanceRole architecture diagram]

As depicted, an Azure AppInsights Availability Web Test is configured to ping a URL on the single-instance role. This ping can be configured to originate from multiple regions around the world and the results aggregated in various ways to yield success or failure. See this page for more details. Upon failure, the web test will alert. This page will give more details on setting up alerts. One way to automate response to an alert is a webhook. Learn more here. In my case, the webhook is triggering an Azure Automation Runbook. This runbook is a PowerShell script that cleans up my old failed VM and replaces it with a new one.

I hope I've made it look simple. It really is simple at this level, but some of the underlying details aren't. For example, how do you know which parameters to use when setting up the new VM? How exactly does an Availability Web Test raise its alerts? How are actions against your Azure subscription authenticated? Where does the Azure Automation Runbook come from? These are the things that I'll explain in more detail.

The architecture diagram shows a system in place as opposed to showing the way it came to be. The system is created with an ARM template in Microsoft Azure. The ARM template is created with a set of parameters so you can vary things like storage accounts, virtual networks, IP address names, and so on. It's these parameters that you need to remember for later in case you have to replace the VM in a single-instance role. Azure Automation has a facility for remembering them: Azure Automation Variable Assets. These variables will receive their values through an additional resource in the originating ARM template, which I will illustrate. The ARM template that creates the replacement VM could be a simple copy of an example template, but more likely you'll have to tailor one to fit. Your originating template will create lots of things but your tailored one will just create the VM.

From start to finish the process looks like this at a high level:

  • Create resource group using the Azure Portal
  • Create Azure Automation account in that resource group using the Azure Portal
  • Use PowerShell/Azure CLI/Azure Portal to execute your originating ARM template
  • Create the Azure AppInsights component and Availability Web Test, then configure the webhook

There's a very good reason to create the Azure Automation account in the Azure Portal: it does a lot of other necessary work for you in the process. This page takes you through it. The page also explains how your runbooks will authenticate using an Azure Run As account. We're not interested in the comments regarding the Classic Run As account. A few of the Integration Module Assets need to be updated before your runbook will run correctly, and because of dependencies the order matters. Here's the list (update the first two in the order shown; after that, the rest can be updated simultaneously):

  • AzureRm.Profile
  • Azure.Storage
  • AzureRm.Compute
  • AzureRm.Storage
  • AzureRm.Resources

The originating ARM template will need to be augmented to add a few Azure Automation Assets, specifically the runbook and variables. The runbook script, by current standards, should be placed into a "scripts" folder off of the main template directory. Ultimately, it needs to be in a reachable web location for the ARM engine to be able to access it. Here's some sample template script that will install the runbook and variables:

 

"resources": [
    {
        "name": "[parameters('automationAccountName')]",
        "type": "Microsoft.Automation/automationAccounts",
        "apiVersion": "[variables('automationAPIVersion')]",
        "location": "[parameters('automationAccountLocation')]",
        "properties": {
            "sku": {
                "name": "[parameters('automationAccountSku')]"
            }
        },
        "resources": [
            {
                "name": "runbook_deployment",
                "type": "runbooks",
                "apiVersion": "[variables('automationAPIVersion')]",
                "location": "[parameters('automationAccountLocation')]",
                "dependsOn": [
                    "[concat('Microsoft.Automation/automationAccounts/', parameters('automationAccountName'))]"
                ],
                "properties": {
                    "logVerbose": false,
                    "logProcess": true,
                    "description": "Triggered by availability web test, removes failed VM, replaces it.",
                    "runbookType": "PowerShell",
                    "publishContentLink": {
                        "uri": "[concat(parameters('templateScriptsUrl'), parameters('automationRunbookName'))]"
                    }
                }
            },
            {
                "name": "recovery_resourceGroup",
                "type": "variables",
                "apiVersion": "[variables('automationAPIVersion')]",
                "location": "[parameters('automationAccountLocation')]",
                "dependsOn": [
                    "[concat('Microsoft.Automation/automationAccounts/', parameters('automationAccountName'))]"
                ],
                "tags": {},
                "properties": {
                    "value": "[concat('\"',resourceGroup().name,'\"')]"
                }
            },
            {
                "name": "recovery_templateUri",
                "type": "variables",
                "apiVersion": "[variables('automationAPIVersion')]",
                "location": "[parameters('automationAccountLocation')]",
                "dependsOn": [
                    "[concat('Microsoft.Automation/automationAccounts/', parameters('automationAccountName'))]"
                ],
                "tags": {},
                "properties": {
                    "value": "[concat('\"',parameters('TemplateUrl'),'\"')]"
                }
            },
            {
                "name": "recovery_osSettings",
                "type": "variables",
                "apiVersion": "[variables('automationAPIVersion')]",
                "location": "[parameters('automationAccountLocation')]",
                "dependsOn": [
                    "[concat('Microsoft.Automation/automationAccounts/', parameters('automationAccountName'))]"
                ],
                "tags": {},
                "properties": {
                    "value": "[concat('\"',replace(string(parameters('osSettings')),'\"', '\\\"'),'\"')]"
                }
            },
            {
                "name": "recovery_alertTime",
                "type": "variables",
                "apiVersion": "[variables('automationAPIVersion')]",
                "location": "[parameters('automationAccountLocation')]",
                "dependsOn": [
                    "[concat('Microsoft.Automation/automationAccounts/', parameters('automationAccountName'))]"
                ],
                "tags": {},
                "properties": {
                    "value": "[concat('\"','8/19/2016 5:37:33 PM','\"')]"
                }
            }
        ]
    }
]

Particularly of note is the escaping of the variable values. In the example above, osSettings looks like this:

 

"osSettings": {
  "imageReference": {
    "publisher": "me",
    "offer": "my-enterprise-base-image",
    "sku": "my-on-ubuntu-14-04-lts",
    "version": "1.0.9"
  },
  "scripts": [
    "[concat(variables('templateScriptsUrl'), 'vm-disk-utils.sh')]",
    "[concat(variables('templateScriptsUrl'), 'setup.sh')]"
  ]
}

The value must be escaped as depicted above (in the ARM snippet) so that it arrives correctly into the Azure Automation Variable Asset. Once retrieved into the runbook, it must be further processed to make it suitable for passage along to the ARM template that re-creates the VM. These details are in this blog.
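To make the escaping concrete, here is a minimal local sketch of the round trip, with no Azure calls involved. It mimics the template expression with plain string operations and then decodes twice. The example.invalid URLs are placeholders, and the first ConvertFrom-Json call is only a stand-in for Get-AutomationVariable, on the assumption that the cmdlet hands back the decoded string (which is what the runbook's own use of ConvertFrom-Json suggests).

```powershell
# Hypothetical stand-in for the osSettings template parameter.
$osSettings = [ordered]@{
    imageReference = [ordered]@{
        publisher = 'me'
        offer     = 'my-enterprise-base-image'
        sku       = 'my-on-ubuntu-14-04-lts'
        version   = '1.0.9'
    }
    scripts = @(
        'https://example.invalid/scripts/vm-disk-utils.sh',
        'https://example.invalid/scripts/setup.sh'
    )
}

# Roughly what string(parameters('osSettings')) produces: the object as JSON text.
$jsonText = $osSettings | ConvertTo-Json -Compress -Depth 5

# Same idea as concat('"', replace(..., '"', '\"'), '"') in the template:
# escape the inner quotes, then wrap the text in quotes to form a JSON string literal.
$assetValue = '"' + $jsonText.Replace('"', '\"') + '"'

# Assumption: stand-in for Get-AutomationVariable returning the decoded string.
$retrieved = ConvertFrom-Json -InputObject $assetValue

# Second decode, as the runbook does, yields the structured object again.
$decoded = ConvertFrom-Json -InputObject $retrieved
$decoded.imageReference.sku
```

Note that the replace() call only escapes double quotes, so a parameter value that contains backslashes would need extra handling.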

Once the template is complete and the runbook is installed, we need to create a webhook for it. This is done in the Azure Portal according to these instructions. Save the webhook URL for the next step.

The last step in the process is to create the Azure AppInsights objects and configure them. Once this is done, your VM will be monitored by the web test and protected by your runbook. In the Azure Portal, create an Azure AppInsights component if you don't already have one. It can go into the same resource group as your system or a different one. AppInsights isn't in all regions so pick one near to your system deployment. I selected "Other" for Application Type for just monitoring my VM endpoint, but that's not important.

[Screenshot: monitorMyVM]

Then in Settings / Investigate / Availability, click "+ Add Web Test" and configure.

[Screenshot: testMyVMEndpoint]

Finally, paste your webhook address into the "Alert" blade.

[Screenshot: myWebhook]

The only thing left is the runbook itself. Its job is to replace the problem VM. In my case, I delete the failed VM and then build a new one based on the same or very similar ARM template using the same variable values as the original. Here's the script:

 

"VM repair job initializing..."

# get the time of the last alert that was acted on (stored in the variable as a string)
[datetime]$alertTime = Get-AutomationVariable -Name 'recovery_alertTime'
$now = Get-Date

# if the last alert acted on is less than 20 minutes old, this one is redundant: exit
if ($now.AddMinutes(-20) -lt $alertTime) {
    "Redundant alert ignored..."
    exit
}

# still here? new alert, record the time
Set-AutomationVariable -Name 'recovery_alertTime' -Value $now.ToString()

"VM repair job proceeding..."

$connectionName = "AzureRunAsConnection"
try
{
    # Get the connection "AzureRunAsConnection"
    $servicePrincipalConnection = Get-AutomationConnection -Name $connectionName

    "Logging in to Azure..."
    Add-AzureRmAccount `
        -ServicePrincipal `
        -TenantId $servicePrincipalConnection.TenantId `
        -ApplicationId $servicePrincipalConnection.ApplicationId `
        -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint 
}
catch {
    if (!$servicePrincipalConnection)
    {
        $ErrorMessage = "Connection $connectionName not found."
        throw $ErrorMessage
    } else{
        Write-Error -Message $_.Exception
        throw $_.Exception
    }
}

$rgName = Get-AutomationVariable -Name 'recovery_resourceGroup'
$templateUri = Get-AutomationVariable -Name 'recovery_templateUri'

# delete the failed vm  
Remove-AzureRmVM -ResourceGroupName $rgName -Name my-vm -Force

# remove its VHDs
$accts = Get-AzureRmStorageAccount -ResourceGroupName $rgName
$acct0 = $accts[0].StorageAccountName

$acctKeys = Get-AzureRmStorageAccountKey -ResourceGroupName $rgName -Name $acct0
$acctKey = $acctKeys[0].Value

$ctx = New-AzureStorageContext -StorageAccountName $acct0 -StorageAccountKey $acctKey

Remove-AzureStorageBlob -Context $ctx -Container vhds -Blob 'my-vm-osdisk.vhd'
Remove-AzureStorageBlob -Context $ctx -Container vhds -Blob 'my-vm-datadisk1.vhd'
Remove-AzureStorageBlob -Context $ctx -Container vhds -Blob 'my-vm-datadisk2.vhd'

# get the ARM template settings from variables established by initial provisioning template
$osSettingsString = Get-AutomationVariable -Name 'recovery_osSettings'

# osSettings: rebuild the parsed JSON as nested hashtables for the template parameters
$osSettingsObject = ConvertFrom-Json -InputObject $osSettingsString
$imageReference = @{
    'publisher' = $osSettingsObject.imageReference.publisher
    'offer'     = $osSettingsObject.imageReference.offer
    'sku'       = $osSettingsObject.imageReference.sku
    'version'   = $osSettingsObject.imageReference.version
}
$scripts = @($osSettingsObject.scripts[0], $osSettingsObject.scripts[1])
$osSettings = @{"imageReference"=$imageReference; "scripts"=$scripts}

# the rest
$parameters = @{}
$parameters.Add("osSettings", $osSettings)

try {
    # run the tailored template that re-creates the VM
    New-AzureRmResourceGroupDeployment `
        -Mode Incremental `
        -Name myTestingDeployment `
        -ResourceGroupName $rgName `
        -TemplateUri $templateUri `
        -TemplateParameterObject $parameters
}
catch {
    Write-Error -Message $_.Exception
    throw $_.Exception
}
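The runbook rebuilds osSettings by hand, property by property. If your settings object has more fields, a small recursive helper can do the same conversion generically. This is only a sketch: ConvertTo-Hashtable is a hypothetical helper name, not a built-in cmdlet, the JSON string is a shortened stand-in for the stored variable value, and it assumes the input came from ConvertFrom-Json (custom objects, arrays, and simple values only).

```powershell
# Hypothetical helper: turn ConvertFrom-Json output into nested hashtables and arrays.
function ConvertTo-Hashtable {
    param($InputObject)

    # JSON objects arrive as PSCustomObject: copy each property into a hashtable.
    if ($InputObject -is [System.Management.Automation.PSCustomObject]) {
        $table = @{}
        foreach ($property in $InputObject.PSObject.Properties) {
            $table[$property.Name] = ConvertTo-Hashtable -InputObject $property.Value
        }
        return $table
    }

    # JSON arrays arrive as arrays: convert each element, keep the array shape.
    if ($InputObject -is [array]) {
        $items = @(foreach ($item in $InputObject) { ConvertTo-Hashtable -InputObject $item })
        return ,$items
    }

    # Strings, numbers, booleans and $null pass through unchanged.
    return $InputObject
}

# Shortened stand-in for the value read from 'recovery_osSettings'.
$osSettingsString = '{"imageReference":{"publisher":"me","sku":"my-on-ubuntu-14-04-lts"},"scripts":["vm-disk-utils.sh","setup.sh"]}'
$osSettings = ConvertTo-Hashtable -InputObject (ConvertFrom-Json -InputObject $osSettingsString)
$parameters = @{ osSettings = $osSettings }
```

The resulting $parameters has the same shape as the hand-built version, so it could be passed to -TemplateParameterObject in the same way.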

Each web test failure event actually causes two alerts to be sent out from Azure AppInsights. The first is the "failure" alert and the second is the "recovery" alert. The 'recovery_alertTime' variable exists to keep the "recovery" alert from triggering the runbook a second time: the runbook records the time whenever it acts on an alert and ignores any alert that arrives within 20 minutes of that time. For brevity, I didn't include all of the variables that you would normally see, such as admin username and password, storage settings and so on. You could elect to reuse the same VHDs rather than deleting them. It all depends on what you think the source of your failures will be.
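The redundant-alert check is easy to try out locally by pulling it into a small function. This is a sketch of the same comparison the runbook makes. Test-RedundantAlert and its parameter names are hypothetical, and the 20-minute default simply mirrors the value hard-coded in the script.

```powershell
# Hypothetical helper wrapping the runbook's time-window comparison.
function Test-RedundantAlert {
    param(
        [datetime] $LastAlertTime,   # when the runbook last acted on an alert
        [datetime] $Now,             # time the current alert is being processed
        [int] $WindowMinutes = 20    # alerts inside this window are ignored
    )

    # True when the last handled alert falls inside the window ending at $Now.
    return ($Now.AddMinutes(-$WindowMinutes) -lt $LastAlertTime)
}

$now = [datetime]'2016-08-19T18:00:00'

# A "recovery" alert arriving 5 minutes after the failure was handled.
Test-RedundantAlert -LastAlertTime $now.AddMinutes(-5) -Now $now

# A fresh failure arriving 45 minutes after the previous one was handled.
Test-RedundantAlert -LastAlertTime $now.AddMinutes(-45) -Now $now
```

The first call reports the alert as redundant and the second does not. The window length is a tuning choice: it needs to be longer than the gap you expect between the "failure" and "recovery" alerts, but shorter than the time in which a genuinely new failure should still be acted on.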