Azure IaaS VMs and the Three (now Four!) R's

 updated 27 May 2016

When you have a problematic IaaS VM (won't start, won't stop, can't RDP even though it worked just yesterday...) and you've exhausted your usual tricks, turn to the Four (formerly Three) R's, the "did you reboot?" of Azure.

  1. Restart - Most users will think of this, but be sure to restart from Azure (portal or PowerShell) and not from inside the VM, and do a full stop and start rather than using the restart button. The Azure restart gives the fabric a chance to look for issues and self-heal. Note: If you have classic VMs (old portal), this also applies to your cloud service. I've seen VMs act up due to issues with their cloud service. You can try restarting the cloud service, but keep in mind that all VMs hosted by the cloud service will be restarted as well!
  2. Resize - Resizing (esp. if you increase the VM size to the largest possible) recreates certain elements of the VM <-> Fabric connection and could even move you to a new cluster node on the backend. Resize, test to confirm it's working, and then resize back to your original size. You will get charged at the higher VM size rate, but if it's only for 15 min that's a minimal cost. *Note: If you're in ARM (VMs deployed via the new portal, not classic VMs) you can directly redeploy to a new cluster node using Azure PowerShell: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-redeploy-to-new-node/ (step 4 below)
  3. Recreate - This one sounds scary, but if you make a note of your current configuration FIRST, the recreation is quick (generally <20 min) and relatively painless. If you need to move your VM to a new cloud service, VNet, etc., or are having those "it's just acting up" issues, this is a good troubleshooting step to try. You are basically removing the Azure components and then recreating them (modifying if needed), all while leaving your disks untouched. See https://www.petri.com/recreate-virtual-machine-in-microsoft-azure for step-by-step instructions. *Note: The link is specific to ASM (classic portal), but the premise works for both classic and ARM VMs.
  4. Redeploy - Available only for ARM VMs (not classic), in the new portal. Effectively the same as a resize, except that you are guaranteed to change nodes. https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-redeploy-to-new-node/
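
For ARM VMs, the scriptable steps above (restart, resize, redeploy) can be sketched as a small Python helper that only builds the commands and never runs them. This is a rough illustration, not part of the original write-up: it assumes the cross-platform Azure CLI (`az`) rather than the portal or PowerShell mentioned above, and the resource group, VM name, and VM sizes are hypothetical placeholders. Recreate is left out on purpose because it depends on the configuration you noted down first.

```python
from typing import List

Command = List[str]


def _vm_cmd(action: str, resource_group: str, vm_name: str, *extra: str) -> Command:
    """Build one `az vm <action>` command line as an argument list."""
    return ["az", "vm", action,
            "--resource-group", resource_group,
            "--name", vm_name, *extra]


def stop_start(resource_group: str, vm_name: str) -> List[Command]:
    """Step 1: a full stop (deallocate) and start, not an in-guest reboot."""
    return [_vm_cmd("deallocate", resource_group, vm_name),
            _vm_cmd("start", resource_group, vm_name)]


def resize_round_trip(resource_group: str, vm_name: str,
                      temporary_size: str, original_size: str) -> List[Command]:
    """Step 2: resize up, then (after testing by hand) back to the original size."""
    return [_vm_cmd("resize", resource_group, vm_name, "--size", temporary_size),
            _vm_cmd("resize", resource_group, vm_name, "--size", original_size)]


def redeploy(resource_group: str, vm_name: str) -> List[Command]:
    """Step 4: ask Azure to move the VM to a new node."""
    return [_vm_cmd("redeploy", resource_group, vm_name)]


def escalation_plan(resource_group: str, vm_name: str,
                    temporary_size: str, original_size: str) -> List[Command]:
    """Commands in escalating order; test the VM between steps and stop when it works."""
    return (stop_start(resource_group, vm_name)
            + resize_round_trip(resource_group, vm_name, temporary_size, original_size)
            + redeploy(resource_group, vm_name))


if __name__ == "__main__":
    # All four values below are placeholders, not taken from a real deployment.
    for cmd in escalation_plan("my-rg", "my-vm", "Standard_DS5_v2", "Standard_DS2_v2"):
        print(" ".join(cmd))
```

Printing the commands instead of executing them is deliberate: you want to check the VM after each step and stop at the first one that fixes it, and remember the resize back down so you aren't billed at the larger size.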

If these all fail, and you already confirmed there are no outages that could impact you (https://azure.microsoft.com/en-us/status/), it may be time to engage Microsoft.

See also https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-windows-allocation-failure/

Thanks to my coworkers Kyle and Madan for their peer review!