Recover Azure VM by attaching OS disk to another Azure VM


[Update:1/28/2015] These steps apply only to classic VMs. If you need assistance with Resource Manager VMs, please contact Microsoft Support.

If you are unable to connect to an Azure VM with RDP or SSH even after restarting and resizing the VM, you can use the following steps to make it accessible again.

Manual Recovery Steps

Be aware of the following when using these steps to recover your VM:

  • Use the classic portal to perform the recovery steps (https://manage.windowsazure.com/). Recovery steps using the new portal will be provided at a later date.
     
  • As a precaution, we recommend that you first back up the problem VM's OS disk and data disks. You can use an Azure Storage tool such as Microsoft Azure Storage Explorer, the AzCopy tool, or Azure PowerShell's Start-AzureStorageBlobCopy cmdlet to create backups of the VHD files.
      
  • By default, the D: drive is the temporary storage drive. It is reset when an Azure VM is resized, placed in the Stopped (Deallocated) state (via Shutdown in the portal), or recreated from the same disk as in the steps below. See also About Virtual Machine Disks in Azure.
     
  • The external IP (Virtual IP or VIP) will change if you recreate the VM without keeping another VM running in the same cloud service. To prevent this, keep another VM running in the cloud service while performing the steps.
     
  • The internal IP address will change when you recreate the VM if it is not in a virtual network. Even for a VM in a virtual network, the internal IP address may change after recreating the VM if another VM has taken the IP address it previously had.
     
  • Before recreating the VM as in the steps below, write down or copy and paste the disk names from the Disk column at the bottom of the VM's Dashboard. You will need to know which disks belong to that VM so that you can recreate it with the same disks.


     

You can troubleshoot the VM by attaching the OS disk as a data disk to another Azure VM using the steps below.

  1. Create a new VM in the same cloud service as the problem VM, using a gallery image of the same OS version as the problem VM. You will use this VM temporarily for troubleshooting.

    For example if the problem VM is Windows Server 2012 R2, create the troubleshooting VM from the Windows Server 2012 R2 gallery image. Similarly, if the problem VM is Ubuntu 14.04 LTS, create a troubleshooting VM from the 14.04 LTS gallery image.

    To find the cloud service of the problem VM, select Virtual Machines in the management portal, select the problem VM, and then select Dashboard. Under Quick Glance on the right, the first part of the DNS Name is the name of the cloud service.

    For example, if the DNS Name is clnov8ws12r2a.cloudapp.net, the cloud service name is clnov8ws12r2a.
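If you script any part of this procedure, the rule above (the cloud service name is the part of the DNS Name before the first dot) can be expressed as a small helper. This is a minimal Python sketch, not part of any Azure SDK; the function name cloud_service_name is hypothetical, and it assumes a classic <service>.cloudapp.net DNS name.

```python
def cloud_service_name(dns_name: str) -> str:
    """Return the cloud service name from a classic VM's DNS Name.

    Assumes the form "<service>.cloudapp.net"; the service name is the
    first dot-separated label.
    """
    name = dns_name.strip()
    if not name.lower().endswith(".cloudapp.net"):
        raise ValueError(f"not a classic cloudapp.net DNS name: {dns_name!r}")
    return name.split(".", 1)[0]


print(cloud_service_name("clnov8ws12r2a.cloudapp.net"))  # clnov8ws12r2a
```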


     

  2. After creating a new VM in the same cloud service as the problem VM, select Virtual Machines on the left, click the problem VM on the right, then click Dashboard.


       

  3. Make note of the OS disk name in the Disks section at the bottom of the dashboard, since you will be using it later to recreate the VM. The disk name is under the Disk column on the far left. 
     

     
  4. Click Delete at the bottom right of the page, then click Keep the attached disks. This is necessary so the OS disk is not in use and can be attached to another VM.

    Click Yes on the prompt asking if you want to continue. The prompt explains that the attached disks and their VHD files won't be deleted from your storage account.

  5. Click Virtual Machines on the left, and then click the Disks tab at the top right.

    Find the disk name you noted in Step 3, and wait for the Attached To column to be blank. This can take up to 5 minutes after deleting the VM, though it is usually much faster.


     

  6. Click Virtual Machines on the left, and select the troubleshooting VM that you will use to attach the OS disk of the problem VM. Select Dashboard, then select Attach and then Attach Disk at the bottom of the dashboard. 
     

     
  7. In the Attach a disk to the virtual machine dialog, select Available Disks and choose the disk from the problem VM (you noted the disk name in Step 3). Leave Host Cache Preference at the default setting of None, and click OK.
     

     
    If you do not see the disk here, either the troubleshooting VM is in a different location than the problem VM (for example, the troubleshooting VM is in West US and the problem VM is in East US), or the disk has not yet been freed for reuse and its Attached To column still shows the problem VM name instead of being blank.
     
  8. When the disk is attached to the second VM, the portal shows the message "Successfully attached disk <disk name> to virtual machine <name of troubleshooting VM>".
      
  9. Click Connect to make an RDP connection to the troubleshooting VM. If the troubleshooting VM is running Linux, connect to it with SSH instead.
      
  10. For Linux VMs, skip to Step 22. For Windows VMs, in the troubleshooting VM, go to Start, Search, type diskmgmt.msc and press Enter to open the Disk Management tool.
     
  11. If the disk you just added shows up as Offline, right-click it and select Online. Most Azure VMs are configured to bring new disks online automatically, so this step may not be necessary.


     

  12. After making sure the disk is Online, verify that each volume on the disk has a drive letter assigned. The specific drive letters assigned are not important.

    If any of the volumes do not have a drive letter, right-click the volume and select Change Drive Letter and Paths, then Add. Select Assign the following drive letter, accept the next available drive letter, and then click OK. Again, the actual drive letters used don't matter.


     

  13. Open an elevated CMD prompt and run Chkdsk on each partition of the disk to resolve possible file system consistency issues.

    For example, if the disk from the problem VM has two partitions assigned the letters E: and F:, you would run the following Chkdsk commands:

    chkdsk E: /F
    chkdsk F: /F
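The per-partition Chkdsk runs in Step 13 can also be driven from a script. The Python sketch below is an illustration only: the helper names are hypothetical, running it for real requires Windows and an elevated prompt on the troubleshooting VM, and by default it only prints the commands (dry_run=True). Chkdsk may ask whether to force a dismount of a volume that is in use, so the sketch leaves the console attached rather than capturing input and output.

```python
import subprocess


def chkdsk_commands(drive_letters):
    """Build one 'chkdsk X: /F' argument list per volume letter (Step 13)."""
    return [["chkdsk", f"{letter.rstrip(':').upper()}:", "/F"] for letter in drive_letters]


def run_chkdsk(drive_letters, dry_run=True):
    """Print the commands (dry run) or run them one after another.

    Returns (volume, exit code) pairs for the commands actually run.
    """
    results = []
    for cmd in chkdsk_commands(drive_letters):
        if dry_run:
            print(" ".join(cmd))
            continue
        # stdin/stdout stay on the console so Chkdsk prompts can be answered.
        completed = subprocess.run(cmd, check=False)
        results.append((cmd[1], completed.returncode))
    return results


# For the example disk with volumes E: and F:, this prints both commands.
run_chkdsk(["E", "F"])
```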

     

  14. Find the boot partition by running dir /a <driveletter>:\boot\bcd for each partition on the disk. 

    If you are unable to find the BCD store that way, instead run dir /a <driveletter>:\efi\microsoft\boot\bcd to find it.

    For example, the screenshot in Step 12 shows the disk has partitions E: and F:, in which case you would run the commands below, and the output shows that the F: drive is the boot partition.

    The actual drive letters may be different. If there is only one partition on the disk, that partition is both the boot and OS partition.

    C:\>dir /a e:\boot\bcd
    The system cannot find the file specified.

    C:\>dir /a f:\boot\bcd

    Volume in drive F has no label.
    Volume Serial Number is 00F9-C289

    Directory of f:\boot

    02/22/2016  01:07 AM            28,672 BCD
                  1 File(s)         28,672 bytes
                  0 Dir(s)      75,763,712 bytes free
     

  15. Find the OS partition by running dir /a <drive letter>:\windows\system32\winload.exe for each partition on the disk.

    For example, the screenshot in Step 12 shows the disk has partitions E: and F:, in which case you would run the commands below, and the output shows that the E: drive is the OS partition.

    The actual drive letters may be different. If there is only one partition on the disk, that partition is both the boot and OS partition.

    C:\>dir /a e:\windows\system32\winload.exe
    Volume in drive E has no label.
    Volume Serial Number is CC42-E527
    Directory of e:\windows\system32

    11/22/2015  06:59 AM         1,519,592 winload.exe
                   1 File(s)      1,519,592 bytes
                   0 Dir(s)  125,804,810,240 bytes free

    C:\>dir /a f:\windows\system32\winload.exe
    The system cannot find the path specified.
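Steps 14 and 15 amount to probing each volume for two files: a BCD store (\boot\bcd, or \efi\microsoft\boot\bcd as the fallback given above) marks the boot partition, and \windows\system32\winload.exe marks the OS partition. A hedged Python sketch of that search follows; find_boot_and_os is a hypothetical helper that takes a mapping of labels to root paths, so it can be tried against any directory tree and not only drive letters.

```python
from pathlib import Path

# Paths probed in Steps 14 and 15 (Windows treats them case-insensitively).
BCD_PATHS = (Path("boot") / "bcd", Path("efi") / "microsoft" / "boot" / "bcd")
WINLOAD_PATH = Path("windows") / "system32" / "winload.exe"


def find_boot_and_os(roots):
    """Return (boot_label, os_label) for a dict of label -> volume root.

    boot_label is the first volume holding a BCD store (Step 14) and
    os_label the first holding winload.exe (Step 15); None if not found.
    On a single-partition disk both labels are the same.
    """
    boot = next(
        (label for label, root in roots.items()
         if any((Path(root) / p).is_file() for p in BCD_PATHS)),
        None,
    )
    os_label = next(
        (label for label, root in roots.items()
         if (Path(root) / WINLOAD_PATH).is_file()),
        None,
    )
    return boot, os_label


# On the troubleshooting VM, the call would look like:
# find_boot_and_os({"E": "E:\\", "F": "F:\\"})
```

For the example disk in Steps 14 and 15, that call would report F as the boot partition and E as the OS partition.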
     

  16. Run the following command to set the default BCD entry:

    for /f "tokens=1,2" %i in ('bcdedit /store f:\boot\bcd /enum bootmgr /v') do if /I "%i" == "displayorder" (bcdedit /store f:\boot\bcd /default %j)

    You should see the following as part of the output (the GUID will be different):

    if /I "displayorder" == "displayorder" (bcdedit /store f:\boot\bcd /default {8987d655-64f9-4ca1-bf51-e70f430dccd3})
    The operation completed successfully.
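The for /f loop in Step 16 scans the bcdedit /enum bootmgr /v output for the line whose first token is displayorder and passes the second token (the GUID) to bcdedit /default. The same logic in Python, as an illustrative sketch with hypothetical function names, looks like this. (If you put the original loop in a .cmd batch file instead of typing it at the prompt, write the loop variables as %%i and %%j.)

```python
def first_displayorder(bootmgr_output):
    """Return the second token of the first 'displayorder' line, or None.

    Mirrors: for /f "tokens=1,2" %i in (...) do if /I "%i" == "displayorder" ...
    """
    for line in bootmgr_output.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0].lower() == "displayorder":
            return tokens[1]
    return None


def set_default_command(store, identifier):
    """Argument list for the bcdedit call that Step 16 runs."""
    return ["bcdedit", "/store", store, "/default", identifier]


sample = "displayorder            {8987d655-64f9-4ca1-bf51-e70f430dccd3}"
print(set_default_command(r"f:\boot\bcd", first_displayorder(sample)))
```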
     

  17. Now, when you run bcdedit /store f:\boot\bcd /enum bootmgr, you should see that default and displayorder are both set to {default}.

    C:\>bcdedit /store F:\boot\bcd /enum bootmgr

    Windows Boot Manager
    --------------------
    identifier              {bootmgr}
    device                  partition=F:
    description             Windows Boot Manager
    locale                  en-us
    inherit                 {globalsettings}
    default                 {default}
    displayorder            {default}
    toolsdisplayorder       {memdiag}
    timeout                 30
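To check the Step 17 result from a script, the name/value layout of bcdedit output can be parsed into a dictionary. This is a minimal Python sketch under two assumptions: names and values are separated by at least two spaces (as in the listing above), and lines that start with whitespace continue the previous name, as extra displayorder entries do. The function names are hypothetical.

```python
import re


def parse_bcd_entry(text):
    """Map each bcdedit field name to a list of its values.

    Header and dashed lines are skipped; indented value-only lines are
    appended to the previous field (e.g. extra displayorder entries).
    """
    fields, last_key = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"-"}:
            continue
        if raw[0].isspace() and last_key is not None:
            fields[last_key].append(line)
            continue
        parts = re.split(r"\s{2,}", line, maxsplit=1)
        if len(parts) == 2:
            last_key = parts[0]
            fields[last_key] = [parts[1]]
    return fields


def default_is_set(bootmgr_output):
    """True when default is {default} and displayorder starts with {default}."""
    fields = parse_bcd_entry(bootmgr_output)
    return (fields.get("default") == ["{default}"]
            and fields.get("displayorder", [None])[0] == "{default}")
```

Pass it the raw bcdedit output, which is not indented the way the listing in this article is.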
     

  18. Set recoveryenabled to Off and bootstatuspolicy to IgnoreAllFailures so normal startup is always performed, since Azure VMs do not have interactive console access.

    For example, if Step 14 showed the boot partition to be F:, you would run:

    bcdedit /store f:\boot\bcd /set {default} recoveryenabled Off
    bcdedit /store f:\boot\bcd /set {default} bootstatuspolicy IgnoreAllFailures
     

  19. Set device and osdevice to the appropriate drive letters if either shows as unknown under Windows Boot Loader.

    Run the following bcdedit command to view the boot loader entries. If the boot partition identified in Step 14 is not F:, change the command to reflect the boot partition's drive letter.

    bcdedit /store F:\boot\bcd /enum osloader

    If device or osdevice (or both) show unknown, set them to point to the OS partition you identified in Step 15.

    Here is an example where both device and osdevice show unknown. In that situation, set them to point to the OS partition.

    Windows Boot Loader
    -------------------
    identifier              {default}
    device                  unknown
    path                    \Windows\system32\winload.exe
    description             Windows Server 2012 R2
    locale                  en-US
    inherit                 {bootloadersettings}
    recoverysequence        {46ec6f24-b36a-11e5-80bb-00155d241af6}
    recoveryenabled         No
    osdevice                unknown
    systemroot              \Windows
    resumeobject            {60fcc223-a179-11e5-80b5-806e6f6e6963}
    nx                      OptOut

    IMPORTANT: Only run the following commands if device or osdevice (or both) show unknown under Windows Boot Loader.

    For example, if Step 14 showed the boot partition to be F: and Step 15 showed the OS partition to be drive E:, you would run the following command to set them to point to the OS partition:

    bcdedit /store F:\boot\bcd /set {default} device partition=E:
    bcdedit /store F:\boot\bcd /set {default} osdevice partition=E:
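The check in Step 19 (only change device and osdevice when they show unknown) can be written as a small function that reads bcdedit /store ... /enum osloader output and returns only the commands that are needed. This Python sketch is illustrative: the function names are hypothetical, it assumes entries in the output are separated by blank lines and fields by two or more spaces, and it only looks at the entry whose identifier is {default}, as the commands above do.

```python
import re


def default_loader_fields(osloader_output):
    """Fields of the Windows Boot Loader entry whose identifier is {default}."""
    for block in re.split(r"\n\s*\n", osloader_output):
        fields = {}
        for line in block.splitlines():
            parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
            if len(parts) == 2:
                fields[parts[0]] = parts[1]
        if fields.get("identifier") == "{default}":
            return fields
    return {}


def repair_commands(osloader_output, store, os_letter):
    """bcdedit commands for each of device/osdevice that shows 'unknown'."""
    fields = default_loader_fields(osloader_output)
    target = f"partition={os_letter.rstrip(':').upper()}:"
    return [
        ["bcdedit", "/store", store, "/set", "{default}", name, target]
        for name in ("device", "osdevice")
        if fields.get(name, "").lower() == "unknown"
    ]
```

An empty result means neither field shows unknown, so nothing needs to be set.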
     

  20. Use the following commands to back up the existing SYSTEM registry hive and then replace it with the copy in \RegBack\SYSTEM. Windows automatically backs up the registry hives to \RegBack every 10 days, so these steps restore only the SYSTEM hive, using a version that may be up to 10 days old.

    Note: The registry is on the OS partition you identified in Step 15. If the disk has two partitions, this is the larger of the two. For example, if Step 15 showed the OS partition to be E:, first check the size of the backup copy:

    dir E:\Windows\System32\Config\RegBack\SYSTEM

    Important: If the file size for E:\Windows\System32\Config\RegBack\SYSTEM is 0 bytes, do not run the move and copy commands below, but continue with the remaining steps.

    move E:\windows\system32\config\system E:\windows\system32\config\system_org
    copy E:\windows\system32\config\Regback\system E:\windows\system32\config\system
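The safety check and the two file operations in Step 20 can be expressed as one function, shown here as a Python sketch for illustration. restore_system_hive is a hypothetical helper, and os_root stands for the root of the OS partition from Step 15 (for example E:\). As a design choice, it refuses to run if a system_org file already exists, so a backup from an earlier attempt is not overwritten.

```python
import shutil
from pathlib import Path


def restore_system_hive(os_root):
    """Rename config\\SYSTEM to system_org and copy RegBack\\SYSTEM over it.

    Returns False and changes nothing when RegBack\\SYSTEM is missing or
    0 bytes, matching the warning in Step 20.
    """
    config = Path(os_root) / "Windows" / "System32" / "config"
    regback = config / "RegBack" / "SYSTEM"
    live, saved = config / "SYSTEM", config / "system_org"
    if not regback.is_file() or regback.stat().st_size == 0:
        return False
    if saved.exists():
        # Refuse to overwrite a backup left by an earlier attempt.
        raise FileExistsError(saved)
    live.rename(saved)            # like: move ...\config\system ...\config\system_org
    shutil.copy2(regback, live)   # like: copy ...\Regback\system ...\config\system
    return True
```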
      

  21. In Disk Management, right-click the disk from the problem VM and select Offline.
     

      
  22. For Windows VMs, skip to Step 27. For Linux VMs, run the following commands in the troubleshooting VM:

    sudo fdisk -l

    ls /dev/sdc*

    The ls command returns the whole disk, /dev/sdc, followed by its partitions in the form /dev/sdcX, where X is the partition number (for example, /dev/sdc1).
      

  23. Run the following command:

    mount | grep sdc
      

  24. If Step 23 returns any lines, run sudo umount /dev/sdcX for each one, where X is the partition number shown on that line.
      
  25. Run mount | grep sda, which returns output similar to this:

    /dev/sda1 on / type ext4 (rw,discard)

    The value after type (ext4 in this example) is the file system in use by this OS.
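Steps 23 to 25 read two things from mount output: which partitions of the attached disk are mounted (they must be unmounted before fsck), and the file system type of /. The Python sketch below parses lines of the form shown in Step 25; it assumes the attached disk is /dev/sdc, as these steps do, and the function names are hypothetical.

```python
import re

MOUNT_LINE = re.compile(
    r"^(?P<device>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<options>[^)]*)\)$"
)


def parse_mount(output):
    """Turn 'mount' output into dicts with device, target, fstype, options."""
    return [m.groupdict() for m in map(MOUNT_LINE.match, output.splitlines()) if m]


def root_fstype(output):
    """File system type mounted at '/', the value Step 25 asks you to note."""
    return next((e["fstype"] for e in parse_mount(output) if e["target"] == "/"), None)


def mounted_partitions(output, disk="/dev/sdc"):
    """Mounted partitions of the attached disk, which need 'sudo umount' (Step 24)."""
    pattern = re.escape(disk) + r"\d+"
    return [e["device"] for e in parse_mount(output) if re.fullmatch(pattern, e["device"])]
```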
     

  26. For each partition listed in Step 22, run sudo fsck -t X /dev/sdcY, where X is the file system type from Step 25 and Y is the partition number from Step 22.

    If you had additional data disks attached, you may need to recover them individually. They may have different file systems depending on how they were created.
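The partition list from Step 22 and the file system type from Step 25 combine into one fsck command per partition. This hedged Python sketch only builds and prints the argument lists (they could be passed to subprocess.run); it follows the assumptions in these steps that the disk is /dev/sdc and that every partition uses the same file system type as the troubleshooting VM's root, and its function names are hypothetical.

```python
import glob
import re


def sdc_partitions(pattern="/dev/sdc*"):
    """Partition nodes such as /dev/sdc1 and /dev/sdc2, in numeric order.

    The whole-disk node /dev/sdc that 'ls /dev/sdc*' also lists is skipped.
    """
    parts = [p for p in glob.glob(pattern) if re.search(r"\d+$", p)]
    return sorted(parts, key=lambda p: int(re.search(r"\d+$", p).group()))


def fsck_commands(partitions, fstype):
    """One 'sudo fsck -t <fstype> <partition>' argument list per partition."""
    return [["sudo", "fsck", "-t", fstype, part] for part in partitions]


for cmd in fsck_commands(sdc_partitions(), "ext4"):
    print(" ".join(cmd))
```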
     

  27. The remaining steps apply both to Windows and Linux.

    Back in the portal, on the Dashboard for the troubleshooting VM, select Detach to detach the disk.
     

     

  28. Now recreate the problem VM by creating a new VM from the OS disk you just repaired. Start the VM and connect to it with RDP or SSH. If any data disks were previously attached to the problem VM, you can attach them again now.
     
  29. If your VM is still not recovered, you will have to rebuild your VM and import your application-specific data (if any) into the new VM by following these steps:

    Follow Steps 1-11 to get the old OS disk attached as a data disk to a new VM.

    Copy your application-specific data from the old OS disk to the new VM.

Comments (15)

  1. Andre Zanca says:

    Before Step 17, we must delete current disk and create a new one (using the same .VHD) with the option 'The VHD contains an operating system.' selected.

  2. Daniel says:

    i get stuck on step 5, my disk still attached to the deleted VM already spent more than 5 minutes and still attached to

  3. Zach says:

    Worked great, saved the day!

  4. Osvaldo says:

    Step 20: missing "/": sudo fsck -t X /dev/sdcY

  5. Pedro Rebelo da Silva says:

    This solution worked for us after 3 days of services down but should Azure users be doing this when the problems were caused by Azure? Probably not.

  6. Steven says:

    This also fixed the problem for one of our VMs that now has been down for 3 days.

    Console access to VMs could have allowed us to solve this ourselves … #1 requested feature on uservoice! Please get to it.

  7. Walaa says:

    Didn't work with me for 2 different VMs, one in WEurope and one in NEurope!!

  8. Chris McKee says:

    The fact ANY of this is required considering the issues are caused by faults in Azure is ridiculous. If I take my camera to be calibrated and they damage it I expect them to do the repair, they don't send me the god-damn instructions and ask me to do it myself.

  9. Jurjen Thie says:

    The fastest way to connect to your machines is to clone one of the initial machines by using the template and place it in the same virtual network. After cloning you can use remote desktop to connect to the original machine using this clone.

    This has helped me during a demo and saved my day 🙂

  10. Frank says:

    Chris, IaaS means Infrastructure is provided to you as a service. That doesn't mean the service provider is suddenly your server admin team. The health of the guest is your responsibility.

  11. Minh says:

    The steps 14 to 18 are confusing and reference the wrong step #

  12. A very helpful article Chris. Thanks for sharing. 🙂

  13. Martin says:

    Please, make Disk options available in the new Azure Portal sooner. I tried dozens of Powershell tutorials but they all were out of date or resulted in a VM which cannot be configured – Settings blade just gets stuck loading forever.

  14. Darren Ford says:

    What appeared to work for us was to Reset the RDP connection and then Stop and Restart the server in question. We did this through the Azure portal.

  15. DEM says:

    Before going through this article in attempts to get your VM back online, I would suggest trying a “Reset Remote Desktop” for your VM.
    In our case we could not access the VM via RDP nor could we ping it from local machine with-in the same network. This VM also hosted a SharePoint Application Server instance and was configured run the bulk of the farms services. Long story short – Log into the new Azure Portal manager > Select the VM in question > In the next blade, select “Reset Remote Desktop” tab. Now, Shut down you machine and Start it back up again. It took a minute or 2 but this corrected the issue for us. Hope this helps and good luck!!
