Azure GPU Tensorflow Step-by-Step Setup


The following guide was developed in collaboration with my Microsoft colleague Christine Matheney and grew out of our work at Oxford and Stanford University.

  • This guide will walk you through running your code on GPUs in Azure.
  • Before we start, it cannot be stressed enough: do not leave the VM running when you are not using it. See the following blog for tips on automating VM shutdown to save costs.
  • The expected time from start to finish is 1-2 hours.
  • The most time-consuming part will be downloading and installing the NVIDIA drivers, CUDA, and TensorFlow. This guide and its repo install TensorFlow 1.0.

FAQ

  • If, as an administrator (Lead TA/RA or academic), you need to grant or remove access for an individual (student), follow the directions here and in the guide on setting up Azure at your institution.
  • Do not install updates using: sudo apt-get install --upgrade. This might break the CUDA driver installation if the kernel is updated.
  • If you need to attach additional storage or a larger disk to your VM, see https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-linux-classic-attach-disk
  • To check available disk space, run df -h to see which disks have free space.
  • Please store your data only on the attached disk. The temporary disk provided on Azure VMs is not suitable for persistent data.
  • Problems connecting (e.g., using SSH) to the VM
    • Try ping <vm’s ip address>
    • Try ssh to the VM
    • Try restarting the VM and/or your local machine
  • If all of the previous steps fail, file an Azure support ticket via http://portal.azure.com
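
As a complement to df -h, the free-space check can also be scripted. The following is a minimal Python sketch using only the standard library; the helper names and the "/" path are illustrative assumptions, so substitute the mount point of your attached disk.

```python
import shutil


def free_space_gib(path="/"):
    """Return the free space, in GiB, of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return usage.free / (1024 ** 3)


def has_room_for(path, required_gib):
    """Return True if at least `required_gib` GiB are free under `path`."""
    return free_space_gib(path) >= required_gib


# "/" is a placeholder; point this at the mount point of your attached disk.
print("Free space on /: {:.1f} GiB".format(free_space_gib("/")))
```

Calling has_room_for before downloading a large dataset is a cheap way to avoid filling the disk partway through a transfer.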

Creating a Microsoft account

  • You should have received an email to your inbox with an invitation to join the Azure subscription from your Azure Administrator.
  • Please follow the instructions using the email address that received this invitation.

Getting started

Logging into Azure portal
  • Once you have created your account, log in to Azure at: http://portal.azure.com
  • After logging in, you should reach the dashboard page.
  • If you have multiple subscriptions (e.g., you previously signed up for a free one), you must select the name of your institution by clicking in the top right corner. If no such option appears, please contact your Azure Admin.

Create a VM

  • Once you are logged in, click on the + on the left. Select Ubuntu Server 16.04 LTS.

  • You will be presented with the VM image details. Simply click Create.

  • Fill in the name, user, etc. for your VM. You must change the storage type from SSD to HDD. You must also use a region where NC or NV is available.

    For reference, NV and NC are available in the regions below.

    Region              SKU
    East US             NV
    North Central US    NV
    South Central US    NV
    South East Asia     NV
    West Europe         NV
    South Central US    NC
    East US             NC

  • Running GPU compute for deep learning on the NV-series is not recommended, according to the GPU team. The bottom line: large GPU compute workloads (like deep learning) should only be run on the NC-series. NV is for visualization and graphics. See this blog for more details on the NV vs NC series.
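
The region list above can be captured as a small lookup, which is handy if you script your deployments. This is only a sketch in Python: the availability data is copied from this guide and may be out of date, and the helper names are hypothetical.

```python
# Region/series availability as listed in this guide; it may be out of date.
GPU_REGIONS = {
    "NV": ["East US", "North Central US", "South Central US",
           "South East Asia", "West Europe"],
    "NC": ["South Central US", "East US"],
}


def regions_for(series):
    """Return the sorted list of regions that offer the given series."""
    return sorted(GPU_REGIONS.get(series.upper(), []))


def is_available(series, region):
    """Case-insensitive check that `region` offers `series`."""
    wanted = region.strip().lower()
    return wanted in (r.lower() for r in GPU_REGIONS.get(series.upper(), []))


print(regions_for("NC"))  # ['East US', 'South Central US']
```

For deep learning you would check is_available("NC", ...) before choosing a region, since NC is the series recommended for compute.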

  • Click View all to see every option, then scroll through the list and select an appropriate NV or NC series server for your workload. If NV or NC does not show up, then you probably chose the wrong region or selected SSD rather than HDD on the previous page (Step 1 Basic).
  • If you do not select an NV/NC option, then you are not using a GPU instance and the setup scripts later will fail.

  • Select the appropriate VM size and click OK.

  • Wait for the configuration to validate and then click OK.

Using the VM

Finding your VM

Log in to http://portal.azure.com, click All resources, and select your VM. Our subscription has many, but yours will only have one if you just followed the setup instructions.

Spinning up your VM

If you just completed the previous part and the VM has finished deploying, then your VM should be running already.

Connecting (SSH) to your VM

Once your VM has started (it may take a few minutes), click Connect and follow the instructions.
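
If SSH does not connect, it can help to check whether the VM's SSH port is reachable before restarting anything. The following is a minimal Python sketch; the function name is hypothetical, port 22 assumes the default SSH configuration, and the IP address in the comment is a documentation placeholder, not a real VM.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


# Example (replace the placeholder with your VM's public IP address):
# ssh_port_open("203.0.113.10")
```

A False result while the portal shows the VM as running suggests a network or firewall issue rather than a problem with your SSH credentials.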

Stopping your VM

  • Once you are done working, stop your VM. See this blog for tips on stopping/shutting down VMs.
  • Make sure your VM is fully stopped. If you see “stopped still incurring compute charges”, you must hit stop again.

Completing CUDA/Tensorflow setup

  • You will need to SSH into your VM.

Installing CUDA and TensorFlow dependencies

There are two scripts that you will need to run, and your VM will need to reboot between running them.

[Step 1]

Navigate to the azure-gpu-setup directory and run the command:

./gpu-setup-part1.sh

This will install some libraries, fetch and install NVIDIA drivers, and trigger a reboot. (The command will take some time to run.)

Wait for your VM to finish restarting before continuing.

[Step 2]

SSH into the VM again. Navigate to the azure-gpu-setup directory again. Run the command:

./gpu-setup-part2.sh

This script installs the CUDA toolkit, cuDNN, and TensorFlow. It also sets the required environment variables. Once the script finishes, run:

source ~/.bashrc

This ensures that the shell will use the updated environment variables. Now, to test that TensorFlow and the GPU are properly configured, run the GPU test script by executing:

python gpu-test.py
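
The contents of gpu-test.py are not reproduced in this guide. As a rough illustration of what such a check can look like, the sketch below lists the GPU devices TensorFlow can see. It is not the repo's script: the function names are hypothetical, and it assumes the TensorFlow 1.x device_lib module is importable.

```python
def gpu_names(devices):
    """Return the names of the entries in `devices` whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def visible_gpus():
    """List the GPUs TensorFlow can see, or return None if TensorFlow is not importable."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return gpu_names(device_lib.list_local_devices())


# Usage on the VM after setup:
# print(visible_gpus())
```

An empty list would mean TensorFlow imported correctly but found no GPU, which usually points at the driver or CUDA installation rather than at TensorFlow itself.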

Filing a support ticket

  • Click on the help icon in the left sidebar and select new support request.

  • Follow the on-screen instructions.


General recommendations

We highly suggest the following for using the GPU instances:

  • Develop and debug your code locally and use scp to copy your code to the VM to run for the long training steps.
  • Save your work often and keep a local copy.
  • Be mindful of when your instance is running and shut it off when you are not actively using it.
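
The copy step in the first recommendation can be wrapped in a small helper so that it is easy to repeat. This is a minimal Python sketch that shells out to scp; the function names, user name, paths, and IP address are all illustrative placeholders.

```python
import subprocess


def build_scp_command(local_path, user, host, remote_path, recursive=True):
    """Build the scp argument list for copying local_path to user@host:remote_path."""
    cmd = ["scp"]
    if recursive:
        cmd.append("-r")  # copy directories recursively
    cmd += [local_path, "{}@{}:{}".format(user, host, remote_path)]
    return cmd


def copy_to_vm(local_path, user, host, remote_path):
    """Run scp and raise CalledProcessError if the copy fails."""
    subprocess.check_call(build_scp_command(local_path, user, host, remote_path))


# Show the command that would be run (placeholder user, host, and paths).
print(" ".join(build_scp_command("./my_project", "azureuser",
                                 "203.0.113.10", "~/my_project")))
```

Keeping the command construction separate from its execution makes it possible to inspect or log the exact command before any files are transferred.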

Comments (6)

  1. Lee Stott says:

    An alternative to this setup is to simply use the prebuilt Azure Data Science Deep Learning VM.
    The data science virtual machine (DSVM) on Azure, based on Windows Server 2012 or Linux, contains popular tools for data science modeling and development activities such as Microsoft R Server Developer Edition, Anaconda Python, Jupyter notebooks for Python and R, Visual Studio Community Edition with Python and R Tools, SQL Server Developer edition, and many other data science and ML tools. Use the DSVM to jump-start modeling and development for your data science project.

    This deep learning toolkit provides GPU versions of mxnet, CNTK, TensorFlow, and Keras for use on Azure GPU N-series instances. These GPUs use discrete device assignment, resulting in performance that is close to bare-metal, and are well-suited to deep learning problems that require large training sets and expensive computational training efforts. The deep learning toolkit also provides a set of sample deep learning solutions that use the GPU, including image recognition on the CIFAR-10 database and a character recognition sample on the MNIST database. GPU instances are currently available in South Central US, East US, West Europe, and Southeast Asia.
    Deploying this toolkit requires access to Azure GPU NC-class instances. See https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning?tab=Overview

  2. Lee Stott says:

    If you require more than 18 servers you can request a quota extension

     Go to https://ms.portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/overview
     Click on + New support request
     Issue type = Quota
     Choose your subscription (if you have more than one)
     Follow the prompts and fill in the fields

  3. Lars Hulstaert says:

    Hi Lee,
    thanks a lot for the excellent guide!

  4. Lee Stott says:

    If you get the following error

    “ERROR: Unable to load the ‘nvidia-drm’ kernel module.
    ERROR: Installation has failed. Please see the file
    ‘/var/log/nvidia-installer.log’ for details. You may find
    suggestions on fixing installation problems in the README available

    Check the following:

    Ensure you have selected an NC virtual machine.

    This error appears when you try to install on an NV machine. As described above, NV machines are for visualisation.

  5. Lee Stott says:

    If you're interested in running TensorFlow on a container/Docker-based infrastructure, the following tutorial and GitHub resources are a perfect starting point: http://wbuchwalter.github.io/container/docker/machine/learning/kubernetes/gpu/training/2016/03/23/gpu-ml-training-cluster/
