Utilising the Microsoft Data Science Virtual Machine (DSVM) for your labs

 


 

The Data Science Virtual Machine (DSVM) is a virtual machine image for Windows or Linux, offered on Azure by Microsoft, that provides several essential data science and analytics tools. I previously talked about setting up the DSVM and produced a short walkthrough of getting started at https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/08/30/microsoft-data-science-virtual-machine-for-windows-and-linux-now-available/

Here is the full set of resources available on the DSVM image (as of 9th Sept 2016):

[Screenshot: wp_ss_20160909_0001]


Architectures for analytics environment

The DSVM is offered in both Windows and Linux editions. You only pay the Azure compute charges, the same as you would pay for the base OS virtual machine image, which depend on the number of cores and the amount of memory. The DSVM is an ideal environment for running a training or education programme in data science, since it provides a readily installed and pre-configured set of tools covering a wide range of requirements. It also has SDKs and tools that can connect to various Azure and Microsoft big data services, such as Hadoop (HDInsight), Spark, Azure SQL Data Warehouse, Storage blobs and Azure Data Lake.

There are different ways in which you can architect your analytics environments using the DSVM. The two main approaches are:

· Option 1: One dedicated VM per student

· Option 2: Shared instances.

Dedicated VM per student architecture

In this architecture, you provision one DSVM on Azure for each student. The student has full access to the machine and all the work is stored on that specific VM. Using Git, the code can be archived in a Git source code repo such as GitHub. Optionally, all the VMs can mount a shared Azure disk that contains large shared datasets, to avoid keeping multiple copies of the data. For dedicated instances you can use either the Windows or the Linux edition of the DSVM. If you use Visual Studio or need a comprehensive development IDE, the Windows edition is recommended. The Linux edition does come with Eclipse, Spyder (for Python) and code editors like emacs and gedit, but it does not have the rich set of plugins such as the Hadoop HDInsight tools, the Azure Data Lake tools, or the R and Python tools for Visual Studio.

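Each VM image is fixed at the set of tools and versions that were current when that DSVM release was built. You can still apply updates yourself on a running VM (Windows updates, Visual Studio updates, yum updates, etc.).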

As new data science tools are constantly being added to the DSVM, it is always best to use the latest release of the virtual machine from the Azure Gallery. For example, applying Visual Studio Update 3 over the previous version will take over an hour to download and install. It is faster (and potentially safer) to spin up a new VM, assuming you did not customise it too much or install a lot of other software.

One best practice for DSVM usage we have seen is to treat the VMs as “disposable compute” attached to code repos and to disks / Azure file storage that hold the analytics and developed assets. This way, when a new version of the DSVM image is released, you spin up a new VM, attach the storage and pull the code from repos like GitHub or VSTS.
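The "pull code from repos" half of this practice can be sketched as a small shell function. This is only an illustration: the function name, the repository URL and the working directory are hypothetical, and attaching disks or file shares is not covered here.

```shell
# Sketch: bring course code onto a freshly created (or existing) DSVM.
# The function name and its arguments are illustrative, not from the article.
sync_course_repo() {
    repo_url=$1
    work_dir=$2
    if [ -d "$work_dir/.git" ]; then
        # The repo is already on this VM: fast-forward to the latest commit.
        git -C "$work_dir" pull --ff-only
    else
        # Fresh VM: clone the repo into the working directory.
        git clone "$repo_url" "$work_dir"
    fi
}

# Example call (hypothetical URL), left commented out so nothing is cloned by accident:
# sync_course_repo "https://github.com/example-org/course-notebooks.git" "$HOME/course"
```

Because all state lives in the repo (and on attached storage), the same call works unchanged on a replacement VM built from a newer DSVM image.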

Steps to implement dedicated VM architecture

The following are the high-level steps to create DSVMs with the dedicated VM architecture for your class.

1. Estimate the number of students for the class

2. Use Azure CLI to create multiple VMs in one shot using an ARM template (we suggest a minimum of a 2-core, 7 GB memory instance for each student). A sample ARM template that creates multiple VMs can be found here for Windows and here for Linux. In future we will also provide a graphical tool for this purpose. If students have access to an Azure subscription (a free 30-day trial is available) with rights to create VM instances, they can create their own DSVM instance using ARM templates or the Azure portal GUI, found here for Windows and here for Linux.

3. Optionally, you can use Azure CLI or PowerShell to set a unique login and password for each of the created VM instances.

4. Load the course materials (Jupyter notebooks, scripts, etc.) into a Git repository hosted on a service like GitHub or Visual Studio Team Services.

5. Provide the credentials to the students through an appropriate mechanism (offline for a face to face class or private secure email or through a student registration site). If using Jupyter each student needs to create a Jupyter password on their instance using instructions found here.

6. Students download the course materials by running git clone on the command line or from a tool like Visual Studio. If you are using Jupyter notebooks, the students can import the notebooks into their local Jupyter server by clicking the upload button and specifying the location of the file within the directory where the Git repo is cloned. In other cases, students can open their favourite IDE or editor to work on the code.

7. Students make changes to the notebook as appropriate on their machine. It is a good practice to manage the source code on their own Git repository which can be hosted on GitHub or Visual Studio Team Services using the Git tools on the DSVM.

8. Most of the popular software needed for data science courses is already preloaded on the DSVM. If a specific piece of software or version you need is missing, you (or the student) can install it by logging in as administrator on the VM.
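Step 2 above could look roughly like the following with the current `az` command-line tool. Treat it as a sketch under assumptions: the template file name and the `numberOfInstances` parameter are hypothetical and must match the ARM template you actually use (the VM size would normally be set there too), and the resource group name and region are placeholders. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Print commands by default; set DRY_RUN=0 to really run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Deploy one DSVM per student from an ARM template.
# The parameter name numberOfInstances is an assumption about the template.
deploy_class_vms() {
    resource_group=$1
    location=$2
    student_count=$3
    template_file=$4

    run az group create --name "$resource_group" --location "$location"
    run az deployment group create \
        --resource-group "$resource_group" \
        --template-file "$template_file" \
        --parameters numberOfInstances="$student_count"
}

# Placeholder values: 30 students, hypothetical template file name.
deploy_class_vms dsvm-class-rg westeurope 30 multi-dsvm-template.json
```

Keeping the whole class in one resource group makes it easy to remove everything at the end of the course.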

Shared Instances architecture

In this architecture, you provision a small number of DSVMs on Azure for the class and create a separate user account for each student. The students have their own dedicated area to store their work and use Git to archive and manage code revisions. Students access the course either by logging into the OS or, if you are using only R or Python for the class, through a Jupyter notebook. For the shared instances architecture we recommend the Linux DSVM, since it comes with JupyterHub, which is a multi-user version of Jupyter notebooks. If you need them, you can install RStudio or RStudio Server (these are not pre-built into the Linux DSVM).

Steps to implement shared instances architecture

The following are the high-level steps to create DSVMs with the shared instances architecture for your class.

1. Estimate the number of students for the class. Plan for about 25 students per VM instance (we recommend 8 cores and 56 GB of memory for each instance; the number may vary based on the exact workload and course content).

2. Use Azure CLI to create multiple VMs in one shot using an ARM template. A sample ARM template that creates multiple VMs can be found here for Windows and here for Linux.  If you are creating just one or two instances you can also use the Azure portal GUI found here for Windows and here for Linux.

3. Log in to the VMs and create a unique local, non-admin OS login account and password for each student. Here is a sample bash script:

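 # Create 40 local non-admin users with random passwords. Run as root.
 # Note: passwd --stdin works on Red Hat style (yum-based) distributions.
 for i in {1..40} # 40 users
 do
   # Random 4-hex-digit suffix; retry if that user name already exists.
   u=$(openssl rand -hex 2)
   while id "user$u" >/dev/null 2>&1; do u=$(openssl rand -hex 2); done
   useradd "user$u"
   # Random 10-hex-digit password, set non-interactively.
   p=$(openssl rand -hex 5)
   echo "$p" | passwd "user$u" --stdin
   # Record the credentials; protect or delete this file once distributed.
   echo "user$u, $p" >> usersinfo.csv
 done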

4. Load the course materials (Jupyter notebooks, scripts, etc.) into a Git repository hosted on a service like GitHub or Visual Studio Team Services.

5. Provide the VM OS login credentials to the students through an appropriate mechanism (offline for a face to face class or private secure email or through a student registration site).

6. Students download the course materials by running git clone on the command line. If you are using notebooks, the students can log in to their Jupyter instance, hosted on JupyterHub on the VM, using the provided OS credentials. Once they are logged into Jupyter they can import the notebooks into their Jupyter server by clicking the upload button and specifying the location of the file within the directory where the Git repo is cloned. In other cases, students can open their favourite IDE or editor to work on the code.

7. Students make changes to the notebook as appropriate on their machine. It is a good practice to manage the source code on their own Git repository which can be hosted on GitHub or Visual Studio Team Services using the Git tools on the DSVM.

8. Most of the popular software needed for data science courses is already preloaded on the DSVM. If a specific piece of software or version you need is missing, you can install it by logging in as administrator on the VM.
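The sizing in step 1 is a simple round-up division. As a small worked example, assuming the planning figure of about 25 students per VM given above (the function name is only illustrative):

```shell
# Number of shared VMs needed for a class, rounding up.
# The second argument (students per VM) defaults to the planning figure of 25.
vms_needed() {
    students=$1
    per_vm=${2:-25}
    echo $(( (students + per_vm - 1) / per_vm ))
}

vms_needed 60   # prints 3
vms_needed 25   # prints 1
```

If your workload is heavier than typical notebook exercises, pass a smaller students-per-VM figure as the second argument.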

Be effective with Azure Costs

To stop being billed compute usage charges for a VM instance when it is not in use, shut down the VM from the Azure portal or Azure CLI. This can also be an effective mechanism to archive the whole course, including the work the students have done. (A small storage fee for the VM image will still apply in this case, even though no compute usage is being charged. To avoid the storage fee as well, you will have to delete the VM.)
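With the current `az` command-line tool, shutting down the class VMs so that compute billing stops might look like the sketch below. It uses `az vm deallocate`, which releases the compute resources while keeping the disks; the resource group and VM names are placeholders. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Print commands by default; set DRY_RUN=0 to really run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Deallocate the named VMs in one resource group.
# Usage: stop_class_vms <resource-group> <vm-name>...
stop_class_vms() {
    resource_group=$1
    shift
    for vm_name in "$@"; do
        run az vm deallocate --resource-group "$resource_group" --name "$vm_name"
    done
}

# Placeholder resource group and VM names.
stop_class_vms dsvm-class-rg dsvm-01 dsvm-02
```

The same loop with `az vm start` brings the machines back before the next session.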

Resources

Windows DSVM:

Marketplace URL: https://azure.microsoft.com/en-us/marketplace/partners/microsoft-ads/standard-data-science-vm/

Documentation: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-provision-vm/

Article/Tutorial - Ten things you can do on the DSVM: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-vm-do-ten-things/

Linux DSVM:

Marketplace URL: https://azure.microsoft.com/en-us/marketplace/partners/microsoft-ads/linux-data-science-vm/

Documentation: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-linux-dsvm-intro/

Videos: https://channel9.msdn.com/blogs/Cloud-and-Enterprise-Premium/Inside-the-Data-Science-Virtual-Machine  (Webinar – 1 Hour)

Feedback

It would be nice to hear how you're using DSVM instances or any other custom VM images in your teaching.