Simplifying The Use of Azure Data Science Virtual Machine with R

This post is authored by Le Zhang, Data Scientist, and Graham Williams, Director of Data Science at Microsoft.

Azure Data Science Virtual Machine (DSVM) and AzureDSVM package

Azure Data Science Virtual Machine (DSVM) is a curated Azure VM image, preinstalled and configured with popular tools for data analytics and machine learning, including Microsoft R Server Developer Edition, the Anaconda Python distribution, and Jupyter Notebook (with R and Python kernels). The DSVM is well suited to experimental analytics on a single low-end VM, collaborative prototyping of machine learning proof-of-concepts, and operationalizing an end-to-end data science or AI workflow.

For R-based data scientists, data engineers, and architects, it is beneficial to use and operate DSVMs for various application scenarios with minimal effort. AzureDSVM (version 0.2.0) is an R package that helps R users to directly manage the DSVM from within an R session. It provides a comprehensive set of operational functions to:

  • Deploy, operate, use, and destroy a DSVM with customized settings such as machine name, machine size (compute- or memory-optimized general-purpose CPU, Nvidia K80/M60 GPU, etc.), operating system (Windows Server 2016, Ubuntu 16.04, or CentOS), and authentication method (public key or password);
  • Enjoy all the benefits of a Windows/Linux DSVM, e.g., tools for data science work such as the R, Python, and Julia programming languages, SQL Server, and Visual Studio with RTVS; a remote working environment via the RStudio Server or Jupyter Notebook interface; and machine learning and artificial intelligence packages such as Microsoft CNTK, MXNet, and XGBoost;
  • Configure Microsoft R Server in a convenient one-box setup on the remote DSVM and interact seamlessly with the remote R Server session using "mrsdeploy" functions;
  • Install extensions after deployment to customize the system environment, reinstall or uninstall software, etc.;
  • Deploy a collection of heterogeneous DSVMs for a group of data scientists;
  • Scale out across multiple DSVMs by forming them into a cluster for parallel/distributed computation with RevoScaleR computing contexts;
  • Monitor data consumption and estimate the expense of using DSVM(s) at hourly or daily aggregation granularity.

Install and use AzureDSVM

AzureDSVM is an open source R package hosted on GitHub. To install it, run the following code (this requires the devtools package):

devtools::install_github("Azure/AzureDSVM")

AzureDSVM relies on AzureSMR, which offers methods to authenticate against an authorized application on Azure Active Directory and to manage a selected set of Azure components. The same preliminary setup steps for AzureSMR also apply to AzureDSVM. Detailed instructions can be found here.
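As a rough sketch of that preliminary step, the snippet below builds the authentication context with AzureSMR's createAzureContext(). The environment variable names (AZURE_TENANT_ID and so on) are assumptions made for illustration and are not mandated by either package; the three values come from the Azure Active Directory application registered during the AzureSMR setup.

```r
# Read the Active Directory application credentials from environment
# variables. The variable names are hypothetical; any secret store works.
creds <- list(
  tenantID = Sys.getenv("AZURE_TENANT_ID"),
  clientID = Sys.getenv("AZURE_CLIENT_ID"),
  authKey  = Sys.getenv("AZURE_AUTH_KEY")
)

# TRUE only when all three values are non-empty.
have_creds <- all(nzchar(unlist(creds)))

context <- NULL
if (have_creds && requireNamespace("AzureSMR", quietly = TRUE)) {
  # createAzureContext() returns the context object that is later passed
  # as the first argument to AzureDSVM functions.
  context <- do.call(AzureSMR::createAzureContext, creds)
} else {
  message("Credentials or AzureSMR not available; skipping authentication.")
}
```

Keeping the credentials out of the script itself avoids committing secrets to version control, which is the main reason for reading them from the environment in this sketch.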

Get started

After installation, one can try out the sample code from the tutorials in the provided vignettes. For example, the following deploys an Ubuntu Linux DSVM with the given specifications, such as the authentication method and VM size. Here, context is the active authentication context created with AzureSMR; the other arguments specify the DSVM itself.

deployDSVM(context,
           resource.group="<resource_group>",
           location="<location>",
           hostname="<dsvm_name>",
           username="<user_name>",
           size="<dsvm_size>",
           pubkey="<public_key>")

We can stop or delete a DSVM when it is no longer required. The stop operation in AzureDSVM both shuts down the DSVM and deallocates it, so compute charges for the DSVM no longer accrue.
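A minimal sketch of this clean-up step follows. The wrapper release_dsvm() is a hypothetical helper introduced here for illustration. It assumes that AzureDSVM's operateDSVM() and deleteDSVM() take the context, the resource group, and the machine name as their first three arguments, and that "Stop" is an accepted value of operation; it is worth confirming this against the package help pages before relying on it.

```r
# Hypothetical helper: stop (and thereby deallocate) a DSVM, and
# optionally delete it afterwards.
release_dsvm <- function(context, resource_group, hostname, delete = FALSE) {
  # Assumed signature: operateDSVM(context, <resource group>, <name>, operation)
  AzureDSVM::operateDSVM(context, resource_group, hostname, operation = "Stop")

  if (delete) {
    # Assumed signature: deleteDSVM(context, <resource group>, <name>)
    AzureDSVM::deleteDSVM(context, resource_group, hostname)
  }

  invisible(hostname)
}

# Example call (not run here, since it needs a live authentication context):
# release_dsvm(context, "<resource_group>", "<dsvm_name>", delete = TRUE)
```

Stopping without deleting keeps the machine available for a later restart, so delete defaults to FALSE in this sketch.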

Use case scenarios of DSVM by using AzureDSVM

Many application scenarios can benefit from the methods provided by AzureDSVM which allow functional and elastic operation of the DSVMs from within an R session. To illustrate, the following are some representative design patterns we have identified in customer project engagements, which target specific scenarios regardless of business domain or context.

  1. Data scientists who use R can easily interact with a single remote DSVM from within an R session to prototype and experiment with a proof-of-concept, using a variety of analytical tools/software (Microsoft R Server, CNTK, xgboost, etc.) and hardware configurations (compute- or memory-optimized general-purpose CPU, K80 or M60 GPU, etc.). This is useful for self-learning data science and AI tools/algorithms, prototype development, pre-production testing, etc., as one can resize (scale up and scale down), operate/compute on, or destroy the DSVM as desired.
  2. Big data analytics can be scaled out across a cluster of high-performance DSVMs with RevoScaleR parallel computing contexts. This can be applied to tasks such as large-scale simulation, model ensembling, distributed machine learning, and embarrassingly parallel computation. Note that it is preferable to run large-scale (> 1000 nodes) distributed computation with Azure Batch Service by using the doAzureParallel package, while clustering DSVMs with the AzureDSVM package provides a flexible approach for modest-scale distributed computation.
  3. Architects or administrators can also use the package to create collaborative environments in which resources such as storage are shared among a group of data scientists, each of whom owns one or more DSVMs for specific tasks. Applications of this scenario include co-development of data science projects and hands-on workshops on data science or AI.
  4. For more sophisticated use cases, an end-to-end data science or AI development pipeline can be constructed, in which each of the deployed DSVMs serves a task-specific role (e.g., data processing, model building, web server hosting, etc.). This is often seen in enterprise-grade applications where a prototype or experimental solution needs to be operationalized at scale.
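As a hedged illustration of the cluster pattern (item 2), the sketch below wraps a call to AzureDSVM's deployDSVMCluster(). The wrapper deploy_cluster() is hypothetical, and the argument names authen and count are assumed from the package vignettes rather than confirmed by this post, so they should be checked against ?deployDSVMCluster before use.

```r
# Hypothetical helper: deploy `count` identical Linux DSVMs that share a
# base host name, authenticating with an SSH public key.
deploy_cluster <- function(context, resource_group, location,
                           base_name, username, pubkey, count = 3) {
  # Argument names other than those shown for deployDSVM() in this post
  # (authen, count) are assumptions.
  AzureDSVM::deployDSVMCluster(context,
                               resource.group = resource_group,
                               location       = location,
                               hostname       = base_name,
                               username       = username,
                               authen         = "Key",
                               pubkey         = pubkey,
                               count          = count)
}

# Example call (not run here, since it needs a live authentication context):
# deploy_cluster(context, "<resource_group>", "<location>",
#                "<base_name>", "<user_name>", "<public_key>", count = 4)
```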

In a real-world use case, the four patterns are commonly mixed to form a more sophisticated architecture. For more information, see the illustrative examples of Flight Delay Prediction and Solar Panel Power Forecasting. The former exhibits an end-to-end development pipeline built on top of a set of heterogeneous DSVMs, and the latter shows how to configure a DSVM for deep learning with Microsoft Cognitive Toolkit in R. Note that the DSVM is designed mainly for prototyping and experimentation, so once an established pipeline is moved into production, DSVM instances can be replaced with other suitable Azure components (e.g., HDInsight if a Hadoop/Spark cluster is needed). The replacement can be conveniently achieved with AzureSMR.

Le Zhang, Graham Williams