The Microsoft Data Science Virtual Machine is an Azure virtual machine (VM) image pre-installed and configured with several popular tools that are commonly used for data analytics and machine learning. The tools included are:
- Microsoft R Server Developer Edition
- Anaconda Python distribution
- Jupyter Notebooks
- Azure Machine Learning
- Cortana Sample Gallery
- Microsoft Azure PowerShell
- Git Bash
- SQL Server Management Studio
- Visual Studio Community Edition
- Power BI desktop
- SQL Server Express edition
- Azure SDK
The Microsoft Data Science Virtual Machine jump-starts your analytics project. It enables you to work on tasks in a variety of languages including R, Python, SQL, and C#. Visual Studio provides an easy-to-use IDE to develop and test your code. The Azure SDK included in the VM allows you to build your applications using various services on Microsoft’s cloud platform.
There are no software charges for this data science VM image. You pay only the Azure usage fees, which depend on the size of the virtual machine you provision with this VM image. More details on the compute fees can be found here.
What is the Data Science Virtual Machine?
It's a custom VM image for Windows or Linux, hosted on the Azure Marketplace
Contains a set of data science, Azure tools/SDKs
- All pre-configured and ready to use
- Pay for cloud hardware usage only. No separate software charges!
- Pointers to gallery, samples, documentation
- Windows and Linux Versions
- Up and running quickly
On-premises analytics desktop replacement in the cloud
- Consistent setup across team, promote sharing and collaboration
- Azure scale and management
- Near-Zero Setup
Data Science Training and education
- Consistent setup, ease of support
- On Demand, Shared / dedicated infrastructure
- Quick, Low friction startup
Dedicated on-demand elastic capacity for large workloads
- Ability to run analytics not feasible on desktop or on shared environment
- Pay for what you use
- E.g., hackathons, competitions
Short Experiments & Evaluation
- Quick, Low Friction startup
- Spend time evaluating instead of setup
- Try before you buy
- Replicate a published experiment
Before you can create a Microsoft Data Science Virtual Machine, you must have the following:
- An Azure subscription: To obtain one, see Get Azure free trial.
- An Azure storage account: To create one, see Create an Azure storage account. Alternatively, the storage account can be created as part of the process of creating the VM if you do not want to use an existing account.
Here are the steps to create an instance of the Microsoft Data Science Virtual Machine:
1. Navigate to the virtual machine listing on Azure Portal.
2. Click on the Create button at the bottom to be taken into a wizard.
3. The wizard used to create the Microsoft Data Science Virtual Machine requires inputs for each of the 5 steps enumerated on the right of this figure. Here are the inputs needed to configure each of these steps:
a. Basics:
- Name: Name of your data science server you are creating.
- User Name: Admin account login id
- Password: Admin account password
- Subscription: If you have more than one subscription, select the one on which the machine will be created and billed
- Resource Group: You can create a new one or use an existing group
- Location: Select the data center that is most appropriate. Usually it is the data center that has most of your data or is closest to your physical location for fastest network access
b. Size:
- Select one of the server types that meets your functional requirement and cost constraints. You can get more choices of VM sizes by selecting “View All”.
c. Settings:
- Disk Type: Choose Premium if you prefer a solid state drive (SSD), else choose “Standard”.
- Storage Account: You can create a new Azure storage account in your subscription or use an existing one in the same Location that was chosen on the Basics step of the wizard.
- Other parameters: In most cases you will just use the default values. You can hover over the informational link for help on the specific fields in case you want to consider the use of non-default values.
d. Summary:
- Verify that all information you entered is correct.
e. Buy:
- Click on Buy to start the provisioning. A link is provided to the terms of the transaction. The VM does not have any additional charges beyond the compute for the server size you chose in the Size step.
The provisioning should take about 10-20 minutes. The status of the provisioning is displayed on the Azure Portal.
Once the VM is created, you can log in to it using remote desktop with the Admin account credentials you created in the Basics step (step a).
Once your VM is created and provisioned, you are ready to start using the tools that are installed and configured on it. There are start menu tiles and desktop icons for many of the tools.
Run the following command from a command prompt on the Data Science Virtual Machine to create your own strong password for the Jupyter notebook server installed on the machine.
c:\anaconda\python.exe -c "import IPython;print IPython.lib.passwd()"
Choose a strong password when prompted.
You will see the password hash in the format "sha1:xxxxxx" in the output. Copy this password hash and replace the existing hash that is in your notebook config file located at: C:\ProgramData\jupyter\jupyter_notebook_config.py with a parameter name c.NotebookApp.password.
You should only replace the existing hash value that is within the quotes. The quotes and the sha1: prefix for the parameter value need to be retained.
Finally, you need to stop and restart the IPython server, which is running on the VM as a Windows scheduled task called "Start_IPython_Notebook". If your new password is not accepted after restarting this task, try restarting the virtual machine.
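The hash replacement described above can also be scripted. The following is a minimal sketch, not the VM's own tooling: the helper name `replace_password_hash` is hypothetical, and it assumes the config file contains a single line of the form `c.NotebookApp.password = u'sha1:...'` (the exact quoting in your file may differ). It swaps only the value inside the quotes and leaves the quotes themselves untouched.

```python
import re

# Assumed shape of the config line: c.NotebookApp.password = u'sha1:...'
# The optional "u" prefix and either quote style are tolerated.
_PASSWORD_LINE = re.compile(
    r"""^(\s*c\.NotebookApp\.password\s*=\s*u?)(['"])(.*?)\2""",
    re.MULTILINE,
)


def replace_password_hash(config_text, new_hash):
    """Return config_text with the quoted password value replaced by new_hash."""
    if not new_hash.startswith("sha1:"):
        raise ValueError("expected a hash beginning with 'sha1:'")

    def _swap(match):
        # Keep the parameter name and the original quote characters.
        prefix, quote = match.group(1), match.group(2)
        return prefix + quote + new_hash + quote

    updated, count = _PASSWORD_LINE.subn(_swap, config_text)
    if count == 0:
        raise ValueError("no c.NotebookApp.password line found")
    return updated
```

To use it, read the contents of C:\ProgramData\jupyter\jupyter_notebook_config.py into a string, pass it through the function together with the hash you copied, and write the result back. Keeping a backup copy of the file first is a sensible precaution.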
Adding Authentication and Additional users
If you want to configure Multiple User/OAuth or AAD connectivity to your Data Science VM
If you wish to use R for your analytics, the VM has Microsoft R Server Developer edition installed. Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R that is supported, scalable and secure. Supporting a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics – exploration, analysis, visualization and modeling. By using and extending open source R, Microsoft R Server is fully compatible with R scripts, functions and CRAN packages, to analyze data at enterprise scale. It also addresses the in-memory limitations of Open Source R by adding parallel and chunked processing of data in Microsoft R Server, enabling users to run analytics on data much bigger than what fits in main memory. An IDE for R is also packaged in the VM that can be accessed by clicking the icon "Revolution R Enterprise 8.0" on the start menu or the desktop. You are free to download and use other IDEs as well such as RStudio.
For development using Python, Anaconda Python distributions 2.7 and 3.5 have been installed. This distribution contains the base Python along with about 300 of the most popular math, engineering and data analytics packages. You can use Python Tools for Visual Studio (PTVS), which is installed within the Visual Studio 2015 Community edition, or one of the IDEs bundled with Anaconda such as IDLE or Spyder. You can launch one of these by searching on the search bar (Win + S key). Note: To point Python Tools for Visual Studio at Anaconda Python 2.7 and 3.5, you need to create a custom environment for each version by navigating to Tools -> Python Tools -> Python Environments, clicking "+ Custom" in the Visual Studio 2015 Community Edition, and setting the environment paths. Anaconda Python 2.7 is installed under C:\Anaconda and Anaconda Python 3.5 is installed under c:\Anaconda\envs\py35. See PTVS documentation for detailed steps.
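Because two Python versions are installed side by side, it can help to confirm which interpreter an IDE or custom environment is actually running. The sketch below uses only the standard library and runs under both Python 2.7 and 3.5; the function name `describe_interpreter` is an illustrative choice, not something shipped on the VM.

```python
import sys


def describe_interpreter():
    """Report which Python interpreter is executing this code."""
    return {
        "executable": sys.executable,  # full path to python.exe
        "version": "%d.%d" % (sys.version_info[0], sys.version_info[1]),
        "prefix": sys.prefix,          # installation or environment root
    }


if __name__ == "__main__":
    for key, value in sorted(describe_interpreter().items()):
        print("%s: %s" % (key, value))
```

If the environment paths are set as described above, the reported prefix should fall under C:\Anaconda for Python 2.7 and under c:\Anaconda\envs\py35 for Python 3.5.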
The Anaconda distribution also comes with a Jupyter notebook, an environment to share code and analysis. A Jupyter notebook server has been pre-configured with Python 2, Python 3 and R kernels. There is a desktop icon named "Jupyter Notebook" to launch the browser to access the Notebook server. If you are on the VM via remote desktop, you can also visit https://localhost:9999/ to access the Jupyter notebook server (Note: Continue if you get any certificate warnings.). We have packaged sample notebooks, one in Python and one in R. You can see the link to the samples on the notebook home page after you authenticate to the Jupyter notebook using the password you created in the earlier step.
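If the browser cannot reach the notebook server, a quick script can tell you whether anything is answering at that address. This is a rough sketch for Python 3 only (for example, the Python 3.5 environment); the function names are hypothetical. It deliberately skips certificate verification, mirroring the "continue past the warning" step above, so it should only be pointed at the local VM.

```python
import ssl
import urllib.error
import urllib.request


def make_local_context():
    """Build an SSL context that skips certificate checks (local use only)."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


def notebook_server_status(url="https://localhost:9999/", timeout=5):
    """Return the HTTP status code, or None if the server is unreachable."""
    try:
        response = urllib.request.urlopen(
            url, timeout=timeout, context=make_local_context()
        )
        return response.getcode()
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status.
        return error.code
    except (urllib.error.URLError, OSError):
        # Nothing listening, connection refused, or a TLS failure.
        return None
```

A result of `None` suggests the server is not running, in which case restarting the "Start_IPython_Notebook" scheduled task is a reasonable next step.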
Visual Studio Community edition is installed on the VM. It is a free version of the popular IDE from Microsoft that you can use for evaluation purposes and for very small teams. You can check out the licensing terms here. Open Visual Studio by double clicking the desktop icon or from the Start menu. You can also search for programs with Win + S and entering “Visual Studio”. Once there, you can create projects in languages like C# and Python. You will also find plugins installed that make it convenient to work with Azure services like Azure Data Catalog, Azure HDInsight (Hadoop, Spark) and Azure Data Lake.
Note: You may get a message stating that your evaluation period has expired. You can enter Microsoft account credentials, or create an account and enter its credentials, to get access to the Visual Studio Community Edition.
A limited version of SQL Server is also packaged with Visual Studio Community edition. You can access SQL Server by launching SQL Server Management Studio. Your VM name will be populated as the Server Name. Use Windows Authentication when logged in as the admin on Windows. Once you are in SQL Server Management Studio, you can create other users, create databases, import data, and run SQL queries.
Several Azure tools are installed on the VM:
- A desktop shortcut to access the Azure SDK documentation.
- AzCopy, used to move data in and out of your Microsoft Azure Storage Account.
- Azure Storage Explorer, used to browse through the objects that you have stored within your Azure Storage Account.
- Microsoft Azure PowerShell, a tool used to administer your Azure resources in the PowerShell scripting language.
To help you build dashboards and great visualizations, the Power BI Desktop has been installed. Use this tool to pull data from different sources, to author your dashboards and reports, and to publish them to the cloud. For information, see the Power BI site.
Note: You will need an Office 365 account to access Power BI.
The Microsoft Web Platform Installer can be used to discover and download other Microsoft development tools. There is also a shortcut to the tool provided on the Microsoft Data Science Virtual Machine desktop.
Here are some next steps to continue your learning and exploration.
- Explore the various data science tools on the data science VM by clicking on the Start menu and checking out the tools listed on the menu.
- Navigate to C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\demoScripts for samples using the RevoScaleR library in R that supports data analytics at enterprise scale.
- Read the article: Ten things you can do on the Data science Virtual Machine
- Learn how to build end to end analytical solutions systematically using the Team Data Science Process
- Visit the Cortana Intelligence Gallery for machine learning and data analytics samples using the Cortana Intelligence Suite. We have also provided an icon on the Start menu and desktop on the virtual machine for easy access.