Azure Machine Learning and Python

04/22/2016

This week a few colleagues and I gave a number of guest lectures at UK universities, focusing on the new Cortana Intelligence Suite, its components, and how students could use this technology in academic projects or hackathons.

Amy Nicholson from the UK DX Evangelist team, who leads on Azure Machine Learning and Data Science, has given a number of presentations and demos on Azure Machine Learning. You can find some of Amy’s presentations on her GitHub account at https://github.com/amykatenicho/AzureMachineLearningResources

One of the most common questions Amy has had from students and educators about Azure Machine Learning concerns the use of Python vs R in machine learning experiments.

This blog post covers some of the key areas around Python and Azure Machine Learning.

Can Python be used in Azure Machine Learning?

Well, the good news is YES, you can use Python within Azure Machine Learning Studio. The primary interface to Python in Azure Machine Learning Studio is the Execute Python Script module. This module allows a data scientist to incorporate existing Python code into cloud-hosted machine learning workflows in Azure Machine Learning and to seamlessly operationalize them as part of a web service. The Python script module interoperates naturally with other modules in Azure Machine Learning and can be used for a range of tasks, from data exploration to pre-processing, feature extraction, evaluation, and post-processing of the results. The backend runtime used for execution is based on Anaconda, a well-tested and widely-used Python distribution. This makes it simple for users to onboard existing code assets into the cloud.

Inputs to the Python module are exposed as Pandas data frames. More information on Pandas and how it can be used to manipulate data effectively and efficiently can be found in Python for Data Analysis (Sebastopol, CA: O'Reilly, 2012) by W. McKinney. The entry-point function must return a single Pandas data frame packaged inside a Python sequence such as a tuple, list, or NumPy array. The first element of this sequence is then returned in the first output port of the module.

The Anaconda environment installed in Azure Machine Learning includes packages such as:

NumPy
SciPy
scikit-learn

These can be effectively used for various data processing tasks in a typical machine learning pipeline.

The module accepts up to three inputs and produces up to two outputs, just like its R analogue, the Execute R Script module.

The Python code to be executed is entered into the parameter box as a specially-named entry-point function called azureml_main.
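As a rough illustration of the contract described above, a minimal entry point might look like the sketch below. The parameter names follow the Param comments in the module's script template; the row_count column and the sample data are made up purely for illustration.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point: receives up to two data frames, returns a sequence."""
    # Hypothetical transformation: tag every row with the total row count.
    result = dataframe1.copy()
    result["row_count"] = len(result)
    # The trailing comma makes this a one-element tuple; its first element
    # is what ends up on the module's first output port.
    return result,


if __name__ == "__main__":
    # Local smoke test with made-up data.
    sample = pd.DataFrame({"x": [1, 2, 3]})
    (out,) = azureml_main(sample)
    print(out)
```

Returning a bare data frame instead of a sequence is an easy mistake to make, which is why the sketch spells out the trailing comma.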

What are the key design principles used to implement the Execute Python Script module?

Must be idiomatic for Python users. Most Python users factor their code as functions inside modules, so putting a lot of executable statements in a top-level module is relatively rare. As a result, the script box also takes a specially named Python function as opposed to just a sequence of statements. The objects exposed in the function are standard Python library types such as Pandas data frames and NumPy arrays.
Must have high fidelity between local and cloud executions. The backend used to execute the Python code is based on Anaconda 2.1, a widely-used cross-platform scientific Python distribution. It comes with close to 200 of the most common Python packages. Therefore, a data scientist can debug and qualify their code in a local, Azure Machine Learning-compatible Anaconda environment using existing development tools such as IPython notebook or Python Tools for Visual Studio, and then run it as part of an Azure Machine Learning experiment with high confidence. Further, the azureml_main entry point is a vanilla Python function and can be authored without Azure Machine Learning specific code or the SDK installed.
Must be seamlessly composable with other Azure Machine Learning modules. The Execute Python Script module accepts, as inputs and outputs, standard Azure Machine Learning datasets. The underlying framework transparently and efficiently bridges the Azure Machine Learning and Python runtimes (supporting features such as missing values). Python can therefore be used in conjunction with existing Azure Machine Learning workflows, including those that call into R and SQLite. One can therefore envisage workflows that:

use Python and Pandas for data pre-processing and cleaning,
feed the data to a SQL transformation, joining multiple datasets to form features,
train models using the extensive collection of algorithms in Azure Machine Learning, and
evaluate and post-process the results using R.
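Because azureml_main is a plain Python function, the pre-processing step of a workflow like this can be tried out locally first. The sketch below shows one way that might look. The cleaning rules (drop fully empty rows, fill numeric gaps with the column median) and the sample columns are assumptions chosen for illustration, not something the module prescribes.

```python
import numpy as np
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    # Drop rows in which every value is missing.
    cleaned = dataframe1.dropna(how="all").copy()
    # Fill the remaining gaps in numeric columns with each column's median.
    numeric_cols = cleaned.select_dtypes(include=[np.number]).columns
    medians = cleaned[numeric_cols].median()
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(medians)
    return cleaned,


if __name__ == "__main__":
    # Made-up data: the last row is entirely missing.
    raw = pd.DataFrame({
        "age": [21.0, np.nan, 35.0, np.nan],
        "city": ["Leeds", "York", None, None],
    })
    (out,) = azureml_main(raw)
    print(out)
```

Nothing in the function refers to Azure Machine Learning, so the same body can be pasted into the script box once it behaves as expected locally.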

What are the Limitations of Python within Azure ML?

The Execute Python Script module currently has the following limitations:

Sandboxed execution. The Python runtime is currently sandboxed and, as a result, does not allow access to the network or to the local file system in a persistent manner. All files saved locally are isolated and deleted once the module finishes. The Python code cannot access most directories on the machine it runs on, the exception being the current directory and its sub-directories.
Lack of sophisticated development and debugging support. The Python module currently does not support IDE features such as IntelliSense and debugging. Also, if the module fails at runtime, the full Python stack trace is available, but it must be viewed in the output log for the module. We currently recommend that users develop and debug their Python scripts in an environment such as IPython and then import the code into the module.
Single data frame output. The Python entry point is only permitted to return a single data frame as output. It is not currently possible to return arbitrary Python objects such as trained models directly back to the Azure Machine Learning runtime. As with Execute R Script, which has the same limitation, it is, however, possible in many cases to pickle objects into a byte array and then return that inside a data frame.
Inability to customize Python installation. Currently, the only way to add custom Python modules is via the zip file mechanism described below. While this is feasible for small modules, it is cumbersome for large modules (especially those with native DLLs) or a large number of modules.
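To make the pickling workaround for the single data frame output a little more concrete, here is a minimal sketch. The helper names pack_object and unpack_object, the column name "pickled", and the dictionary standing in for a trained model are all hypothetical. Whether raw bytes survive the conversion back to an Azure Machine Learning dataset is not confirmed here, so treat this as a local illustration of the idea only.

```python
import pickle

import pandas as pd


def pack_object(obj):
    """Pickle obj and wrap the resulting bytes in a one-cell data frame."""
    payload = pickle.dumps(obj, protocol=2)
    return pd.DataFrame({"pickled": [payload]})


def unpack_object(frame):
    """Reverse of pack_object: read the cell and unpickle it."""
    return pickle.loads(frame["pickled"].iloc[0])


def azureml_main(dataframe1=None, dataframe2=None):
    # Stand-in for a trained model: the per-column means of the input.
    model = {"means": dataframe1.mean().to_dict()}
    return pack_object(model),


if __name__ == "__main__":
    data = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
    (packed,) = azureml_main(data)
    print(unpack_object(packed))
```

If the byte payload turns out not to round-trip cleanly, encoding it as text (for example with base64) before placing it in the data frame may be a safer variation.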

Another question Amy has had is:

Can you import existing Python Script Modules?

Well, again the answer is yes.

A common use-case for many data scientists is to incorporate existing Python scripts into their Machine Learning experiments.

Instead of concatenating and pasting all the code into a single script box, the Execute Python Script module accepts a third input port to which a zip file that contains the Python modules can be connected.

The file is then unzipped by the execution framework at runtime and the contents are added to the library path of the Python interpreter. The azureml_main entry point function can then import these modules directly.

In this sample experiment, the user-defined Python code is uploaded as a zip file and imported by the script:

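 # This script MUST include the following function,
 # which is the entry point for this module.
 # Param<dataframe1>: a pandas.DataFrame
 # Param<dataframe2>: a pandas.DataFrame
  
 def azureml_main(dataframe1=None, dataframe2=None):
     import pandas as pd
     import Hello  # module unpacked from the zip file on the third input port
     Hello.print_hello("world")
     # Return a sequence (here a one-element tuple) holding the output data frame
     return pd.DataFrame(["Output"]),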

The module output window simply shows that the zip file has been unpackaged and the function print_hello has been run.
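The article does not show the contents of the Hello module, so the sketch below invents a plausible Hello.py, packs it into a zip file, and then imitates locally what the execution framework is described as doing: unzip the archive and add its contents to the interpreter's library path. The file name Hello.zip and the helper names build_zip and load_like_runtime are assumptions for illustration.

```python
import os
import sys
import tempfile
import zipfile

# Hypothetical contents of Hello.py; the real file is not shown in the article.
HELLO_SOURCE = '''
def print_hello(name):
    print("Hello, " + name)
'''


def build_zip(target_dir):
    """Write Hello.py into a zip archive and return the archive path."""
    zip_path = os.path.join(target_dir, "Hello.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("Hello.py", HELLO_SOURCE)
    return zip_path


def load_like_runtime(zip_path, extract_dir):
    """Imitate the described behaviour: unzip, then extend the import path."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
    sys.path.insert(0, extract_dir)


if __name__ == "__main__":
    work_dir = tempfile.mkdtemp()
    archive_path = build_zip(work_dir)
    load_like_runtime(archive_path, os.path.join(work_dir, "unzipped"))
    import Hello
    Hello.print_hello("world")
```

In the real experiment only the zip file itself would be uploaded as a dataset and connected to the third input port; the unzip and path handling happen on the service side.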

Can Plots be Visualised?

Plots created using Matplotlib can be returned by the Execute Python Script module and visualized in the browser. However, plots are not automatically redirected to images as they are when using R, so the user must explicitly save any plots to PNG files if they are to be returned to Azure Machine Learning.

To generate images from Matplotlib, you must complete the following procedure:

switch the backend to “AGG” from the default Qt-based renderer
create a new figure object
get the axis and generate all plots into it
save the figure to a PNG file

The code below creates a scatter plot matrix using the scatter_matrix function in Pandas.

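 def azureml_main(dataframe1):
     import matplotlib
     # Switch to the non-interactive AGG backend before importing pyplot
     matplotlib.use("agg")
  
     # In newer pandas releases this function lives in pandas.plotting
     from pandas.tools.plotting import scatter_matrix
     import matplotlib.pyplot as plt
  
     # Create a new figure
     fig = plt.figure()
     ax = fig.gca()
  
     # Plot into the specified axis
     scatter_matrix(dataframe1, ax=ax)
  
     # Save the figure to an image file
     fig.savefig("Scatter.png")
     # Return the data frame inside a sequence, as the module requires
     return dataframe1,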

In an experiment, plots saved in this way are returned via the second output port of the module. It is also possible to return multiple figures by saving them into different images; the Azure Machine Learning runtime picks up all images and concatenates them for visualization.
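As a sketch of the multiple-figure case, the code below saves one histogram per numeric column, each to its own PNG file. The helper name save_histograms and the hist_<column>.png naming scheme are assumptions for illustration; the article only states that every saved image is picked up by the runtime.

```python
import os

import matplotlib
matplotlib.use("agg")  # select the non-interactive backend before importing pyplot
import matplotlib.pyplot as plt


def save_histograms(frame, out_dir="."):
    """Save one histogram PNG per numeric column and return the file paths."""
    paths = []
    for column in frame.select_dtypes(include="number").columns:
        fig = plt.figure()
        ax = fig.gca()
        frame[column].plot(kind="hist", ax=ax, title=str(column))
        path = os.path.join(out_dir, "hist_{0}.png".format(column))
        fig.savefig(path)
        plt.close(fig)  # release the figure once it is on disk
        paths.append(path)
    return paths


def azureml_main(dataframe1=None, dataframe2=None):
    # Images go to the current directory, as in the scatter plot example.
    save_histograms(dataframe1)
    return dataframe1,
```

Closing each figure after saving keeps memory use flat when a data frame has many columns, which seems a sensible precaution in a sandboxed runtime.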

Over the coming months, we expect to provide additional functionality to the Execute Python Script module, such as the ability to train and operationalize models in Python, and to add better support for developing and debugging code in Azure Machine Learning Studio. For more information, see the Python Developer Center.

Resources

Azure Machine Learning Blog: https://blogs.technet.com/b/machinelearning

Analytics and Cortana Intelligence Suite: https://azure.microsoft.com/en-us/documentation/learning-paths/cortana-analytics-process

Machine learning tutorial: https://github.com/amykatenicho/AzureMachineLearning.git

Azure Educator Grant for educators and students wishing to use Azure Machine Learning: https://aka.ms/azureforeducation
