Choosing an Azure Machine Learning Algorithm

When getting started with Azure Machine Learning, the hardest part for many developers is staring down the list of available algorithms (there are currently 25 of them) and trying to figure out which one would work best. In this blog post, I will provide some resources to help you choose the right algorithm.

At a high level, there are currently 4 categories of algorithms available in Azure Machine Learning. 

  1. Clustering
  2. Regression
  3. Classification
  4. Anomaly Detection

Let’s briefly walk through each of these. 

Clustering
Scenario: Group similar toys together for use in a gift recommendation service.

Clustering is a type of unsupervised learning, meaning we don’t have labeled examples or data mappings in advance for it to learn from.  A clustering algorithm can find patterns in your data to form groupings of like data.  
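
To make the idea concrete, here is a minimal k-means sketch in plain Python. It is only an illustration of how a clustering algorithm forms groups from unlabeled data, not the implementation Azure Machine Learning uses. The toy data, the two features (recommended age and price), and the naive "first k points" initialization are all assumptions I made up for the example.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Very small k-means: returns (centroids, cluster index per point)."""
    # Naive, deterministic initialization (an assumption for this sketch):
    # use the first k points as the starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments

# Hypothetical toys: (recommended age in years, price in dollars).
# Note that there are no labels here, only features.
toys = [(2, 10), (3, 12), (2, 15), (10, 60), (11, 55), (12, 65)]
centroids, groups = k_means(toys, k=2)
print(groups)  # one cluster index per toy
```

Toys that land in the same group could then be recommended together. In practice you would also scale the features so that price does not dominate the distance calculation.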

Regression
Scenario: Given data about houses whose prices are known (square footage, zip code, number of bedrooms, number of bathrooms, and price), predict the prices of other houses from their square footage, zip code, number of bedrooms, and number of bathrooms.

Regression is a type of supervised learning, meaning you train the model with data that includes features (inputs) and a label (output), and the algorithm learns the relationship between them.

With supervised learning, you need data with labeled examples.  For example, let’s say I had really simple data like this (eliminating the number of bedrooms and bathrooms from the example above):

SquareFootage  ZipCode  Price
2000           48075    200,000
3000           48075    300,000
4000           48075    400,000
5000           48075    500,000

In this example, square footage and zip code are my features (inputs) and price is my label (output).  I could train a model on some data like the above, and then use the trained model to predict prices, given only a square footage and zip code. 
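
As a rough sketch of what "train, then predict" means, the snippet below fits a straight line to the sample data with ordinary least squares in plain Python. It is an illustration only, not one of the Azure Machine Learning regression modules. I am also assuming we can ignore zip code here, since it is the same for every row in the sample and so carries no information.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training data from the table above: feature = square footage, label = price.
square_footage = [2000, 3000, 4000, 5000]
price = [200_000, 300_000, 400_000, 500_000]

slope, intercept = fit_line(square_footage, price)

# Predict the price of a house that is not in the training data.
predicted = slope * 3500 + intercept
print(f"${predicted:,.0f}")  # $350,000
```

The sample data is perfectly linear ($100 per square foot), which is why the prediction comes out so cleanly. Real housing data would be noisier and would need more features.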

Classification
Scenario: Predict whether a woman will develop breast cancer or not, given her health history and other medical information.

Classification is also a type of supervised learning, like regression.  So again, using labeled data with existing mappings, we can create a predictive model.  The difference between classification and regression is that classification categorizes into discrete buckets (like yes or no on whether a woman will develop breast cancer), and regression predicts values on a continuum (so in the example above, it could predict that a house of 3500 square feet in the 48075 zip code would cost $350,000, even though that is not given as a sample price in the training data). 

There is support for two-class and multiclass classifiers.  A two-class classifier will predict between two options (yes or no on whether a woman will develop breast cancer, whether someone is lying or telling the truth, whether someone survived or died on the Titanic, etc.).  A multiclass classifier can predict between 3 or more categories (which basketball team will win the NCAA March Madness tournament, which A-Z letter a handwritten sample corresponds to, etc.). 
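
Here is a small sketch of a classifier in plain Python, to show that the output is a discrete label rather than a number on a continuum. It uses a nearest-centroid rule, which I picked only because it is short; it is not one of the Azure Machine Learning classification modules. The feature values and the "yes"/"no" labels are made up for illustration.

```python
from math import dist

def train_nearest_centroid(features, labels):
    """Return {label: centroid} computed from the labeled examples."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    return {
        y: tuple(sum(v) / len(rows) for v in zip(*rows))
        for y, rows in grouped.items()
    }

def predict(model, x):
    """Pick the label whose centroid is closest to x."""
    return min(model, key=lambda label: dist(x, model[label]))

# Hypothetical labeled data: two numeric features per example.
features = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

model = train_nearest_centroid(features, labels)
print(predict(model, (0.9, 1.1)))  # no
print(predict(model, (2.9, 3.1)))  # yes
```

The same code handles the multiclass case: pass in three or more distinct labels and `predict` will return whichever one is closest.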

Anomaly Detection
Scenario: Predict that something is wrong in an oil or gas pipeline due to unusual data (flow rate, temperature, etc.) outside of the norm.

Anomaly detection is a specialized form of supervised learning, where we are accounting for the fact that one bucket is very rare.  For example, consider the problem where we are trying to predict if a given credit card transaction is fraudulent or not.  The database probably has a huge number of non-fraudulent transactions to train from, and (hopefully) only a very small number of known fraudulent transactions relative to the size of the entire dataset.  In anomaly detection, we train on the “normal” dataset and can detect unusual patterns outside of that norm. 
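
The "train on normal, flag what falls outside it" idea can be sketched with a simple z-score rule in plain Python. This is a stand-in I chose for brevity, not what the Azure Machine Learning anomaly detection modules do internally. The flow-rate readings and the threshold of 3 standard deviations are assumptions for the example.

```python
from statistics import mean, stdev

def fit_normal_profile(readings):
    """'Train' on normal data only: remember its mean and standard deviation."""
    return mean(readings), stdev(readings)

def is_anomaly(value, profile, threshold=3.0):
    """Flag a value that sits more than `threshold` standard deviations from the mean."""
    mu, sigma = profile
    return abs(value - mu) / sigma > threshold

# Hypothetical pipeline flow-rate readings taken during normal operation.
normal_flow = [100.2, 99.8, 100.5, 99.9, 100.1, 100.3, 99.7, 100.0]
profile = fit_normal_profile(normal_flow)

print(is_anomaly(100.4, profile))  # False: within the usual range
print(is_anomaly(92.0, profile))   # True: far outside the norm
```

Notice that no examples of "bad" readings were needed to build the profile, which is what makes this approach useful when the rare case is hard to collect.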

Now what?

Hopefully, the problem you want to solve tells you which of the categories described above fits best. Now the problem becomes figuring out which specific algorithm to use within that category (especially if you are doing a regression or classification problem, since those categories have lots of options in Azure Machine Learning).

There are great resources available to help with this.  I recommend that you first look at the Azure Machine Learning Cheat Sheet.  It is a useful flowchart that helps you analyze your data and figure out which algorithm may perform best.  However, keep in mind that there is some art to this – definitely try multiple algorithms and see which one gives the best results for your particular data set.  There are other factors to keep in mind when choosing an algorithm as well: do you want to be able to incrementally update your trained model with new data?  Is accuracy or training time more important to you?  How large is your data set and how many features does each data point have?  This article does a great job discussing some of the nuances. 

Download the Azure Machine Learning Cheat Sheet from https://aka.ms/AzureMachineLearningCheatSheet


Finally, don’t forget that each algorithm contains a number of initial parameters.  Tweaking the initial parameters can greatly improve your results.  The "Sweep Parameters" module can help by trying many different input parameters for you, and you can specify the metric that you want to optimize for (such as accuracy, precision, recall, etc.).  See this article for a brief description.
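
To show what a parameter sweep does conceptually, here is a plain-Python sketch that tries several values of one parameter and keeps the one with the best score on held-out data. It is not the "Sweep Parameters" module itself. The k-nearest-neighbors classifier, the candidate values of k, the tiny dataset (with one deliberately mislabeled point), and the choice of accuracy as the metric are all assumptions for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: dist(row[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, validation, k):
    """Fraction of validation points that are classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)

def sweep(train, validation, candidate_ks):
    """Try each candidate k and return (best k, score for every k)."""
    scores = {k: accuracy(train, validation, k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)  # first k with the top score
    return best_k, scores

# Hypothetical data; (1.5, 1.5) is a noisy point mislabeled as "b".
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"), ((2, 2), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"), ((6, 6), "b"),
    ((1.5, 1.5), "b"),
]
validation = [
    ((1.4, 1.6), "a"), ((1.6, 1.4), "a"),
    ((5.5, 5.5), "b"), ((5.4, 5.6), "b"),
]

best_k, scores = sweep(train, validation, [1, 3, 5])
print(best_k, scores)
```

With k=1 the noisy point misleads the classifier, while larger values of k outvote it. That is the kind of difference a sweep surfaces without you having to try each setting by hand. Swapping `accuracy` for precision or recall would change which setting wins, which is why the choice of metric matters.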