Understanding AzureML – Part 1: Regression

10/29/2014

This is a cross-post from my blog, www.IndieDevSpot.com. Feel free to visit that site for more articles or to join the monthly newsletter.

To see the article on that website, please visit: https://indiedevspot.azurewebsites.net/2014/10/29/understanding-azureml-part-1-regression/

Hello World!

So I hope many of you have started using AzureML. If not, you should definitely check it out. Here is the link to the dev center for it. This article will focus on a few key points.

Understanding the evaluation of each model type.
Understanding the published web service of each model.

If you are looking for a simple guide on how to get started, check out this article.

The series will be broken down into three parts.

Part 1: Regression

Part 2: Binary Classification

Part 3: Multi-Class Classification

So let's get started!

About the Data Set Used

The data set used is new car data from the sample Azure ML data sets. We are predicting car price. I let the Feature Selection Module pick the best 5 features to use, which it determined are: engine-size, curb-weight, horsepower, width and highway-mpg.

Important Values

There are a few important values to understand.

These are the metrics for a typical linear regression. I have done my best to break them down to what each one is in common language, but I find with statistics that is very difficult.

Mean Absolute Error: The average of the absolute values (no negatives) of the differences between the predicted values and the actual values. Can only be used to compare models using the same units.
Root Mean Squared Error: The square root of the average of the squared differences between the predicted values and the actual values. Can only be used to compare models using the same units.
Relative Absolute Error: The total absolute error of your model divided by the total absolute error of a baseline that always predicts the average of the actual values. Lower is better, with 0 being a perfect fit. Because it is a ratio, it can be used across units.
Relative Squared Error: Similar to RAE, except the errors on the top and bottom are squared before being summed. It equals 1 minus your coefficient of determination. Can be used across units.
Coefficient of Determination: This is the best indicator in my opinion. It describes the proportion of the variation in the actual values that your model explains, calculated as 1 - (Sum of Squares Error / Sum of Squares Total), so 1 is a perfect fit and 0 is no better than always predicting the average. In the above example, you can see the first is .844, which is actually pretty good. The second model is .905, which is even better!
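To make the definitions above concrete, here is a minimal sketch in plain Python that computes all five metrics from a list of actual values and a list of predictions. The function name and the sample numbers are made up for illustration; this is not how AzureML computes them internally, just the textbook formulas.

```python
import math

def regression_metrics(actual, predicted):
    """Compute the five regression metrics for two equal-length lists of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")

    n = len(actual)
    mean_actual = sum(actual) / n

    # Errors of the model, and errors of a baseline that always predicts the mean.
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    abs_baseline = [abs(a - mean_actual) for a in actual]
    sq_baseline = [(a - mean_actual) ** 2 for a in actual]

    # Note: the relative metrics divide by zero if every actual value is identical.
    relative_squared_error = sum(sq_errors) / sum(sq_baseline)
    return {
        "mean_absolute_error": sum(abs_errors) / n,
        "root_mean_squared_error": math.sqrt(sum(sq_errors) / n),
        "relative_absolute_error": sum(abs_errors) / sum(abs_baseline),
        "relative_squared_error": relative_squared_error,
        "coefficient_of_determination": 1 - relative_squared_error,
    }

# Hypothetical prices (in thousands), not taken from the car data set.
sample_actual = [10, 20, 30, 40]
sample_predicted = [12, 18, 33, 37]
sample_metrics = regression_metrics(sample_actual, sample_predicted)
```

With these made-up numbers the model is off by 2.5 on average, has a relative absolute error of 0.25, and a coefficient of determination of roughly 0.948.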

So here I have decided to grab my key indicators and explain how to use them for the common person (like myself). I have put them in order of importance.

Metrics in importance order with whys

Coefficient of Determination: This is your number one indicator. This needs to be as close to 1 as possible. I try to shoot for .85 or better.
Mean Absolute Error: This matters because it is expressed in the units you are predicting. Let's assume that you are attempting to predict something in the billions, the reproduction of bacteria for example. You may have a curve that fits closely and still be off by a billion or so, which could be a real issue. The larger the numbers you are dealing with, the more apparent this becomes. This is not the accuracy of your curve's shape, but rather your average numerical accuracy.
Root Mean Squared Error: If there is large dispersion of data, this can sometimes be a better guide than Mean Absolute Error, because squaring the errors makes a few large misses count for more than many small ones. In that case, use this value in place of Mean Absolute Error.
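A small worked example may help show why the two error metrics can disagree. The sketch below uses two made-up sets of prediction errors that have the same Mean Absolute Error, but one of them hides a single large miss. The helper names are hypothetical.

```python
import math

def mean_absolute_error(errors):
    """Average size of the errors, ignoring sign."""
    return sum(abs(e) for e in errors) / len(errors)

def root_mean_squared_error(errors):
    """Square root of the average squared error; large errors weigh more."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Made-up errors (predicted minus actual) for two hypothetical models.
steady_errors = [2, 2, 2, 2]   # always off by a little
spiky_errors = [0, 0, 0, 8]    # usually perfect, occasionally way off

steady_mae = mean_absolute_error(steady_errors)        # 2.0
spiky_mae = mean_absolute_error(spiky_errors)          # 2.0
steady_rmse = root_mean_squared_error(steady_errors)   # 2.0
spiky_rmse = root_mean_squared_error(spiky_errors)     # 4.0
```

Both models look identical by Mean Absolute Error, while Root Mean Squared Error doubles for the one with the big miss.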

Those are all the values I use for regression as of today. Maybe I will use more, but for every situation I’ve run into, if I can get those 3 values within my acceptance threshold, my predictions are pretty darned good.

Understanding Published Web Services

Alright, I’ve seen a ton of articles about this that are just completely wrong. So let's set the record straight on how this actually works. This is broken down into a few parts.

Save your trained Model
Create your inputs/outputs
Publish the web service
Understanding your request/response.

Save your trained model

We can see from the metrics above that one model is significantly better than the other. Since the better values are on the bottom, we will be picking the model that feeds the right input node of the Evaluate Model module. In our situation, this is a decision forest. To save the trained model, you simply click the output node of the Train Model module and select “Save Trained Model”.

Building the Production System

Take note of your current inputs to the trained model. Write them down on a piece of paper. Then create a new experiment. Add your initial data set again, along with a Project Columns module and a Score Model module. Instead of a new regression and Train Model module, add your saved trained model and pipe that into your Score Model. Your experiment should look similar to below.

For project columns, start with no columns and include everything you used to train your model EXCEPT the values you are attempting to predict. Note that you should have included your prediction value when training your model, but definitely not for your production system. If somebody provides the price, what is the point in predicting it?
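The column selection rule can be sketched in a few lines of Python. This is only an analogy for what the Project Columns module does, not AzureML code, and the column name "price" for the predicted value is an assumption.

```python
# Columns used during training; "price" (assumed name) is the value being predicted.
TRAINING_COLUMNS = ["width", "curb-weight", "engine-size", "horsepower", "highway-mpg", "price"]
LABEL_COLUMN = "price"

def production_columns(training_columns, label_column):
    """Keep every training column except the one the model is supposed to predict."""
    return [name for name in training_columns if name != label_column]

scoring_columns = production_columns(TRAINING_COLUMNS, LABEL_COLUMN)
```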

Create your inputs/outputs

Right-click the right input node of the Score Model module and select “Set as publish Input”. Right-click the output node of the Score Model module and select “Set as publish Output”. Run your experiment. It should look similar to below.

Creating and Understanding the Published Service

The publish web service button should now be available. Click it and name your service. If it is not available, you may have had an error in your run. Fix the error and run again.

The Request

This should be simple enough to understand if you get JSON (I hope you do if you are working with web requests).

{
  "Id": "score00001",
  "Instance": {
    "FeatureVector": {
      "width": "0",
      "curb-weight": "0",
      "engine-size": "0",
      "horsepower": "0",
      "highway-mpg": "0"
    },
    "GlobalParameters": {}
  }
}
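If you are calling the service from code, a request body of this shape could be built along these lines. This is a minimal sketch using only the Python standard library; the function name and the sample car values are made up. Actually sending the request is left out, since the endpoint URL and API key come from your own service's page.

```python
import json

# Feature names as they appear in the request format for this experiment.
FEATURE_NAMES = ["width", "curb-weight", "engine-size", "horsepower", "highway-mpg"]

def build_score_request(features, request_id="score00001"):
    """Return the JSON request body for one car; every value is sent as a string."""
    missing = [name for name in FEATURE_NAMES if name not in features]
    if missing:
        raise ValueError("missing features: " + ", ".join(missing))

    body = {
        "Id": request_id,
        "Instance": {
            "FeatureVector": {name: str(features[name]) for name in FEATURE_NAMES},
            "GlobalParameters": {},
        },
    }
    return json.dumps(body)

# Hypothetical car; the numbers are for illustration only.
example_body = build_score_request({
    "width": 64.1,
    "curb-weight": 2548,
    "engine-size": 130,
    "horsepower": 111,
    "highway-mpg": 27,
})
```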

The Response

For simple single value regression, the response is fairly straightforward to understand as well. It comes back as a string array in the order width, curb-weight, engine-size, horsepower, highway-mpg, Scored Labels, where Scored Labels is the predicted value.

`["0","0","0","0","0","0"]`

Note that if you don’t want width, curb-weight, engine-size, horsepower or highway-mpg returned, you can add a Project Columns module after your Score Model module and exclude everything except Scored Labels. The only return value will then be the predicted value.
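On the receiving end, the string array can be turned back into named numbers. Here is a minimal sketch, assuming the response arrives as the JSON text of that array and that the column order is the one described above; the function name is made up.

```python
import json

# Column order of the response for this experiment, as described above.
RESPONSE_COLUMNS = ["width", "curb-weight", "engine-size", "horsepower", "highway-mpg", "Scored Labels"]

def parse_score_response(response_text, columns=RESPONSE_COLUMNS):
    """Map the returned string array onto column names and convert values to floats."""
    values = json.loads(response_text)
    if len(values) != len(columns):
        raise ValueError("expected %d values, got %d" % (len(columns), len(values)))
    return {name: float(value) for name, value in zip(columns, values)}

parsed = parse_score_response('["0","0","0","0","0","0"]')
predicted_price = parsed["Scored Labels"]
```

If you project down to Scored Labels only, the same function works by passing `columns=["Scored Labels"]`.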

Summary

I hope you all enjoyed this article and found it helpful. Azure Machine Learning has certainly lowered the bar to entry for machine learning significantly, and I am extremely excited that I only need to understand the gist of these metrics to produce powerful tools that can predict whatever I want. Stay tuned for parts 2 and 3!