This is the first of three articles about performance measures and graphs for binary learning models in Azure ML. Binary learning models are models which predict just one of two outcomes: positive or negative. These models are very well suited to drive decisions, such as whether to administer a certain drug to a patient or to include a lead in a targeted marketing campaign.
This first article lays the foundation by covering several statistical measures: accuracy, precision, recall and F1 score. These measures require a solid understanding of the two types of prediction errors, which we will also cover: false positives and false negatives.
In the second article we’ll discuss the ROC curve and the related AUC measure. We’ll also look at another graph in Azure ML called the Precision/Recall curve.
The final article will cover the threshold setting, and how to find the optimal value for it. As you will learn, this requires a good understanding of the cost of inaccurate predictions.
At the time of this writing, Azure ML is still in preview and most of the documentation you find on the internet appears to be written by data scientists for data scientists. This blog is focused on people who want to learn Azure ML, but do not have a PhD in data science.
When you build experiments in Azure ML, you need to score the learning models you use and evaluate them to understand how well they perform and which model actually works best. Ideally you would want just one number which tells you how good a model is, but Azure ML provides 9 metrics plus a couple of charts. If you’re new to Azure ML and you’re not already a data scientist, this can be a bit intimidating.
You can usually rely on the AUC measure to select which model works best for a given data set. We will cover this further in the second article of this series. After you select a model, you will need to find the optimal threshold value for it, which, as we will see in the third article, you cannot do using the AUC measure. This first article will set the foundation with regard to statistical measures in Azure ML and set you up for the other articles.
The reason Azure ML provides multiple performance measures is that data sets differ in two key ways:
- The distribution of positives and negatives, that is, the ratio between positive values and negative values in the data. Data scientists call this the class distribution. More often than not the class distribution of positive and negative values is not 50/50.
- The cost of wrong observations. When data scientists say “observation”, they mean the predicted value. It is important to realize there are two types of errors, false positives and false negatives, which often have different associated costs. For example, the cost of not administering a drug to someone who needs it might be different than the cost of administering the drug to someone who does not need it.
Binary models just predict one of two values. The meaning of these values could be anything, depending on the data. For example, whether a tested device is defective, if a person is over 50 years old, or whether a certain stock will increase in value. To generalize this, statisticians and data scientists use the terms “positive” and “negative”, rather than true/false or 0/1 as a software developer would do. We will follow this convention.
The decision of which value to assign the “positive” label is not arbitrary. You should assign “positive” to the value that drives your decision. For example, if you are going to run a marketing campaign targeted towards people over 50 years old, and your model will predict if someone is over 50 years old, you should assign “over 50 years” as the positive value. Or, if you build a model to detect defective devices so you can rework them before shipping to the customer, you should define “is defective” as the positive value. If you don't do this right, many statistical measures, including precision and recall, will make little sense.
Accessing performance metrics in Azure ML
Azure ML has a large number of sample models. One you can use to follow along is called “Sample 5: Train, Test, Evaluate for Binary Classification: Adult Dataset”.
To view the performance metrics, right-click on the output port of the Evaluate Model item and select Visualize.
After you click the Visualize button, you’ll initially see the ROC plot. Scroll down to see the performance metrics:
Above, all measures except AUC can be calculated based on the left-most four numbers. I created a spreadsheet which allows you to experiment with these numbers. You can also use the spreadsheet to review the formulas for each of the metrics. Here’s what the spreadsheet looks like:
You can download the spreadsheet here: 5141.Binary Model Performance Metric Calculator.xlsx
Understanding false positives and false negatives
The performance of a trained model is determined by how well the observations (predictions) reflect the actual events. When you train a (supervised) model, you need a “labeled” data set which includes the actual values which you’re trying to predict. You set aside a subset of the labeled data so that, after you have trained the model, you can evaluate the model against labeled data which was not used to train the model. By comparing the predicted values against the labeled ones, you end up with four bins:
| |Event = positive|Event = negative|
|Prediction = positive|True Positive|False Positive|
|Prediction = negative|False Negative|True Negative|
True positives and true negatives are the observations which were correctly predicted, and therefore shown in green.
The terms “false positive” and “false negative” can be confusing. False negatives are observations where the model predicted negative but the actual event was positive; false positives are observations where the model predicted positive but the actual event was negative. The way to think about it is that the terms refer to the observations and not the actual events. So if the term starts with “false”, the actual value is the opposite of the word that follows it.
Accuracy is perhaps the most intuitive performance measure. It is simply the ratio of correctly predicted observations to the total number of observations.
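To make the four bins concrete, here is a minimal Python sketch that counts them from a list of actual events and a list of predictions. The function name `confusion_counts` and the six-row data set are made up for illustration; they are not part of Azure ML.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count the four bins. Both inputs are sequences of booleans (True = positive)."""
    counts = Counter()
    for event, prediction in zip(actual, predicted):
        if prediction and event:
            counts["TP"] += 1   # predicted positive, event was positive
        elif prediction and not event:
            counts["FP"] += 1   # predicted positive, event was negative
        elif not prediction and event:
            counts["FN"] += 1   # predicted negative, event was positive
        else:
            counts["TN"] += 1   # predicted negative, event was negative
    return counts

# Hypothetical labeled test set and model output (illustrative values only).
actual    = [True, True,  False, False, True, False]
predicted = [True, False, True,  False, True, False]

counts = confusion_counts(actual, predicted)
print(dict(counts))
```

Note how the second word of each bin name ("positive" or "negative") always follows the prediction, not the event.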
In the screenshot you saw before, the two green cells are the correctly predicted observations: 2,894 + 11,750 = 14,644. The total number of events is 16,281, hence the accuracy is 14,644 / 16,281 = 0.899, or approximately 90%.
Using accuracy is only good for symmetric data sets where the class distribution is 50/50 and the costs of false positives and false negatives are roughly the same. It can be attractive at first because it’s intuitively easy to understand. However, you should not rely on it too much because most data sets are far from symmetric.
Example: you are building a model which predicts whether a device is defective. The class distribution is such that 6 in 1000 devices are truly defective (positive). A model which simply returns “negative” (i.e. not defective) all the time gets it right 99.4% of the time and therefore has an accuracy of 0.994, when in fact it never correctly identifies a defective device!
Precision looks at the ratio of correct positive observations to all positive observations.
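The same arithmetic as a short Python sketch. The helper name `accuracy` is my own; the three numbers are the ones quoted above from the screenshot.

```python
def accuracy(correct_predictions, total_observations):
    """Share of observations that the model predicted correctly."""
    return correct_predictions / total_observations

# The two green cells (correct predictions) and the total from the screenshot.
correct = 2894 + 11750
total = 16281

acc = accuracy(correct, total)
print(f"accuracy = {correct}/{total} = {acc:.3f}")
```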
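A small Python sketch of this example shows why accuracy alone can mislead. It builds the always-negative model's four bins from the numbers in the text (1000 devices, 6 defective); the variable names are illustrative.

```python
total_devices = 1000
defective = 6  # actual positives

# A model that always answers "negative" never produces a positive prediction.
tp = 0                           # defective devices it caught
fp = 0                           # good devices it flagged as defective
fn = defective                   # defective devices it missed
tn = total_devices - defective   # good devices it correctly passed

accuracy = (tp + tn) / total_devices
caught_share = tp / (tp + fn)    # share of defective devices actually found

print(f"accuracy = {accuracy:.3f}, defective devices found = {caught_share:.0%}")
```

The accuracy looks excellent while the share of defective devices found is zero, which is exactly the quantity you care about here.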
The formula is True Positives / (True Positives + False Positives).
Note that the denominator is the count of all positive predictions, including positive observations of events which were in fact negative.
Extending the example we used before to explain accuracy, let's assume we improved the model and it now predicts 8 out of 1000 devices being faulty. If only 5 of those 8 are truly defective, the precision is 5/8 = 0.625.
Recall is also known as sensitivity or true positive rate. It’s the ratio of correctly predicted positive events to all positive events.
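As a quick check, the precision formula applied to this example in Python. The function name `precision` and the zero-denominator convention (returning 0.0 when the model makes no positive predictions) are my own assumptions.

```python
def precision(tp, fp):
    """True Positives / (True Positives + False Positives)."""
    positive_predictions = tp + fp
    # Assumed convention: with no positive predictions, report 0.0 instead of dividing by zero.
    return tp / positive_predictions if positive_predictions else 0.0

predicted_defective = 8   # devices the model flagged
truly_defective = 5       # of those, how many really were defective

tp = truly_defective
fp = predicted_defective - truly_defective

print(f"precision = {tp}/{tp + fp} = {precision(tp, fp):.3f}")
```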
Recall is calculated as True Positives / (True Positives + False Negatives).
Note that the denominator is the count of all positive events, regardless of whether they were correctly predicted by the model.
Using the same example as before, we already knew that 6 out of 1000 devices are truly defective. The model correctly predicted 5 of them. (It also predicted 3 incorrectly, but for the recall measure that’s not important.) The recall therefore is 5/6 = 0.833.
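The sketch below derives all four bins for this example from the numbers given in the text (1000 devices, 6 truly defective, 8 flagged, 5 flagged correctly) and then computes recall. The function name `recall` is illustrative.

```python
def recall(tp, fn):
    """True Positives / (True Positives + False Negatives)."""
    positive_events = tp + fn
    # Assumed convention: with no positive events, report 0.0 instead of dividing by zero.
    return tp / positive_events if positive_events else 0.0

total_devices = 1000
actual_defective = 6
predicted_defective = 8

tp = 5                           # flagged and truly defective
fp = predicted_defective - tp    # flagged but fine
fn = actual_defective - tp       # defective but missed
tn = total_devices - tp - fp - fn

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"recall = {tp}/{tp + fn} = {recall(tp, fn):.3f}")
```

The 3 false positives appear in the bins but play no role in the recall calculation.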
Comparing Precision and Recall
To better understand the difference between precision and recall, here is a screenshot with all the stats:
Both precision and recall work well if there’s an uneven class distribution, as is often the case. They both focus on the performance of positives rather than negatives, which is why it’s important to correctly assign the “positive” label to the value of most interest.
The precision measure shows what percentage of positive predictions were correct, whereas recall measures what percentage of positive events were correctly predicted. To put it in a different way: precision is a measure of how good the predictions are with regard to false positives, whereas recall measures how good the predictions are with regard to false negatives. Whichever type of error is more important (or costs more) is the one that should receive most attention.
As we will further explore in the third article, you can influence precision and recall by changing the threshold of the model. Based on the cost of false positives and false negatives, you can calculate the optimal threshold value.
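To illustrate the effect of the threshold, here is a minimal Python sketch. It assumes the model outputs a score between 0 and 1 for each observation and that a score at or above the threshold counts as a positive prediction; both the comparison rule and the six scores are assumptions made up for illustration, not taken from Azure ML.

```python
def precision_recall_at(scores, actual, threshold):
    """Precision and recall when scores >= threshold are treated as positive predictions."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores and the matching actual events.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
actual = [True, True, False, True, False, False]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(scores, actual, threshold)
    print(f"threshold={threshold}: precision={p:.3f}, recall={r:.3f}")
```

In this toy data set, raising the threshold removes false positives (precision goes up) at the price of missing more positive events (recall goes down).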
The F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. It works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it’s better to look at both Precision and Recall.
The formula for F1 Score is 2*(Recall * Precision)/(Recall + Precision).
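Applied to the defective-device example used earlier (precision 5/8, recall 5/6), the formula can be sketched in Python as follows. The function name `f1_score` here is my own helper, not a library call.

```python
def f1_score(precision, recall):
    """2 * (Recall * Precision) / (Recall + Precision)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both measures are zero
    return 2 * (recall * precision) / (recall + precision)

precision = 5 / 8   # 5 of the 8 flagged devices were truly defective
recall = 5 / 6      # 5 of the 6 defective devices were flagged

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")

# Equivalent form in terms of the bins: 2*TP / (2*TP + FP + FN).
tp, fp, fn = 5, 3, 1
f1_from_counts = 2 * tp / (2 * tp + fp + fn)
```

Both forms give the same value, which sits between precision and recall but closer to the lower of the two.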
Putting it together
The table below shows you which of the measures described in this article are probably best to focus on, depending on the class distribution and the cost of false positives and false negatives:
In addition to the measures mentioned in this post, you might encounter others in literature such as “negative recall”. With the basic understanding of this article you should be able to figure out what they mean and when they are to be used. If you correctly assign the value that is of most interest as “positive”, you usually won't need those other measures anyway.
I hope you found this useful. Please leave comments or send me an email if you think I missed something important or if you have any other questions or feedback about this topic.