Choosing the Right Metric for Evaluating Machine Learning Models — Part 2
This post focuses on commonly used metrics in classification, and on why, in context, we should prefer some over others.
By Alvira Swalin, University of San Francisco
In the first blog, we discussed some important metrics used in regression, their pros and cons, and their use cases. This part focuses on commonly used metrics in classification, and on why, in context, we should prefer some over others.
Definitions
Let’s first understand the basic terminology used in classification problems before going through the pros and cons of each method. You can skip this section if you are already familiar with the terminology.
Source of Image: Wikipedia
 Recall or Sensitivity or TPR (True Positive Rate): Number of items correctly identified as positive out of total true positives TP/(TP+FN)
 Specificity or TNR (True Negative Rate): Number of items correctly identified as negative out of total negatives TN/(TN+FP)
 Precision: Number of items correctly identified as positive out of total items identified as positive TP/(TP+FP)
 False Positive Rate or Type I Error: Number of items wrongly identified as positive out of total true negatives FP/(FP+TN)
 False Negative Rate or Type II Error: Number of items wrongly identified as negative out of total true positives FN/(FN+TP)
Source of Image: Effect Size FAQs by Paul Ellis
 Confusion Matrix
 F1 Score: The harmonic mean of precision and recall, given by F1 = 2*Precision*Recall/(Precision + Recall)
 Accuracy: Percentage of total items classified correctly (TP+TN)/(N+P)
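As a quick sketch, all of the definitions above can be computed directly from the four confusion-matrix counts; the counts below are made-up illustrative values, not from any table in this post:

```python
# Sketch: computing the metrics above from raw confusion-matrix counts.
# TP/FP/TN/FN are illustrative values.
TP, FP, TN, FN = 40, 10, 45, 5

recall      = TP / (TP + FN)            # sensitivity / TPR
specificity = TN / (TN + FP)            # TNR
precision   = TP / (TP + FP)
fpr         = FP / (FP + TN)            # false positive rate (Type I error)
fnr         = FN / (FN + TP)            # false negative rate (Type II error)
f1          = 2 * precision * recall / (precision + recall)
accuracy    = (TP + TN) / (TP + TN + FP + FN)

print(recall, specificity, precision, f1, accuracy)
```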
ROC-AUC Score
The probabilistic interpretation of the ROC-AUC score is this: if you randomly choose a positive case and a negative case, the probability that the classifier ranks the positive case above the negative case is given by the AUC. Here, rank is determined by the order of the predicted values.
Source of Image: UNC Lecture
Mathematically, it is calculated as the area under the curve of sensitivity (TPR) vs. FPR (1-specificity). Ideally, we would like to have both high sensitivity & high specificity, but in real-world scenarios there is always a trade-off between the two.
Some important characteristics of ROC-AUC are:
 The value can range from 0 to 1. However, the AUC score of a random classifier on balanced data is 0.5
 The ROC-AUC score is independent of the threshold set for classification, because it only considers the rank of each prediction and not its absolute value. The same is not true for the F1 score, which needs a threshold when the model outputs probabilities
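The rank-based interpretation above can be sketched in a few lines; the labels and scores below are illustrative:

```python
# Sketch of the probabilistic interpretation: AUC equals the fraction of
# (positive, negative) pairs in which the positive case gets the higher score,
# counting ties as half a win.
def auc_by_ranking(y_true, y_score):
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(y_true, y_score))                     # 0.75

# Any strictly increasing transform of the scores preserves their ranks,
# so the AUC is unchanged: this is the threshold-independence noted above.
print(auc_by_ranking(y_true, [s ** 2 for s in y_score]))   # still 0.75
```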
Log Loss
Log loss is a measure of accuracy that incorporates the idea of probabilistic confidence. For the binary case it is given by the following expression:
Log loss = -(1/N) * Σ_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
where y_i is the actual label of observation i and p_i is the predicted probability of the positive class.
It takes into account the uncertainty of your prediction based on how much it deviates from the actual label. In the worst case, say you predicted 0.5 for all the observations: the log loss would then be -log(0.5) = 0.693. Hence, we can say that anything above about 0.6 is a very poor model in terms of the actual probabilities.
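The 0.693 baseline above can be checked with a minimal sketch (the labels below are made up; the result is the same for any labels when every prediction is 0.5):

```python
import math

# Sketch: binary log loss. Predicting 0.5 everywhere yields -ln(0.5) ≈ 0.693,
# the uninformative baseline mentioned above.
def log_loss(y_true, y_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)        # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 1, 0]
print(round(log_loss(y_true, [0.5] * 5), 3))   # 0.693
```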
Case 1
Comparison of Log Loss with ROC & F1
Consider Case 1 (balanced data): it looks like model 1 does a better job of predicting the absolute probabilities, whereas model 2 is best at ranking observations according to their true labels. Let's verify with the actual scores:
If you consider log loss, model 2 is the worst, giving a high log loss value, because its absolute probabilities differ greatly from the actual labels. But this is in complete disagreement with the F1 & AUC scores, according to which model 2 has 100% accuracy. Also note that the F1 score changes with the threshold, preferring model 1 over model 2 at the default threshold of 0.5.
Inferences drawn from the above example (balanced):
 If you care about the absolute probabilistic difference, go with log loss
 If you care only about the final class prediction and don't want to tune a threshold, go with the AUC score
 The F1 score is sensitive to the threshold; you would want to tune it before comparing the models
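The threshold sensitivity of F1 noted above can be illustrated with a small sketch; the scores and labels below are made up, not the Case 1 table:

```python
# Sketch: F1 changes as the classification threshold moves, even though the
# ranking of the scores (and hence the ROC-AUC) stays fixed.
def f1_at_threshold(y_true, y_score, thresh):
    y_pred = [1 if s >= thresh else 0 for s in y_score]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true  = [0, 0, 0, 1, 1, 1]
y_score = [0.2, 0.45, 0.6, 0.55, 0.7, 0.9]

for t in (0.3, 0.5, 0.7):
    print(t, round(f1_at_threshold(y_true, y_score, t), 3))
```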
Case 2
How does each of them deal with class imbalance?
The only difference between the two models is their predictions for observations 13 & 14. Model 1 does a better job of classifying observation 13 (label 0), whereas model 2 does better on observation 14 (label 1). The goal is to see which model better captures the difference in classifying the imbalanced class (the class with few observations, here label 1). In problems like fraud detection or spam detection, where positive labels are few, we want our model to predict the positive classes correctly, and hence we will sometimes prefer the model that classifies these positive labels well.
Clearly, log loss fails in this case, because according to log loss both models perform equally. This is because the log loss function is symmetric and does not differentiate between classes.
Both the F1 score and the ROC-AUC score do better, preferring model 2 over model 1. So we can use either method to handle class imbalance. But we will have to dig further to see how differently they treat it.
In the previous example, there were few positive labels; in the second example, there are few negative labels. Let's see how the F1 score & ROC-AUC differentiate between these two cases.
The ROC-AUC score handled the case of few negative labels the same way it handled the case of few positive labels. An interesting thing to note is that the F1 score is pretty much the same for both model 3 & model 4, because positive labels are large in number and F1 cares only about the misclassification of positive labels.
Inferences drawn from above example:
 If you care about a class that is smaller in number, independent of whether it is positive or negative, go for the ROC-AUC score.
When would you prefer F1 over ROC-AUC?
When you have a small positive class, the F1 score makes more sense. This is the common situation in fraud detection, where positive labels are few. We can understand this statement with the following example.
We can see that model (1) predicts 5 positives out of 100 true positives in a dataset of 10K observations, while model (2) predicts 90 positives out of 100 true positives. Clearly, model (2) is doing a much better job than model (1) in this case. Let's see whether both the F1 score & the ROC-AUC score are able to capture that difference.
F1 score for model (1) = 2*(1)*(0.05)/1.05 = 0.095
F1 score for model (2) = 2*(1)*(0.9)/1.9 = 0.947
Yes, the difference in F1 score reflects the model performance.
ROC-AUC for model (1) = 0.5
ROC-AUC for model (2) = 0.93
ROC-AUC gives a decent score to model (1) as well, which is not a good indicator of its performance. Hence, we should be careful when picking ROC-AUC for imbalanced datasets.
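The F1 arithmetic above can be reproduced from the stated counts; precision is taken as 1 for both hypothetical models, on the assumption that all of their predicted positives are correct:

```python
# Sketch: F1 from the recall values implied by the example
# (10K observations, 100 true positives).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

recall_m1 = 5 / 100    # model (1): finds 5 of 100 positives
recall_m2 = 90 / 100   # model (2): finds 90 of 100 positives

print(round(f1(1.0, recall_m1), 3))   # ≈ 0.095
print(round(f1(1.0, recall_m2), 3))   # ≈ 0.947
```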
Which metric should you use for multi-class classification?
There are three further types of non-binary classification:
 Multi-class: a classification task with more than two classes, where the input is to be assigned to one, and only one, of these classes. Example: classify a set of images of fruits into one of these categories: apples, bananas, or oranges.
 Multi-label: classifying a sample into a set of target labels. Example: tagging a blog with one or more topics like technology, religion, politics, etc. Labels are isolated and their relations are not considered important.
 Hierarchical: each category can be grouped together with similar categories, creating meta-classes, which in turn can be grouped again until we reach the root level (the set containing all data). Examples include text classification & species classification. For more details, refer to this blog.
In this blog, we will cover only the first category.
Source: https://www.sciencedirect.com/science/article/pii/S0306457309000259
As you can see in the above table, there are broadly two types of metrics, micro-average & macro-average; we will discuss the pros and cons of each. The most commonly used metrics for multi-class problems are the F1 score, average accuracy, and log loss. There is as yet no well-developed ROC-AUC score for multi-class classification.
Log loss for multi-class is defined as:
Log loss = -(1/N) * Σ_i Σ_j [ y_ij * log(p_ij) ]
where y_ij is 1 if sample i belongs to class j (and 0 otherwise), and p_ij is the predicted probability that sample i belongs to class j.
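This multi-class log loss can be sketched numerically; the one-hot labels and probabilities below are made-up illustrative values:

```python
import math

# Sketch: multi-class log loss. y[i][j] is 1 iff sample i belongs to class j;
# p[i][j] is the predicted probability of class j for sample i.
def multiclass_log_loss(y, p, eps=1e-15):
    total = 0.0
    for yi, pi in zip(y, p):
        for yij, pij in zip(yi, pi):
            total += yij * math.log(min(max(pij, eps), 1 - eps))
    return -total / len(y)

y = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
p = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.1, 0.8]]
print(round(multiclass_log_loss(y, p), 4))
```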
 In the micro-average method, you sum up the individual true positives, false positives, and false negatives of the system over the different sets and then compute the statistics from those pooled counts.
 In the macro-average method, you take the average of the precision and recall of the system on the different sets
Micro-average is preferable if there is a class imbalance problem.
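The difference between the two averaging schemes can be sketched with illustrative per-class counts (one class deliberately rare and poorly predicted):

```python
# Sketch: micro- vs. macro-averaged precision from per-class TP/FP counts.
classes = {
    "A": {"tp": 90, "fp": 10},   # majority class, predicted well
    "B": {"tp": 1,  "fp": 9},    # rare class, predicted poorly
}

# Macro: average the per-class precisions, each class weighted equally.
macro = sum(c["tp"] / (c["tp"] + c["fp"]) for c in classes.values()) / len(classes)

# Micro: pool the counts first, then compute a single precision.
tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
micro = tp / (tp + fp)

# The two averages diverge when classes are imbalanced: macro gives the rare
# class equal weight, while micro is dominated by the majority class.
print(round(macro, 3), round(micro, 3))
```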
In the third part, I will focus on metrics used in unsupervised learning problems, where it is even harder to quantify the correctness of a model in the absence of target variables. Stay tuned! In the meantime, check out my other blogs here!
References
 https://classeval.wordpress.com/simulation-analysis/roc-and-precision-recall-with-imbalanced-datasets/
 https://en.wikipedia.org/wiki/Precision_and_recall
 https://www.sciencedirect.com/science/article/pii/S0306457309000259
 https://stats.stackexchange.com/questions/11859/what-is-the-difference-between-multiclass-and-multilabel-problem
 https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin/16001
Bio: Alvira Swalin (Medium) is currently pursuing a Master's in Data Science at USF and is particularly interested in Machine Learning & Predictive Modeling. She is a Data Science Intern at Price (Fx).
Original. Reposted with permission.
Related:
 Choosing the Right Metric for Evaluating Machine Learning Models – Part 1
 CatBoost vs. Light GBM vs. XGBoost
 Machine Learning Model Metrics