A simple and interpretable performance measure for a binary classifier

Binary classification tasks are the bread and butter of machine learning. However, the standard statistic for classifier performance, the ROC-AUC, is a mathematical tool that is difficult to interpret. Here, a performance measure is introduced that simply considers the probability of making a correct binary classification.

By Mehmet Suzen, Theoretical Physicist and Research Scientist.

A core application of machine learning models is the binary classification task. It appears in a plethora of areas, from diagnostic tests in medicine to credit risk decisions for consumers. Techniques for building classifiers vary from simple decision trees to logistic regression and, lately, super-cool deep learning models that leverage multilayered neural networks. Although these are mathematically different in construction and training methodology, things get tricky when it comes to measuring their performance. In this post, we propose a simple and interpretable performance measure for a binary classifier in practice. Some background in classification is assumed.


Why is ROC-AUC not interpretable?

Varying threshold produces different confusion matrices (Wikipedia).

The de facto standard in reporting classifier performance is the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) measure. It originated in the 1940s during the development of radar by the US Navy, as a way of measuring detection performance. There are at least 5 different definitions of what ROC-AUC means, and even people with a Ph.D. in Machine Learning have an excessively difficult time explaining what AUC means as a performance measure. Because AUC functionality is available in almost all libraries, reporting it as a classification performance measure has become almost a religious ritual in Machine Learning papers. However, its interpretation is not easy, quite apart from its absurd comparison issues (see hmeasure). AUC measures the area under the curve of the True Positive Rate (TPR) as a function of the False Positive Rate (FPR), where both rates are extracted from confusion matrices at different thresholds.

f(x) = y

∫_0^1 f(x) dx = AUC

where y is TPR and x is FPR. Apart from the multitude of interpretations and the ease with which they are confused, there is no clear purpose in taking the integral over FPR. Obviously, we would like to have perfect classification with an FPR of zero, but the meaning of the area is not mathematically clear; that is, it is not clear what it is as a mathematical object.
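To make the construction concrete, here is a minimal sketch of how the area is typically computed: sweep a threshold over the classifier scores, record one (FPR, TPR) point per threshold, and integrate with the trapezoidal rule. The function names and the toy scores and labels are illustrative assumptions, not taken from any particular library.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start at (0, 0): a threshold above every score predicts no positives.
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step can only increase FPR and TPR.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Integrate TPR over FPR from 0 to 1 with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: classifier scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = auc_trapezoid(points)
print(points)
print(auc)
```

Each point on the curve corresponds to a different confusion matrix, so the final number summarizes many operating points at once rather than the one actually used in practice.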


Probability of correct classification (PCC)

A simple and interpretable performance measure for a binary classifier would be great for both highly technical data scientists and non-technical stakeholders. The basic tenet in this direction is that the purpose of a classifier is to differentiate between two classes. This boils down to a probability value, the probability of correct classification (PCC). An obvious choice is the so-called balanced accuracy (BA). This is usually recommended for unbalanced problems, even by SAS, though they used multiplication of probabilities. Here we will refer to BA as PCC and use addition instead, due to statistical dependence:

PCC = (TPR + TNR) / 2

TPR = TP / (ConditionPositive) = TP / (TP + FN)
TNR = TN / (ConditionNegative) = TN / (TN + FP).

PCC tells us how good the classifier is at detecting either class, and it is a probability value in [0, 1]. Note that using total accuracy over both positive and negative cases is misleading: even if our training data is balanced, the batches on which we measure performance in production may not be, so accuracy alone is not a good measure.
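As a small worked example of the formulas above, the sketch below computes PCC and total accuracy from the four confusion-matrix counts. The counts describe a hypothetical, heavily unbalanced batch on which the classifier predicts "negative" for everything; the function names are illustrative.

```python
def pcc(tp, fn, tn, fp):
    """Probability of correct classification: (TPR + TNR) / 2."""
    tpr = tp / (tp + fn)  # TP / ConditionPositive
    tnr = tn / (tn + fp)  # TN / ConditionNegative
    return (tpr + tnr) / 2


def accuracy(tp, fn, tn, fp):
    """Total accuracy over positive and negative cases together."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical batch: 10 positives, 990 negatives, every case predicted negative.
tp, fn, tn, fp = 0, 10, 990, 0

batch_accuracy = accuracy(tp, fn, tn, fp)
batch_pcc = pcc(tp, fn, tn, fp)
print(batch_accuracy)  # 0.99
print(batch_pcc)       # 0.5
```

Accuracy rewards the classifier for the class imbalance alone, while PCC reports 0.5, the value one would expect from a classifier that cannot tell the two classes apart.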


Production issues

The immediate question is how to choose the threshold used to generate the confusion matrix. One option is to choose the threshold that maximizes PCC on the test set and use it in production. To improve the estimation of PCC, resampling on the test set can be performed to obtain a good estimate of its uncertainty.
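The two steps can be sketched as follows, assuming a plain bootstrap with a percentile interval as the resampling scheme (the post does not specify one). The helper names, the toy data, and the 95% interval are illustrative assumptions.

```python
import random


def pcc_at_threshold(scores, labels, threshold):
    """PCC = (TPR + TNR) / 2 for predictions 'score >= threshold'."""
    tp = fn = tn = fp = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


def best_threshold(scores, labels):
    """Pick the candidate threshold (an observed score) with the highest PCC."""
    return max(sorted(set(scores)),
               key=lambda t: pcc_at_threshold(scores, labels, t))


def bootstrap_pcc_interval(scores, labels, threshold, n_resamples=1000, seed=0):
    """Percentile bootstrap interval (2.5%, 97.5%) for PCC at a fixed threshold."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples with a single class, where TPR or TNR is undefined.
        if 0 < sum(ys) < n:
            xs = [scores[i] for i in idx]
            estimates.append(pcc_at_threshold(xs, ys, threshold))
    estimates.sort()
    return estimates[int(0.025 * n_resamples)], estimates[int(0.975 * n_resamples) - 1]


# Hypothetical test set: classifier scores and true labels (1 = positive).
scores = [0.95, 0.9, 0.85, 0.7, 0.65, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

threshold = best_threshold(scores, labels)
point_estimate = pcc_at_threshold(scores, labels, threshold)
low, high = bootstrap_pcc_interval(scores, labels, threshold)
print(threshold, point_estimate, (low, high))
```

Because the threshold is tuned on the same data that is then resampled, the resulting PCC is likely to be somewhat optimistic; holding out a separate set for the threshold choice is one possible refinement.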



We try to circumvent the reporting of AUCs by introducing PCC, or balanced accuracy, as a simple and interpretable performance measure for a binary classifier. This is easy to explain to a non-technical audience. An improved PCC that has better estimation properties can be introduced, but the main interpretation remains the same: the probability of correct classification.

Original. Reposted with permission.