The Best Metric to Measure Accuracy of Classification Models
Measuring the accuracy of a model for a classification problem (categorical output) is more complex and time-consuming than for a regression problem (continuous output). Let's walk through the key testing metrics for a classification problem with an example.
By Jacob Joseph, CleverTap.
Unlike evaluating the accuracy of models that predict a continuous dependent variable, such as Linear Regression models, evaluating the accuracy of a classification model can be more complex and time-consuming. Before measuring the accuracy of a classification model, an analyst would first measure its robustness with the help of metrics such as AIC-BIC, AUC-ROC, AUC-PR, the Kolmogorov-Smirnov chart, etc. The next logical step is to measure its accuracy. To understand the complexity behind measuring accuracy, we need to know a few basic concepts.
Most classification models output a probability for each observation in the dataset.
E.g. – A classification model like Logistic Regression will output a probability between 0 and 1 instead of the desired value of the actual target variable, such as Yes/No.
The next logical step is to translate this probability into the target/dependent variable of the model and test the accuracy of the model. To understand the implications of translating the probability, let's go over a few basic concepts for evaluating a classification model with the help of the example given below.
Goal: Create a classification model that predicts fraud transactions
Output: Transactions that are predicted to be Fraud and Non-Fraud
Testing: Comparing the predicted result with the actual results
Dataset: Number of Observations: 1 million; Fraud : 100; Non-Fraud: 999,900
The fraud observations constitute just 0.1% of the entire dataset, representing a typical case of Imbalanced Classes. Imbalanced Classes arise in classification problems where the classes are not represented equally. Suppose you created a model that predicted 95% of the transactions as Non-Fraud, and every one of those Non-Fraud predictions turned out to be accurate. That high accuracy on Non-Frauds shouldn't get you excited: actual Frauds make up just 0.1% of the data, while the Predicted Frauds constitute 5% of the observations, vastly overshooting the 100 real cases.
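To see why raw accuracy misleads on imbalanced classes, consider a minimal sketch using the counts above. The trivial always-Non-Fraud classifier here is an illustration, not the article's model:

```python
# Accuracy paradox on the dataset above: 1,000,000 transactions,
# of which only 100 are frauds. A trivial model that labels every
# transaction Non-Fraud still scores 99.99% accuracy while
# catching zero frauds.
n_total = 1_000_000
n_fraud = 100

# Trivial classifier: predict Non-Fraud for every observation.
correct = n_total - n_fraud          # every Non-Fraud is "right"
accuracy = correct / n_total
frauds_caught = 0                    # it never predicts Fraud

print(f"Accuracy: {accuracy:.4%}")   # 99.9900%
print(f"Frauds caught: {frauds_caught} of {n_fraud}")
```

High overall accuracy is therefore compatible with a model that is useless for the class you actually care about.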
Assuming you were able to translate the output of your model into Fraud/Non-Fraud, the predicted results could be compared to the actual results and summarized as follows:
a) True Positives: Observations where the actual and predicted transactions were fraud
b) True Negatives: Observations where the actual and predicted transactions weren’t fraud
c) False Positives: Observations where the actual transactions weren’t fraud but predicted to be fraud
d) False Negatives: Observations where the actual transactions were fraud but weren’t predicted to be fraud
Confusion Matrix is a popular way to represent the summarized findings.
| | Predicted Fraud | Predicted Non-Fraud |
|---|---|---|
| **Actual Fraud** | True Positives (TP) | False Negatives (FN) |
| **Actual Non-Fraud** | False Positives (FP) | True Negatives (TN) |
Typically, a classification model outputs the result in the form of probabilities as shown below:
Suppose we take 0.5 as the cut-off probability, i.e., observations with a probability of 0.5 or above are marked as Fraud and those below 0.5 as Non-Fraud. The first 5 rows of the dataset would then be labelled accordingly.
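This translation step can be sketched in code. The five probability values below are made up for illustration, since the article's table of actual values is not reproduced here:

```python
# Hypothetical probability outputs for the first 5 observations;
# the real values from the model are not shown in the article.
probabilities = [0.83, 0.12, 0.64, 0.45, 0.91]

CUTOFF = 0.5  # probability >= 0.5 -> Fraud, otherwise Non-Fraud
labels = ["Fraud" if p >= CUTOFF else "Non-Fraud" for p in probabilities]

for p, label in zip(probabilities, labels):
    print(f"{p:.2f} -> {label}")
```

Changing `CUTOFF` changes every downstream count in the confusion matrix, which is why the choice of cut-off matters as much as the model itself.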
Let’s summarize the results from the model of the entire dataset with the help of the confusion matrix:
| | Predicted Fraud | Predicted Non-Fraud |
|---|---|---|
| **Actual Fraud** | TP = 90 | FN = 10 |
| **Actual Non-Fraud** | FP = 10 | TN = 999,890 |
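The four cells can be tallied directly from paired actual/predicted labels. A minimal sketch on a toy ten-observation sample (not the article's million-row dataset):

```python
from collections import Counter

# Toy actual/predicted labels (1 = Fraud, 0 = Non-Fraud),
# purely for illustration.
actual    = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

counts = Counter(zip(actual, predicted))
tp = counts[(1, 1)]  # actual Fraud, predicted Fraud
fn = counts[(1, 0)]  # actual Fraud, predicted Non-Fraud
fp = counts[(0, 1)]  # actual Non-Fraud, predicted Fraud
tn = counts[(0, 0)]  # actual Non-Fraud, predicted Non-Fraud

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```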
We have all non-zero cells in the above matrix. So is this result ideal?
Wouldn’t we love a scenario wherein the model accurately identifies the Frauds and the Non-Frauds i.e. zero entry for cells, FP and FN?
A BIG YES.
Consider a scenario wherein, as a marketing analyst, you would like to identify users who are likely to buy but haven't bought yet. These would be users who share the characteristics of the users who did buy. Such users fall under False Positives: users who were predicted to transact but didn't transact in reality. Hence, in addition to non-zero entries in TP and TN, you would actually welcome a non-zero entry in FP. Thus, what counts as model accuracy depends on the goal of the prediction exercise.
Key Testing Metrics
Since we are now comfortable with the interpretation of the Confusion Matrix, let’s look at some popular metrics used for testing the classification models:
Sensitivity, also known as the True Positive Rate or Recall, is calculated as:
Sensitivity = No. of True Positives / (No. of True Positives + No. of False Negatives)
Sensitivity = TP / (TP + FN)
Since the formula doesn’t contain FP and TN, Sensitivity may give you a biased result, especially for imbalanced classes.
In the example of Fraud detection, it gives you the percentage of Correctly Predicted Frauds from the pool of Actual Frauds.
Sensitivity = 90 / (90 + 10) = 0.90
Specificity, also known as the True Negative Rate, is calculated as:
Specificity = No. of True Negatives / (No. of True Negatives + No. of False Positives)
Specificity = TN / (TN + FP)
Since the formula does not contain FN and TP, Specificity may give you a biased result, especially for imbalanced classes.
In the example of Fraud detection, it gives you the percentage of Correctly Predicted Non-Frauds from the pool of Actual Non-Frauds.
Specificity = 999,890 / (999,890 + 10) = 0.99999 ≈ 1
Precision, also known as Positive Predictive Value, is calculated as:
Precision = No. of True Positives / (No. of True Positives + No. of False Positives)
Precision = TP / (TP + FP)
Since the formula does not contain FN and TN, Precision may give you a biased result, especially for imbalanced classes.
In the example of Fraud detection, it gives you the percentage of Correctly Predicted Frauds from the pool of Total Predicted Frauds.
Precision = 90 / (90 + 10) = 0.90
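All three metrics can be checked against the confusion-matrix counts given earlier (TP = 90, FN = 10, FP = 10, TN = 999,890):

```python
# Confusion-matrix counts from the fraud example above.
tp, fn, fp, tn = 90, 10, 10, 999_890

sensitivity = tp / (tp + fn)   # Recall / True Positive Rate
specificity = tn / (tn + fp)   # True Negative Rate
precision   = tp / (tp + fp)   # Positive Predictive Value

print(f"Sensitivity: {sensitivity:.2f}")   # 0.90
print(f"Specificity: {specificity:.5f}")   # 0.99999
print(f"Precision:   {precision:.2f}")     # 0.90
```

Note how Specificity is nearly perfect simply because TN dominates the dataset, while Sensitivity and Precision carry the information about how well the rare Fraud class is actually handled.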