The Best Metric to Measure Accuracy of Classification Models
Measuring the accuracy of a model for a classification problem (categorical output) is more complex and time-consuming than for regression problems (continuous output). Let’s understand the key testing metrics for a classification problem with an example.
iv) F1 score
F1 score incorporates both Recall and Precision and is calculated as,
F1 score = 2 * (Precision * Recall) / (Precision + Recall)
The F1 score represents a more balanced view than the three metrics above, but it could still give a biased result in the scenario discussed later, since it does not include TN.
F1 score = 2 * (0.90 * 0.90) / (0.90 + 0.90) = 0.90
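The calculation above can be reproduced in a few lines of plain Python (a minimal sketch; the counts TP = 90, FP = 10, FN = 10 are the ones used in Scenario A below):

```python
# F1 score from confusion-matrix counts (Scenario A: TP = 90, FP = 10, FN = 10)
tp, fp, fn = 90, 10, 10

precision = tp / (tp + fp)  # 0.90
recall = tp / (tp + fn)     # 0.90 (also called Sensitivity)

# Harmonic mean of Precision and Recall
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.2f}")  # 0.90
```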
v) Matthews Correlation Coefficient (MCC)
Unlike the other metrics discussed above, MCC takes all the cells of the Confusion Matrix into consideration in its formula.
MCC = (TP * TN – FP * FN) / √((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
Similar to a correlation coefficient, the values of MCC lie between -1 and +1: a model with a score of +1 is a perfect model, while -1 indicates total disagreement between predictions and actuals. This property is one of the key strengths of MCC, as it makes the score easy to interpret.
MCC = (90 * 999,890 – 10 * 10) / √((90 + 10) * (90 + 10) * (999,890 + 10) * (999,890 + 10))
MCC = 0.90
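The same arithmetic can be checked in Python (a minimal sketch using the Scenario A counts below):

```python
import math

# MCC from confusion-matrix counts (Scenario A: TP = 90, FN = 10, FP = 10, TN = 999,890)
tp, fn, fp, tn = 90, 10, 10, 999_890

numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = numerator / denominator
print(f"{mcc:.2f}")  # 0.90
```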
We will test and compare the results of the classification model at a few probability cut-off values using the above-mentioned testing metrics.
Scenario A: Confusion Matrix at cut-off value of 0.5
We shall take this scenario (cut-off value of 0.5) as the base case and compare the result of the base case with different cut-off values.
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP = 90            | FN = 10            |
| Actual Negative | FP = 10            | TN = 999,890       |
Scenario B: Confusion Matrix at cut-off value of 0.4
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP = 90            | FN = 10            |
| Actual Negative | FP = 1,910         | TN = 997,990       |
It can be clearly observed that in Scenario B there is a substantial increase in FP compared to Scenario A, so the metrics should deteriorate. Yet Sensitivity remains unchanged, and Specificity changes only marginally, so neither reflects the deterioration.
Scenario C: Confusion Matrix at cut-off value of 0.6
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP = 90            | FN = 1,910         |
| Actual Negative | FP = 10            | TN = 997,990       |
There is a substantial increase in FN compared to Scenario A, so the metrics should deteriorate relative to A. Here Specificity and Precision are unchanged, while the other metrics decline.
Based on our findings, we can say that the F1 score and MCC make more sense than Sensitivity and Specificity.
In the example, we have built a model to predict Fraud. We can use the same model to predict Non-Fraud. In such a case, the Confusion Matrix will be as given below:
Scenario D: Confusion Matrix at cut-off value of 0.5
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP = 999,890       | FN = 10            |
| Actual Negative | FP = 10            | TN = 90            |
The above confusion matrix is obtained from the matrix in Scenario A by swapping the positive and negative classes, since the model now predicts Non-Frauds instead of Frauds. The True Negatives in Scenario A become the True Positives in Scenario D, and likewise for the other cells. Ideally, the testing metrics should be the same for Scenarios A and D.
Except for MCC, all the other testing metrics change.
Summary of Testing Metrics for all the scenarios:
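All the metrics for the four scenarios can be recomputed directly from the confusion matrices above. The sketch below does exactly that (scenario labels are shortened; each entry is the TP, FN, FP, TN counts given earlier):

```python
import math

# Confusion-matrix cells per scenario, as (TP, FN, FP, TN)
scenarios = {
    "A (cut-off 0.5)": (90, 10, 10, 999_890),
    "B (cut-off 0.4)": (90, 10, 1_910, 997_990),
    "C (cut-off 0.6)": (90, 1_910, 10, 997_990),
    "D (Non-Fraud)":   (999_890, 10, 10, 90),
}

def metrics(tp, fn, fp, tn):
    """Return (Sensitivity, Specificity, Precision, F1, MCC) for one matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sensitivity, specificity, precision, f1, mcc

print(f"{'Scenario':<16} {'Sens':>6} {'Spec':>6} {'Prec':>6} {'F1':>6} {'MCC':>6}")
for name, cells in scenarios.items():
    row = " ".join(f"{m:6.3f}" for m in metrics(*cells))
    print(f"{name:<16} {row}")
```

Running this makes the article's point visible at a glance: MCC is the only metric that is identical for Scenarios A and D.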
As an analyst, if you are looking for a metric to measure and maximize the overall accuracy of a classification model, MCC seems to be the best bet, since it is not only easily interpretable but also robust to changes in the prediction goal.
Original post. Reposted with permission.
Bio: Jacob Joseph works with CleverTap, a digital analytics, user engagement and personalization platform, where he leads the data science team. His role encompasses deriving key actionable business insights and applying machine learning algorithms to augment CleverTap’s effort to deliver world-class real-time analytics to its customers.