Machine Learning Model Metrics

In this article we explore how to calculate machine learning model metrics, using the example of fraud detection. We'll see lots of different ways that we can try to understand just how good our learned model is.

By Jeff Smith.

Reactive Machine Learning Systems

Kangaroo Kapital is the largest credit card company in Australia. Animals across the continent use Kangaroo Kapital credit cards to make all of their daily purchases, racking up points in the company's reward system.

Since Australian animals have traditionally not worn much clothing, the challenges of carrying around cash are substantial. Only having to keep track of a single credit card is a big help for your average working wallaby. But, since no clothes means no pockets, even keeping track of one credit card can be problematic.

Cards are often misplaced, leading to a problem with theft and fraudulent use. To fight credit card fraud, Kangaroo Kapital has engaged you to head up their fraud team and create a machine learning model that can correctly detect and predict fraud.

This article has been excerpted from Reactive Machine Learning Systems—get it for 40% off with code rmlskdn at
If you want to follow along at home, here’s the code repo with the code relevant to this article

Kangaroo Kapital has supplied relevant data, which we have divided up. We can use some of our data to train our models and the rest to test or evaluate any learned models. Listing 1 recaps an example model learning process, using Spark's MLlib. We'll start out without using much new functionality from MLlib. Instead, we'll focus on how the model learning process connects to the work at hand. In this example, we'll learn a binary classification model using logistic regression.

Logistic Regression

Logistic regression is a quite common model learning algorithm. It is a regression model used to predict categorical variables (e.g. fraudulent vs. non-fraudulent credit card charges). A deeper discussion of the details of the algorithm is beyond the scope of this article, but as usual, Wikipedia has a good introduction. When it comes to building machine learning systems, logistic regression has several advantages: it is widely implemented, there are efficient distributed implementations, the model size scales linearly with the number of features, the importance of each feature to the model is easily analyzable, and so on. In this case, using logistic regression allows us to use even more library functionality from MLlib to evaluate our learned model than is available for less popular or more sophisticated model learning algorithms.
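
To make the idea a little more concrete, here is a minimal sketch of how a learned logistic regression model turns features into a fraud prediction. It is not MLlib's implementation; the object name LogisticSketch, its method names, and the 0.5 default threshold are assumptions made for illustration.

```scala
object LogisticSketch {
  // The logistic (sigmoid) function squashes any real number into (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Estimated probability of the positive class (fraud) for one feature vector:
  // a weighted sum of the features plus the intercept, passed through the sigmoid.
  def probability(coefficients: Seq[Double], intercept: Double, features: Seq[Double]): Double = {
    require(coefficients.length == features.length, "one coefficient per feature")
    val margin = coefficients.zip(features).map { case (w, x) => w * x }.sum + intercept
    sigmoid(margin)
  }

  // Convert the probability into a class label. The 0.5 threshold is an assumed default.
  def predict(coefficients: Seq[Double], intercept: Double, features: Seq[Double],
              threshold: Double = 0.5): Boolean =
    probability(coefficients, intercept, features) >= threshold
}
```

Because the model is just one coefficient per feature plus an intercept, its size grows linearly with the number of features, and the magnitude and sign of each coefficient hint at how that feature pushes the prediction.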

Listing 1. Learning a Model

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder.appName("Fraud Model").getOrCreate() // (1)
import session.implicits._ // (2)
val data = session.read.format("libsvm").load("src/main/resources/sample_libsvm_data.txt") // (3)
val Array(trainingData, testingData) = data.randomSplit(Array(0.8, 0.2)) // (4)
val learningAlgo = new LogisticRegression() // (5)
val model = learningAlgo.fit(trainingData) // (6)
println(s"Model coefficients: ${model.coefficients} Model intercept: ${model.intercept}") // (7)


1. Creating a new session
2. Importing some useful implicit conversions for use with DataFrames
3. Loading some sample data, stored in LibSVM format
4. Randomly splitting our sample data into our training and testing sets
5. Instantiating a new instance of a logistic regression classifier
6. Learning the model over the training set
7. Printing the parameters of the model for inspection

In this case, we'll use some standard sample data to stand in for our Kangaroo Kapital credit card data (found in the code repo). We can refactor this code later to ingest from our statically-typed transactional data. This sample data will just allow us to get the basics of our training and testing process set up quickly. Note that we're also using a less sophisticated method of splitting our data between training and testing than we did before. Again, this is just to give us a simple but runnable prototype that we can refactor to use our credit card data later. Both the sample data and the simple random train/test splitting function are provided by the Spark project to make getting started building models easier.
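
The idea behind a simple random train/test split can be sketched in plain Scala, without Spark. This is only an illustration of the technique, not how Spark implements it; SplitSketch, trainFraction, and seed are hypothetical names. Note that each record is assigned independently, so the resulting split sizes are only approximately 80/20.

```scala
import scala.util.Random

object SplitSketch {
  // Assign each record to the training set with probability trainFraction,
  // otherwise to the testing set. A fixed seed makes the split repeatable.
  def randomSplit[A](records: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainFraction >= 0.0 && trainFraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    records.partition(_ => rng.nextDouble() < trainFraction)
  }
}
```

Fixing the seed is a design choice worth considering even in a prototype: it lets us rerun the experiment and compare models on exactly the same testing set.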

At the end of Listing 1 we produced an instance of a LogisticRegressionModel. We can now use some library functionality to inspect and reason about our model. At this point in the process, we have absolutely no idea what our model is like. The outcome of the model learning process is definitionally uncertain; we could have a very useful model or complete garbage.

First, we can understand some of the metrics that can be computed about the model's performance on the training set. To do that, we need to discuss how it is we measure the performance of a classifier. In binary classification problems we often refer to the two classes as positive and negative. In the case of Kangaroo Kapital, the positive case would be that fraud had occurred, and the negative case would be that no fraud had occurred. A given classifier can then be scored on its performance within those classes. This is true whether the classifier is a machine learned model, a dingo deciding based on smell, or just a flip of an Australian dollar coin. The conventional terminology is to call correct predictions true and incorrect predictions false. Putting all of this together yields the two-by-two matrix shown in Figure 1, known as a confusion matrix.



So, a true positive is when the model predicted a fraud correctly. A false positive is when the model predicted a fraud incorrectly. A true negative is when the model predicted a normal (not fraudulent) transaction correctly. Finally, a false negative is when the model predicted a normal transaction incorrectly, and there was in fact fraud.
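
These four outcomes can be tallied directly from pairs of predictions and actual labels. The following is a minimal sketch in plain Scala; the names ConfusionSketch, ConfusionMatrix, and count are hypothetical, and each input pair is assumed to be (predictedFraud, actualFraud).

```scala
object ConfusionSketch {
  // Counts for the four cells of a binary confusion matrix.
  final case class ConfusionMatrix(truePositives: Long, falsePositives: Long,
                                   trueNegatives: Long, falseNegatives: Long)

  // Each pair is (predictedFraud, actualFraud).
  def count(outcomes: Seq[(Boolean, Boolean)]): ConfusionMatrix =
    outcomes.foldLeft(ConfusionMatrix(0L, 0L, 0L, 0L)) {
      case (m, (true, true))   => m.copy(truePositives = m.truePositives + 1)   // predicted fraud, was fraud
      case (m, (true, false))  => m.copy(falsePositives = m.falsePositives + 1) // predicted fraud, was normal
      case (m, (false, false)) => m.copy(trueNegatives = m.trueNegatives + 1)   // predicted normal, was normal
      case (m, (false, true))  => m.copy(falseNegatives = m.falseNegatives + 1) // predicted normal, was fraud
    }
}
```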

With these four counts, we can calculate a number of statistics to help us evaluate models. First, we can evaluate the precision of a model. The precision of a model is defined as the number of true positives divided by the total number of positive predictions (Listing 2).

Listing 2. Precision

precision = true positives / (true positives + false positives)


Precision is really important for Kangaroo Kapital. If the fraud model's precision isn't high enough, then they will spend all of their fraud investigation budget investigating normal, non-fraudulent transactions.
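
As a small sketch of the calculation (PrecisionSketch and precision are hypothetical names, and the figures in the usage note are made up for illustration), precision can be computed from the two relevant counts. Returning an Option is one way to handle a model that never predicts fraud, where precision is undefined.

```scala
object PrecisionSketch {
  // Precision = TP / (TP + FP). None when the model made no positive predictions.
  def precision(truePositives: Long, falsePositives: Long): Option[Double] = {
    val predictedPositives = truePositives + falsePositives
    if (predictedPositives == 0) None
    else Some(truePositives.toDouble / predictedPositives)
  }
}
```

For instance, if the model flagged 40 transactions and 30 of them were really fraudulent, PrecisionSketch.precision(30, 10) gives 0.75, meaning a quarter of the investigation effort would go to normal transactions.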

There is another statistic, called recall, that is also important for the kangaroos. If the kangaroos' fraud model's recall isn't high enough, it will be too easy for animals to commit credit card fraud and never get caught, and that will get expensive.

Recall is defined as the number of true positives divided by the number of actual positives in the set, whether or not the model caught them (Listing 3).

Listing 3. Recall

recall = true positives / (true positives + false negatives)
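
Here is a minimal sketch of recall computed straight from labeled outcomes, assuming each pair is (predictedFraud, actualFraud). RecallSketch and recall are hypothetical names. Counting the actual positives directly is equivalent to adding true positives and false negatives.

```scala
object RecallSketch {
  // Each pair is (predictedFraud, actualFraud).
  // Recall = TP / (TP + FN). None when the set contains no actual fraud.
  def recall(outcomes: Seq[(Boolean, Boolean)]): Option[Double] = {
    val actualPositives = outcomes.count { case (_, actual) => actual }
    val truePositives = outcomes.count { case (predicted, actual) => predicted && actual }
    if (actualPositives == 0) None
    else Some(truePositives.toDouble / actualPositives)
  }
}
```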


Depending on the context, recall also goes by other names, such as the true positive rate. There is another statistic related to recall called the false positive rate or fall-out. The false positive rate is defined as the number of false positives divided by the number of all actual negatives in the set (Listing 4).

Listing 4. False Positive Rate

false positive rate = false positives / (false positives + true negatives)
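
The same calculation as a short sketch, following the prose definition: the denominator is every actual negative, which means the false positives plus the true negatives. FalsePositiveRateSketch and falsePositiveRate are hypothetical names.

```scala
object FalsePositiveRateSketch {
  // False positive rate = FP / (FP + TN). None when the set contains no actual negatives.
  def falsePositiveRate(falsePositives: Long, trueNegatives: Long): Option[Double] = {
    val actualNegatives = falsePositives + trueNegatives
    if (actualNegatives == 0) None
    else Some(falsePositives.toDouble / actualNegatives)
  }
}
```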


You can understand how a model trades off the true positive rate (AKA recall) versus the false positive rate using a plot called an ROC curve.

NOTE: ROC Curves
The acronym ROC stands for Receiver Operating Characteristic. The technique and the name originate in work on radar during World War II. While the technique is still useful, the name really has no relationship to its current common usage, so it is rarely referred to by anything other than the acronym ROC.

A typical ROC curve plot might look something like Figure 2.



The false positive rate is on the X-axis and the true positive rate is on the Y-axis. The diagonal line of x = y represents the expected performance of a random model, so a usable model's curve should be above that line.
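
To see where the points on such a curve come from, here is a minimal sketch that sweeps a decision threshold over scored examples and records one (false positive rate, true positive rate) point per threshold. It assumes each example is a pair of (score, actualFraud), where the score is the model's estimated probability of fraud. RocSketch and rocPoints are hypothetical names, and this is not how MLlib computes its curves.

```scala
object RocSketch {
  // Each pair is (score, actualFraud).
  // Returns one (falsePositiveRate, truePositiveRate) point per threshold.
  def rocPoints(scored: Seq[(Double, Boolean)], thresholds: Seq[Double]): Seq[(Double, Double)] = {
    val positives = scored.count { case (_, actual) => actual }
    val negatives = scored.length - positives
    require(positives > 0 && negatives > 0, "need at least one example of each class")
    thresholds.map { t =>
      // A transaction is predicted fraudulent when its score reaches the threshold.
      val tp = scored.count { case (score, actual) => score >= t && actual }
      val fp = scored.count { case (score, actual) => score >= t && !actual }
      (fp.toDouble / negatives, tp.toDouble / positives)
    }
  }
}
```

A threshold of 0.0 flags everything and lands at the top-right corner (1.0, 1.0); a threshold above every score flags nothing and lands at the origin. For a usable model, the points in between should have a true positive rate higher than their false positive rate, which is what puts the curve above the diagonal.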