Introduction to Fraud Detection Systems
Using the Python gradient boosting library LightGBM, this article introduces fraud detection systems, with code samples included to help you get started.
By Miguel Gonzalez-Fierro, Microsoft.
Fraud detection is one of the top priorities for banks and financial institutions, and it can be addressed using machine learning. According to a report published by Nilson, worldwide losses from card fraud reached 22.8 billion dollars in 2017. The problem is forecast to get worse in the following years: by 2021, the card fraud bill is expected to reach 32.96 billion dollars.
In this tutorial, we will use the credit card fraud detection dataset from Kaggle to identify fraud cases. We will use a gradient boosted tree as the machine learning algorithm, and finally we will create a simple API to operationalize (o16n) the model.
Fraud detection problems are known for being extremely imbalanced. Boosting is one technique that usually works well on this kind of dataset. It iteratively builds an ensemble of weak classifiers (decision trees), reweighting the training instances to improve performance: a weak classifier is trained and tested on the training data, and the instances it gets wrong are weighted so that they appear more often in the next data subset. Finally, all the classifiers are ensembled with a weighted average of their estimates.
In LightGBM, there is a parameter called is_unbalance that automatically helps you control this issue.
LightGBM can be used with or without a GPU. For small datasets, like the one we are using here, it is faster to use the CPU, due to data-transfer overhead. However, I wanted to showcase the GPU alternative, which is trickier to install, in case anyone wants to experiment with bigger datasets.
To install the dependencies in Linux:
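The original install commands are not reproduced here; the following is a hedged sketch for the CPU build, assuming a conda environment named fraud (the same name that appears in the shell prompt later in the article). The GPU build of LightGBM requires additional steps (OpenCL drivers and compiling from source) not covered by this sketch.

```shell
# Create and activate a conda environment for the tutorial
# (the name "fraud" matches the prompt shown later in the article).
conda create -n fraud python=3.6 -y
conda activate fraud

# Install the CPU build of LightGBM plus the libraries used in this tutorial.
pip install lightgbm numpy pandas scikit-learn flask flask-socketio requests
```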
The first step is to load the dataset and analyze it. Before continuing, you have to run the notebook data_prep.ipynb, which will generate the SQLite database.
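Loading the table back from SQLite can be sketched as follows. The database and table names here are assumptions; match them to whatever data_prep.ipynb actually creates. For the sake of a runnable example, a tiny stand-in database is built in memory first.

```python
import sqlite3
import pandas as pd

# Build a tiny stand-in database; in the tutorial this file is produced
# by data_prep.ipynb (the table name "transactions" is an assumption).
conn = sqlite3.connect(":memory:")
pd.DataFrame({"V1": [0.1, -1.2], "Amount": [20.0, 150.0],
              "Class": [0, 1]}).to_sql("transactions", conn, index=False)

# Load the whole table into a dataframe for analysis.
df = pd.read_sql("SELECT * FROM transactions", conn)
conn.close()
print(df.shape)  # (2, 3) for this stand-in; (284807, 31) for the full dataset
```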
5 rows × 31 columns
As we can see, the dataset is extremely imbalanced: the minority class accounts for around 0.17% of the examples.
The next step is to split the dataset into train and test.
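A minimal sketch of the split, using a toy dataframe with the same layout as the Kaggle data (30 feature columns plus a binary label column named Class). Stratifying matters here: with so few fraud cases, a plain random split could leave one side with almost no positives.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Kaggle dataframe: 30 feature columns plus the
# binary label "Class" (the real dataset uses the same column name).
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randn(1000, 30), columns=[f"V{i}" for i in range(1, 31)])
df["Class"] = [1] * 20 + [0] * 980  # ~2% fraud, for illustration

X = df.drop("Class", axis=1)
y = df["Class"]

# Stratify so the tiny fraud class is represented in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)  # (800, 30) (200, 30)
```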
Training with LightGBM - Baseline
For this task we use a simple set of parameters to train the model. We just want to create a baseline model, so we are not performing cross-validation or parameter tuning here.
Once we have the trained model, we can obtain some metrics.
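With imbalanced data, accuracy alone is misleading (predicting "not fraud" for everything already scores above 99%), so it is worth looking at the confusion matrix, precision, recall, F1 and AUC. A self-contained sketch with toy labels and probabilities:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Toy ground truth and predicted probabilities, for illustration only.
y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.05, 0.3, 0.15, 0.4, 0.9, 0.8, 0.6, 0.35])

y_pred = (y_prob >= 0.5).astype(int)  # binarize at the default threshold

print(confusion_matrix(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1:       ", f1_score(y_test, y_pred))
print("auc:      ", roc_auc_score(y_test, y_prob))
```

Recall is usually the metric to watch in fraud detection: a false negative (missed fraud) is money lost, while a false positive triggers an investigation.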
In business terms, if the system classifies a fair transaction as fraud (false positive), the bank will investigate the issue probably using human intervention. According to a 2015 report from Javelin Strategy, 15% of all cardholders have had at least one transaction incorrectly declined in the previous year, representing an annual decline amount of almost $118 billion. Nearly 4 in 10 declined cardholders report that they abandoned their card after being falsely declined.
However, if a fraudulent transaction is not detected, effectively meaning that the classifier predicts that a transaction is fair when it is really fraudulent (false negative), then the bank is losing money and the bad guy is getting away with it.
A common way to encode business rules in these predictions is to control the threshold, or operating point, of the prediction. This can be controlled by changing the threshold value in binarize_prediction(y_prob, threshold=0.5). It is common to loop over thresholds from 0.1 to 0.9 and evaluate the different business outcomes.
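That sweep can be sketched as follows. The binarize_prediction helper is reimplemented here under the assumption that it simply thresholds the probabilities; toy data stands in for the real predictions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def binarize_prediction(y_prob, threshold=0.5):
    # Assumed behavior of the article's helper: hard 0/1 labels.
    return (np.asarray(y_prob) >= threshold).astype(int)

# Toy ground truth and probabilities, for illustration only.
y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.05, 0.3, 0.15, 0.4, 0.9, 0.8, 0.6, 0.35])

# Sweep the operating point and inspect the precision/recall trade-off;
# the business picks the threshold whose outcomes it can live with.
for threshold in np.arange(0.1, 1.0, 0.1):
    y_pred = binarize_prediction(y_prob, threshold)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_test, y_pred, zero_division=0):.2f}  "
          f"recall={recall_score(y_test, y_pred):.2f}")
```

Lower thresholds catch more fraud (higher recall) at the cost of more falsely declined customers (lower precision); higher thresholds do the opposite.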
O16N with Flask and Websockets
The next step is to operationalize (o16n) the machine learning model. For this, we are going to use Flask to create a RESTful API. The input of the API will be a transaction (defined by its features), and the output, the model's prediction.
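A minimal sketch of such an endpoint is shown below. The route, payload field, and dummy model are illustrative assumptions; the real api.py may differ, and the DummyModel would be replaced by the trained LightGBM model.

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the trained LightGBM model: any object with a predict()
# method returning a fraud probability would slot in here.
class DummyModel:
    def predict(self, features):
        return np.array([0.9 if sum(features[0]) > 0 else 0.1])

model = DummyModel()

@app.route("/predict", methods=["POST"])
def predict():
    # The request body is JSON with a list of the transaction's features.
    features = [request.get_json()["features"]]
    prob = float(model.predict(features)[0])
    return jsonify({"fraud_probability": prob, "is_fraud": prob >= 0.5})

# To serve it for real: app.run(host="0.0.0.0", port=5000)
```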
To start the API, execute (fraud)$ python api.py inside the conda environment.
First, we make sure that the API is up and responding:

The fraud police is watching you
Now, we are going to select one value and predict the output.
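The query itself can be sketched with the requests library. The endpoint path and the payload field name are assumptions; the feature vector is shortened for clarity (the real dataset has 30 features), and the API from api.py must be running locally for the call to succeed.

```python
import requests

# One transaction, serialized as JSON (feature vector shortened here).
payload = {"features": [0.23, -1.11, 0.45, 1.52]}

try:
    resp = requests.post("http://localhost:5000/predict", json=payload,
                         timeout=5)
    print(resp.json())  # the model's prediction for this transaction
except requests.exceptions.ConnectionError:
    print("Start the API first: (fraud)$ python api.py")
```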
Fraudulent transaction visualization
Now that we know that the main endpoint of the API works, we will try the /predict_map endpoint, which creates a real-time visualization system for fraudulent transactions using websockets.
A websocket is a protocol intended for real-time communications developed for the HTML5 specification. It creates a persistent, low latency connection that can support transactions initiated by either the client or server. In this post you can find a detailed explanation of websockets and other related technologies.
When a transaction is sent to /predict_map, the machine learning model evaluates the transaction details and makes a prediction. If the prediction is classified as fraudulent, the server sends a signal using socketio.emit('map_update', location). This signal just contains a dictionary, called location, with a simulated name and location for where the fraudulent transaction occurred. The signal is handled in frauddetection.js. The websocket part is the following:
Each received map_update event creates an object called newLocation containing the location information, which is saved in a global array called mapLocations. This variable holds all the fraudulent locations that have appeared since the session started. Then there is a clearing step so that amCharts can draw the new information on the map, and finally the array is stored in map.dataProvider.images, which refreshes the map with the new point. The variable map is set earlier in the code and is the amCharts object responsible for defining the map.
To make a query to the visualization endpoint:
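A hedged sketch of such a query, sending a handful of transactions in a row (the endpoint path and payload field name are assumptions, and api.py must be running locally). Each transaction the model flags as fraudulent should pop up on the map.

```python
import requests

url = "http://localhost:5000/predict_map"
transactions = [
    {"features": [0.23, -1.11, 0.45, 1.52]},
    {"features": [2.10, 0.77, -0.31, 0.08]},
]

for tx in transactions:
    try:
        resp = requests.post(url, json=tx, timeout=5)
        print(resp.status_code)
    except requests.exceptions.ConnectionError:
        print("API is not running -- start it with python api.py")
        break
```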
Now you can go to the map URL (locally, http://localhost:5000/map) and see how the map is refreshed with a new fraudulent location every time you execute the previous cell. You should see a map like the following one:
Once we have the API, we can test its scalability and response time.
Here you can find a simple load test to evaluate the performance of your API. Please bear in mind that, in this case, there is no network overhead from the client and server being in different locations, since both run on the same computer.
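A minimal load test along these lines can be sketched as follows (a hedged example, not the article's original: it assumes the API from api.py is running locally on port 5000 and simply measures wall-clock time over sequential requests).

```python
import time
import requests

NUM_REQUESTS = 10
url = "http://localhost:5000/predict"
payload = {"features": [0.23, -1.11, 0.45, 1.52]}

start = time.time()
try:
    # Fire the requests sequentially and measure the total elapsed time.
    for _ in range(NUM_REQUESTS):
        requests.post(url, json=payload, timeout=5)
    elapsed = time.time() - start
    print(f"{NUM_REQUESTS} requests in {elapsed * 1000:.0f} ms "
          f"({elapsed / NUM_REQUESTS * 1000:.0f} ms per request)")
except requests.exceptions.ConnectionError:
    print("API is not running -- start it with python api.py")
```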
The total response time for 10 requests is around 300 ms, so each request takes about 30 ms.
Enterprise grade reference architecture for fraud detection
In this tutorial we have seen how to create a baseline fraud detection model. However, for a big company this is not enough.
In the next figure we can see a reference architecture for fraud detection, which should be adapted to the customer's specifics. All services are based on Azure.
1) Two general data sources for the customer: real time data and static information.
2) A general database piece to store the data. Since this is a reference architecture, and without more data, I put several options together (SQL Database, CosmosDB, SQL Data Warehouse, etc.), in the cloud or on premises.
4) Model retraining using new data and a model obtained from the Model Management.
5) Operationalization layer with a Kubernetes cluster, which takes the best model and puts it into production.
6) Reporting layer to show the results.
Original. Reposted with permission.
- Using GRAKN.AI to Detect Patterns in Credit Fraud Data
- AI for Fraud Detection – How does Mastercard do it? Learn how global leaders use AI
- Intuitive Ensemble Learning Guide with Gradient Boosting