Systematic Fraud Detection Through Automated Data Analytics in MATLAB
Fraud detection is one of the most challenging analytics use cases because of the number of factors it depends on. Here, we use hedge fund data to demonstrate how MATLAB can automate the process of acquiring and analyzing data for fraud detection.
Analyzing the Returns Data
Since misbehavior or fraud in hedge funds manifests itself mainly in misreported data, academic researchers have focused on devising methods to analyze and flag potentially manipulated fund returns. We compute metrics introduced by Bollen and Pool² on the reported hedge fund returns and use them as potential indicators of fraud. For example:
- Discontinuity at zero in the fund’s returns distribution
- Low correlation with other assets, contradicting market trends
- Unconditional and conditional serial correlation, indicating smoother than expected trends
- Number of returns equal to zero
- Number of negative, unique, and consecutive identical returns
- Distribution of the first digit (Does it follow Benford’s law?) and the last digit (Is it uniform?) of reported returns
To illustrate the techniques, we will focus on discontinuity at zero.
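Before turning to that test, several of the simpler indicators in the list above can be sketched in a few lines. The following is a minimal illustration, not the computation used to produce the results in this article. The return vector r is hypothetical, returns are assumed to be reported in percent with two decimals, zero returns are excluded from the digit tests, and the chi-square style distances are only one possible way to compare digit frequencies with Benford's law and with the uniform distribution. With so few observations the statistics carry no weight; the point is the mechanics. The code uses histcounts, which requires R2014b or later.

```matlab
% Hypothetical monthly returns in percent, reported to two decimals.
r = [1.20; 0.85; 0.00; -0.30; 0.85; 0.85; 2.10; 0.00; 1.05; -1.40];

% Simple counting indicators
nZero     = sum(r == 0);          % returns exactly equal to zero
nNegative = sum(r < 0);           % negative returns
nUnique   = numel(unique(r));     % distinct reported values
nRepeat   = sum(diff(r) == 0);    % returns identical to the previous month

% Unconditional lag-1 serial correlation
c    = corrcoef(r(1:end-1), r(2:end));
rho1 = c(1,2);

% Work in integer hundredths of a percent so that digit extraction
% is not affected by floating-point rounding (assumption: two decimals)
k = round(abs(r(r ~= 0)) * 100);
lastDigit = mod(k, 10);

% First digit: drop trailing digits until a single digit remains
firstDigit = k;
while any(firstDigit >= 10)
    big = firstDigit >= 10;
    firstDigit(big) = floor(firstDigit(big) / 10);
end

% Distance of first-digit frequencies from Benford's law
n         = numel(k);
obsFirst  = histcounts(firstDigit, 0.5:1:9.5);     % counts for digits 1..9
expFirst  = n * log10(1 + 1 ./ (1:9));             % Benford expected counts
chi2First = sum((obsFirst - expFirst).^2 ./ expFirst);

% Distance of last-digit frequencies from the uniform distribution
obsLast  = histcounts(lastDigit, -0.5:1:9.5);      % counts for digits 0..9
expLast  = n / 10 * ones(1, 10);
chi2Last = sum((obsLast - expLast).^2 ./ expLast);
```

Large values of chi2First or chi2Last relative to a chi-square critical value would suggest digit patterns that deserve a closer look. Choosing that critical value, and deciding how to treat zero returns, is left open here.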
Testing for Discontinuity at Zero
Since funds with a higher number of positive returns attract more capital, fund managers have an incentive to misreport results to avoid negative returns. This means that a discontinuity at zero can be a potential indicator for fraud.
One test for such a discontinuity counts the return observations that fall in three adjacent bins, two to the left of zero and one to the right. The number of observations in the middle bin (the one just below zero) should approximately equal the average of the counts in the two surrounding bins. A significant shortfall in the middle bin is flagged.
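The bin-counting test described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure behind Figure 3: the returns in r are hypothetical, the bin width of 0.5 percentage points is chosen by hand, and the standardization uses a variance approximation common in the earnings-discontinuity literature, which the text does not prescribe. The one-sided 5% critical value zCrit is likewise an assumption.

```matlab
% Hypothetical monthly returns in percent (illustrative only).
r = [0.4 0.7 0.2 1.1 0.9 0.3 0.6 -1.2 0.8 0.1 0.5 -0.9 1.3 0.2 0.7 ...
     0.4 -1.1 0.6 0.3 0.9 0.8 -0.6 0.5 0.2]';

binWidth = 0.5;                       % assumed bin width in percentage points
edges    = [-2 -1 0 1] * binWidth;    % two bins left of zero, one to the right

% counts = [left, middle, right]. histcounts assigns a return of exactly
% zero to the right bin, and the last bin also includes its right edge.
counts = histcounts(r, edges);

% Expected middle count: average of the two surrounding bins
expectedMiddle = (counts(1) + counts(3)) / 2;

% Approximate variance of (observed - expected) for the middle bin
% (assumption: multinomial-based approximation, not taken from this article)
nObs    = numel(r);
p       = counts / nObs;
varDiff = nObs*p(2)*(1 - p(2)) + 0.25*nObs*(p(1) + p(3))*(1 - p(1) - p(3));

z         = (counts(2) - expectedMiddle) / sqrt(varDiff);
zCrit     = -1.645;                   % assumed one-sided 5% threshold
isFlagged = z < zCrit;                % true when the middle bin falls short
```

For this hypothetical series the middle bin is empty while its neighbors hold 2 and 10 observations, so the fund is flagged. In practice the bin width would be derived from the data, and the sketch does not guard against the degenerate case in which all three bins are empty.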
Figure 3 shows the histograms of the funds’ returns, with the two bins around zero highlighted. Green bars indicate no flag, and red bars indicate potential fraud. Only the Madoff fund did not pass this test.
Figure 3. Histograms of monthly returns for funds under consideration.
Results for Funds Under Consideration
Applying all the tests described above to the present data yields a table of indicators for each fund (Figure 4).
Figure 4. Test results for funds under consideration. Red boxes indicate results that raised a flag.
The Madoff fund raised a flag in nine out of ten tests, but the other two funds also raised flags. Positive test results do not prove that a given hedge fund was involved in fraudulent activities. However, a table like the one shown in Figure 4 indicates funds that merit further investigation.
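A table of flags like this one can be assembled and summarized with a few lines of MATLAB. The sketch below uses hypothetical fund names, test names, and flag values, and the review threshold of two flags is arbitrary. None of these values are taken from Figure 4.

```matlab
% Hypothetical funds, tests, and flag values (illustrative only).
fundNames = {'FundA'; 'FundB'; 'FundC'};
testNames = {'DiscontinuityAtZero', 'LowCorrelation', ...
             'SerialCorrelation', 'ZeroReturns'};
flags = logical([1 1 1 0;      % one row per fund, one column per test
                 0 1 0 0;
                 0 0 0 1]);

% Collect the flags in a table and count them per fund
flagTable = array2table(flags, 'RowNames', fundNames, ...
                        'VariableNames', testNames);
flagTable.NumFlags = sum(flags, 2);

% Funds with at least minFlags raised flags are marked for review
minFlags    = 2;               % assumed threshold
needsReview = flagTable.Properties.RowNames(flagTable.NumFlags >= minFlags);
```

The numeric or logical flag matrix is also the natural input for the classification step that follows, with one row per fund and one column per indicator.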
Classifying Analysis Results with Machine Learning
We now have a set of flags that can be used as indicators for fraud. Automating the analytics enables us to review larger data sets and to use the computed flags to categorize funds as fraudulent or non-fraudulent. This classification problem can be addressed using machine learning methods, for example, bagged decision trees built with the TreeBagger algorithm in Statistics and Machine Learning Toolbox™. Because TreeBagger is a supervised learning method, it requires labeled data to train the models. Note that our example uses data for only three funds. Applying bagged decision trees or other machine learning methods to an actual problem would require considerably more data than this small, illustrative set.
We want to build a model to classify funds as fraudulent or non-fraudulent, applying the indicators described in the section “Analyzing the Returns Data” as predictor variables. To create the model, we need a training set of data. Consider M hedge funds that are known to be fraudulent or non-fraudulent. We store this information in the M-by-1 vector yTrain and compute the corresponding M-by-N matrix xTrain of indicators, where N is the number of indicators. We can then create a bagged decision tree model using the following code:
% Create fraud detection model based on training data
fraudModel = TreeBagger(nTrees,xTrain,yTrain);
where nTrees is the number of decision trees created based on bootstrapped samples of the training data. The output of the nTrees decision trees is aggregated into a single classification.
Now, for a new fund, the classification can be performed by
% Apply fraud detection model to new data
isFraud = predict(fraudModel, xNew);
We can use the fraud detection model to classify hedge funds based purely on their returns data. Since the model is automated, it can be scaled to a large number of funds.
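To see the two snippets above working end to end, the following sketch trains and applies a model on synthetic data. Everything about the data is an assumption made for illustration: the number of funds M, the number of indicators N, the share of fraudulent funds, and the probabilities with which flags are raised. The sketch requires Statistics and Machine Learning Toolbox and, because it relies on implicit expansion, R2016b or later. Note that for classification, predict returns the class labels as a cell array of character vectors, so a logical fraud indicator has to be derived from the labels.

```matlab
rng(0);                                   % reproducible synthetic data

% Synthetic training set (illustrative only): M funds, N binary indicators
M = 200;
N = 10;
yIsFraud = rand(M, 1) < 0.3;              % assumed share of fraudulent funds

% Assumption: fraudulent funds raise each flag with probability 0.7,
% all other funds with probability 0.2
xTrain = double(rand(M, N) < 0.2 + 0.5 * yIsFraud);

% Class labels as a cell array of character vectors
yTrain = repmat({'non-fraudulent'}, M, 1);
yTrain(yIsFraud) = {'fraudulent'};

% Create fraud detection model and keep out-of-bag information
nTrees = 50;
fraudModel = TreeBagger(nTrees, xTrain, yTrain, ...
    'Method', 'classification', 'OOBPrediction', 'on');

% Out-of-bag misclassification rate of the full ensemble
oobErr = oobError(fraudModel);
oobErr = oobErr(end);

% Apply the model to two new hypothetical funds:
% one with every flag raised, one with none
xNew = [ones(1, N); zeros(1, N)];
[label, score] = predict(fraudModel, xNew);

% Convert the label strings to a logical indicator
isFraud = strcmp(label, 'fraudulent');
```

The columns of score follow the order of fraudModel.ClassNames and give the fraction of trees voting for each class. The out-of-bag error offers a first estimate of how well the model generalizes without setting aside a separate test set.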
The Bigger Picture
This article outlines the process of developing a fully automated algorithm for fraud detection based on hedge fund returns. The approach can be applied to a much larger data set using large-scale data processing solutions such as MATLAB Distributed Computing Server™ and Apache™ Hadoop®. Both technologies enable you to cope with data that exceeds the amount of memory available on a single machine.
The context in which the algorithm is deployed depends largely on the use case. Fund-of-funds managers working mostly with Excel might prefer to deploy the algorithm as an Excel add-in. They could use the module to investigate funds under consideration for future investments. Regulatory authorities could integrate a fraud detection scheme into their production systems, where it would periodically perform the analysis on new data, summarizing results in an automatically generated report.
We used advanced statistics to compute individual fraud indicators, and machine learning to create the classification model. In addition to the bagged decision trees discussed here, many other machine learning techniques are available in MATLAB, Statistics and Machine Learning Toolbox, and Neural Network Toolbox™, enabling you to extend or alter the proposed solution according to the requirements of your project.
2 Bollen, Nicolas P. B., and Pool, Veronika K. “Suspicious Patterns in Hedge Fund Returns and the Risk of Fraud” (November 2011). http://www2.owen.vanderbilt.edu/nick.bollen/
Published 2014 – 92196v00
Reposted by permission. MathWorks retains full copyright of this paper.