Systematic Fraud Detection Through Automated Data Analytics in MATLAB


Fraud detection is one of the most challenging use cases, given the number of factors it depends on. Here, we use hedge fund data to demonstrate how MATLAB can automate the process of acquiring and analyzing fraud detection data.



By Jan Eggers, MathWorks.

As the Madoff Ponzi scheme and recent high-profile rate-rigging scandals have shown, fraud is a significant threat to financial organizations, government institutions, and individual investors. Financial services and other organizations have responded by stepping up their efforts to detect fraud.

Systematic fraud detection presents several challenges. First, fraud detection methods require complex investigations that involve processing large amounts of heterogeneous data. The data is derived from multiple sources and crosses multiple knowledge domains, including finance, economics, business, and law. Gathering and processing this data manually is prohibitively time-consuming and error-prone. Second, fraud is a “needle in a haystack” problem because only a very small fraction of the data is likely to come from a fraudulent case. The vast quantity of regular data (that is, data produced by non-fraudulent sources) tends to mask the cases of fraud. Third, fraudsters continually change their methods, which means that detection strategies are frequently several steps behind.

Using hedge fund data as an example, this article demonstrates how MATLAB® can be used to automate the process of acquiring and analyzing fraud detection data. It shows how to import and aggregate heterogeneous data, construct and test models to identify indicators of potential fraud, and train machine learning models on the calculated indicators to classify a fund as fraudulent or non-fraudulent.

The statistical techniques and workflow described are applicable to any area requiring detailed analysis of large amounts of heterogeneous data from multiple sources, including data mining and operations research tasks in retail and logistics analysis, defense intelligence, and medical informatics.

The Hedge Fund Case Study

The number of hedge funds has grown exponentially in recent years: The Eurekahedge database indicates a total of approximately 20,000 active funds worldwide [1]. Hedge funds are minimally regulated investment vehicles and, therefore, prime targets for fraud. For example, hedge fund managers may fake return data to create the illusion of high profits and attract more investors.

We will use monthly returns data from January 1991 to October 2008 from three hedge funds:

  • Gateway Fund
  • Growth Fund of America
  • Fairfield Sentry Fund

The Fairfield Sentry Fund is a Madoff fund known to have reported fake data. As such, it offers a benchmark for verifying the efficacy of fraud detection mechanisms.

Gathering Heterogeneous Data

Data for the Gateway Fund can be downloaded from the Natixis web site as a Microsoft® Excel® file containing the net asset value (NAV) of the fund on a monthly basis. Using the MATLAB Data Import Tool, we define how the data is to be imported (Figure 1). The Data Import Tool can automatically generate the MATLAB code to reproduce the defined import style.

Figure 1. The MATLAB Data Import Tool for interactively importing data from files.

After importing the NAV for the Gateway Fund, we use the following code to calculate the monthly returns:
% Calculate monthly returns
gatewayReturns = tick2ret(gatewayNAV);
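
To make the calculation concrete, the simple-return formula that tick2ret applies by default can be sketched in base MATLAB. The NAV values below are invented for illustration and are not taken from the Gateway Fund data; the variable names are likewise hypothetical.

```matlab
% Hypothetical month-end NAV values (illustrative only, not Gateway Fund data)
navExample = [100; 102; 101; 105];

% Simple periodic return: r(t) = NAV(t) / NAV(t-1) - 1
retExample = diff(navExample) ./ navExample(1:end-1);
```

A series of n NAV observations yields n-1 returns, so the first month of the sample has no return value.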

For the Growth Fund of America, we use Datafeed Toolbox™ to obtain data from Yahoo! Finance, specifying the ticker symbol for the fund (AGTHX), the name of the relevant field (adjusted close price), and the time period of interest:

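% Connect to Yahoo! Finance and fetch the adjusted close prices
c = yahoo;
data = fetch(c, 'AGTHX', 'Adj Close', startDate, endDate);
% Release the connection once the data has been retrieved
close(c);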

Unfortunately, Yahoo does not provide data for the period from January 1991 to February 1993. For this time period, we have to collect the data manually.

Using the financial time series object in Financial Toolbox™, we convert the imported daily data to the desired monthly frequency:

% Convert daily prices to monthly frequency
tsobj = fints(dates, agthxClose);
tsobj = tomonthly(tsobj);
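
In MATLAB releases that provide timetables (R2016b and later), a similar daily-to-monthly conversion can be sketched without fints. This is an alternative under stated assumptions, not the code used for the article: the dates, prices, and variable names below are invented for illustration.

```matlab
% Hypothetical daily closing prices spanning two months (illustrative only)
exampleDates  = datetime(2008, 1, 30) + caldays(0:3)';   % Jan 30 to Feb 2
examplePrices = [10.0; 10.2; 10.1; 10.4];

% Store the series in a timetable
dailyTT = timetable(exampleDates, examplePrices, 'VariableNames', {'Close'});

% Keep the last observation of each calendar month
monthlyTT = retime(dailyTT, 'monthly', 'lastvalue');
```

Note that retime labels each row with the start of its month, so the date stamps may differ from those produced by tomonthly even when the retained prices are the same.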

Finally, we import reported data from the Fairfield Sentry Fund. We use two freely available Java™ classes, PDFBox and FontBox, to read the text from the PDF version of the Fairfield Sentry Fund fact sheet:


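% Assumes the PDFBox and FontBox JAR files are already on the Java class path
% (PDFBox 1.x package layout; in PDFBox 2.x the text stripper is
% org.apache.pdfbox.text.PDFTextStripper)

% Instantiate necessary classes
pdfdoc = org.apache.pdfbox.pdmodel.PDDocument;
reader = org.apache.pdfbox.util.PDFTextStripper;

% Read data
pdfdoc = pdfdoc.load(FilePath);
pdfstr = reader.getText(pdfdoc);

% Release the file handle once the text has been extracted
pdfdoc.close();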

Having imported the text, we extract the parts containing the data of interest—that is, a table of monthly returns.
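
One way to do this extraction is with a regular expression. The sketch below is only an illustration: the layout of the fact sheet text is not shown here, so the example string, its values, and the pattern are assumptions that would need adapting to the real document.

```matlab
% Hypothetical text fragment; the real fact sheet layout and values may differ
exampleText = 'Jan 0.63% Feb -0.06% Mar 0.18%';

% Match signed decimal numbers that are immediately followed by a percent sign
tokens = regexp(exampleText, '-?\d+\.\d+(?=%)', 'match');

% Convert the matched strings from percent to fractional returns
parsedReturns = str2double(tokens) / 100;
```

The same pattern could be applied to the text read from the PDF, provided the returns appear there as percentages in this form.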

Some tests for fraudulent data require comparison of the funds’ returns data to standard market data. We import the benchmark data for each fund using the techniques described above.

Once the data is imported and available, we can assess its consistency—for example, by comparing the normalized performance of all three funds (Figure 2).
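
The article does not state how the performance is normalized. One common approach, sketched here as an assumption, is to rebase every fund to a starting value of 1 and compound its returns. The return values and fund labels below are invented for illustration.

```matlab
% Hypothetical monthly returns, one column per fund (illustrative only)
fundReturns = [ 0.01  0.02;
               -0.02  0.01;
                0.03  0.00];

% Compound the returns and prepend the common starting value of 1
normalizedPerf = [ones(1, size(fundReturns, 2)); cumprod(1 + fundReturns)];

% Plot all funds on one axis for visual comparison
plot(normalizedPerf);
legend('Fund A', 'Fund B', 'Location', 'northwest');
ylabel('Growth of 1 unit invested');
```

Because every series starts at the same value, differences in growth and smoothness between the funds become directly comparable on a single plot.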

Figure 2. Plot comparing the performance of the funds under consideration.

Simply viewing the plot allows for a qualitative assessment. For example, the Madoff fund exhibits unusually smooth growth while yielding a high profit. The plot also reveals no obvious inconsistencies in the underlying data, so we can proceed to apply formal methods for detecting fraudulent activity.