Detecting In-App Purchase Fraud with Machine Learning


Hacking applications allow users to make in-app purchases for free. With help from a few big games in the GROW data network, we built a model that classifies each purchase as real or fraudulent with a very high level of accuracy.



By Ella Gati, Soomla.

Hacking applications such as Freedom, iAP Cracker, iAPFree, etc. allow users to make in-app purchases for free. With these kinds of hacks the player receives the coins, gems, levels or lives they purchased without paying any money. If the game developer did not implement any validation process for in-app purchases, such as SOOMLA’s fraud protection, the purchases are recorded as real purchases in their system. As a result, the reported revenue may differ greatly from the real revenue (especially in popular games with lots of fraud).

We would like to make reports as accurate as possible, and to be able to communicate to the game developers the real state of their game. We use machine learning and statistical modeling techniques for our solution.

With help from a few big games in the GROW data network, we were able to build a model that classifies each purchase as real or fraudulent with a very high level of accuracy.

In-app purchase model features

The model uses a variety of purchase, user and item features.  The following table details a partial list of the features we computed:

| Purchase | User | Item |
| --- | --- | --- |
| Date of purchase | Total number of purchases | Total number of purchases |
| Time of purchase | Total revenue | Total revenue |
| Country from which the purchase was made | Average revenue per day | Average number of purchases per day |
| Currency in which the purchase was made | Number of games the user played | Average revenue per day |
| Whether the phone locale matches the country | Number of games in which the user purchased | Maximum number of purchases per day |
| Whether the currency of the purchase matches the currency in the country | Whether the user was ever blocked by receipt verification | Maximum revenue per day |
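
As a rough illustration, user-level aggregates like these can be computed from a raw purchase log with a few pandas group-bys. The sketch below assumes a log with user_id, game_id, price and timestamp columns; these names are illustrative, not SOOMLA's actual schema.

```python
# Sketch of per-user feature aggregation from a purchase log with pandas.
# Column names (user_id, game_id, price, timestamp) are assumptions for illustration.
import pandas as pd

def user_features(purchases: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-user features from a purchase log (timestamp must be datetime)."""
    purchases = purchases.copy()
    purchases["date"] = purchases["timestamp"].dt.date

    # Revenue per user per calendar day, used for the per-day aggregates.
    per_day = purchases.groupby(["user_id", "date"])["price"].sum()

    features = purchases.groupby("user_id").agg(
        total_purchases=("price", "size"),
        total_revenue=("price", "sum"),
        games_played=("game_id", "nunique"),
    )
    features["avg_revenue_per_day"] = per_day.groupby("user_id").mean()
    features["max_revenue_per_day"] = per_day.groupby("user_id").max()
    return features
```

Item-level aggregates can be derived the same way by grouping on an item identifier instead of user_id.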

Decision trees to the rescue

Decision trees, as their name suggests, are trees that help decision making. Each internal node of the tree tests the value of one feature, and leaf nodes are target classes. Given a new observation, the tree can be used to decide what class should be assigned to it.

In our case the tree can have two kinds of leaf nodes (classes): fraud or no-fraud, and the features are the ones detailed above. Examples of internal node tests are “Total number of purchases > 100” or “currency matches country = true”.  To avoid overfitting the training data, tree-based techniques combine multiple trees to get a final output that is more accurate than each individual tree’s output.

Tree-based classification algorithms have many advantages, to name a few:

  • Nonlinear relationships between parameters do not affect tree performance.
  • Decision trees implicitly perform variable screening or feature selection.
  • Decision trees require relatively little data preparation and are easy to interpret and understand.

We experimented with two tree-based classifiers. A random forest classifier is an ensemble of decision trees trained on subsets of the data, which outputs the class that is the mode of the classes output by the individual trees. Boosted trees combine multiple decision trees using the gradient boosting technique, fitting a weighted additive expansion of simple trees.
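
As a minimal sketch of the two ensembles, the snippet below trains scikit-learn's RandomForestClassifier and GradientBoostingClassifier on synthetic data. The article does not name the library or the settings actually used, so everything here is an illustrative stand-in.

```python
# Illustrative comparison of the two tree ensembles on synthetic "purchase" data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled purchases (1 = fraud, 0 = valid).
X, y = make_classification(n_samples=5000, n_features=15, weights=[0.3, 0.7], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0  # 70%-30% split, as in the article
)

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosted trees": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```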

Another model parameter is the class weights, which we used in two forms: uniform, in which every class gets a weight of 1, and by-class, in which each class is weighted by its relative size in the full population.
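
The description above translates roughly into the weight computation below; the exact weights are not spelled out in this post, so treat this as a sketch of the idea rather than production code.

```python
# Sketch of the two weighting schemes: uniform vs. by-class (relative class size).
import numpy as np

def by_class_weights(y):
    """Weight each class by its relative size in the full population."""
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes, counts / len(y)))

y = np.array([1, 1, 1, 0, 1, 0, 1, 1])      # toy labels: 1 = fraud, 0 = valid
uniform = {c: 1.0 for c in np.unique(y)}     # uniform: every class weighted 1
by_class = by_class_weights(y)               # by-class: {0: 0.25, 1: 0.75}

# The chosen weights can then be applied per sample when fitting, e.g.
# clf.fit(X_train, y_train, sample_weight=np.array([by_class[c] for c in y_train]))
```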

Fraud classification performance

We evaluate the performance of our model with four measures (computed as sketched after the list):

  • Accuracy: ratio of correct classifications out of all test data.
  • False Positive rate (FPR): ratio of valid purchases wrongly classified as fraud out of all valid purchases.
  • False Negative rate (FNR): ratio of fraud purchases wrongly classified as non-fraud out of all fraud purchases.
  • F1 score: the harmonic mean of precision and recall, a measure from the information retrieval world that conveys the balance between the two.
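
For concreteness, all four measures can be read off a confusion matrix, treating fraud as the positive class; the snippet below uses toy labels purely for illustration.

```python
# Computing accuracy, FPR, FNR and F1 from a confusion matrix with scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = np.array([1, 1, 0, 1, 0, 1, 0, 1])   # 1 = fraud, 0 = valid
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
fpr = fp / (fp + tn)   # valid purchases wrongly flagged as fraud
fnr = fn / (fn + tp)   # fraud purchases that were missed
print("accuracy", accuracy_score(y_true, y_pred))
print("FPR", fpr, "FNR", fnr, "F1", f1_score(y_true, y_pred))
```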

Classifying valid purchases and users as fraud is a much worse mistake than missing a fraud purchase, so we aim to reduce the FPR to a minimum, even at the cost of a slightly higher FNR.

Our ground truth data includes purchases from 4 games. For the largest of them we have labels for 145K purchases. For both algorithms, parameters were tuned using cross-validation.
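
As an illustration of that tuning step, a grid search with cross-validated F1 scoring might look like the sketch below; the parameter grid is assumed, since this post does not list the values that were searched.

```python
# Illustrative hyperparameter tuning with cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=15, weights=[0.3, 0.7], random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "n_estimators": [100, 300],
        "max_depth": [3, 5],
        "learning_rate": [0.05, 0.1],
    },
    scoring="f1",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```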

Per game model

For the first experiment we built a different model for each game. The following table details the performance for different games and model parameters.

| Game | Classifier | Class weights | FPR | FNR | F1 score | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Random forest | uniform | 0.22 | 0.03 | 0.95 | 0.93 |
| 1 | Random forest | by class | 0.10 | 0.08 | 0.95 | 0.92 |
| 1 | Boosted trees | uniform | 0.09 | 0.03 | 0.97 | 0.96 |
| 1 | Boosted trees | by class | 0.05 | 0.05 | 0.96 | 0.95 |
| 2 | Random forest | uniform | 0.02 | 0.16 | 0.89 | 0.85 |
| 2 | Random forest | by class | 0.01 | 0.16 | 0.91 | 0.82 |
| 2 | Boosted trees | uniform | 0.01 | 0.11 | 0.94 | 0.84 |
| 2 | Boosted trees | by class | 0.05 | 0.05 | 0.90 | 0.85 |

These results are very impressive! The per-game model works very well, reaching an F1 score of up to 97%.

The second game has less ground truth data (it has roughly 100 times fewer purchases per month on average, and we received only one month of purchase data from them), which explains the lower performance.

Boosted trees outperform the random forest algorithm, which is not surprising, since gradient boosting is an optimization that normally achieves better accuracy with fewer trees.

Using weights tuned by class size usually results in a lower FPR and a higher FNR, with a slightly lower F1 score. As stated before, we care more about the false positive rate, so for the following experiments we use the boosted trees algorithm with non-uniform (by-class) weights.

Cross-games classification

We have seen that we can get very good results when we build a model for a specific game. But we have ground truth data only for four games. What about the rest of the games?

To test that, the second experiment was conducted with a training set containing data from one game and a test set from a different game.
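
Schematically, the experiment loops over ordered pairs of games, training on one and testing on the other. The `games` mapping below (game id to feature matrix and labels) is an assumed data structure, not our actual pipeline.

```python
# Sketch of the cross-game experiment: train on one game's labeled purchases,
# evaluate on another game's purchases.
from itertools import product

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

def cross_game_accuracy(games):
    """games: dict mapping game id -> (X, y). Returns accuracy per (train, test) pair."""
    scores = {}
    for train_game, test_game in product(games, repeat=2):
        if train_game == test_game:
            continue  # same-game scores come from a 70%-30% random split instead
        X_train, y_train = games[train_game]
        X_test, y_test = games[test_game]
        clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
        scores[(train_game, test_game)] = accuracy_score(y_test, clf.predict(X_test))
    return scores
```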

[Figure: accuracy heat map for the cross-game train/test combinations]

The heat map above shows the accuracy for the cross-game experiments. All scores are 79% or higher! This is great news for all of the other games. As expected, the highest scores are achieved when the training and test sets come from the same game's data (a 70%-30% random split). The lowest scores occur when testing on the 4th game, which is the smallest of them (200 purchases).

The FPR scores in this experiment are also interesting.

[Figure: false positive rate heat map for the cross-game train/test combinations]

It also stands out that the model trained on game 4 generates poor FPR scores. This is due to its small number of purchases and its relatively low share of fraud (54%, compared to 77%-85% in the other games). Game 1 has the largest ground truth set, and models trained on the other (smaller) games show a very high false positive rate on it: up to 30% of its valid purchases are classified as fraud. When training on the other games we get much better results, with only 1-2% of valid purchases wrongly classified.

This experiment showed that transferring a model between games works well in most cases, but can be problematic if a game has very unusual user behavior.

Results

Finally, we trained a model on all of our ground truth data and used it to classify all purchases in our database. According to the results of the classifier, 55.7% of purchases are fraud, and these purchases constitute 72.9% of the total revenue.

[Figure: fraud percentage by game size]

These numbers vary between games. We can see a general trend of higher fraud percentages in bigger games (games with more users), though we also see relatively small games with up to 89% fraud. The differences can be explained by different economy models, or by a game's popularity in different countries.

According to our model results, fraud is most widespread in Slavic countries. Russia, Ukraine and Belarus are at the top of the list, with over 90% of purchases being made fraudulently.

[Figure: fraud rate per country]

The model predicts that only 2% of users have both valid and fraudulent purchases. The other 98% of users are either fraudsters (all of their purchases are fraud) or legitimate (all of their purchases are valid). Of that 98%, over half are fraudsters.

Implications for game developers

Knowing which users are fraudsters enables game developers to adapt gameplay and take restrictive action to minimize lost revenue. Some options are:

  • Blocking in-app purchases altogether for a specific user.
  • Increasing game difficulty as a means of stalling the user’s non-legitimate progression made with hacked in-game coins.
  • Increasing ad frequency to maximize revenue from abusive users who will never pay.
  • Bricking the game, e.g. disabling all gameplay and showing a prominent warning message requesting an immediate in-app purchase to unlock the game.

How can we improve?

The more ground truth we have, the better our classification results will be. Game developers and studios can get better reports and help us improve by giving us feedback or sharing their sales reports with us.

Questions? Contact ella@soom.la.

This article first appeared on Soomla blog.
