Quadstone

KDD Cup 98:
Quadstone Take Bronze Miner Award

Decisionhouse

Quadstone's Decisionhouse is a complete and integrated suite of software for customer behaviour modelling. The Decisionhouse software incorporates all the necessary elements of database connectivity, analysis, statistics, visualisation, and data mining for business-focused customer modelling. Decisionhouse is used to understand customer behaviour for applications including:
  • propensity and response modelling
  • customer retention and churn prediction
  • database marketing
  • cross-selling target identification
  • customer profitability analysis
  • credit scoring.

For more information about Decisionhouse visit the Quadstone web pages at www.quadstone.com

Task

The dataset for this year's competition was provided by the Paralyzed Veterans of America (PVA). The PVA provides programs and services for US veterans and is one of the largest direct mail fundraisers in the US.

The competition dataset consisted of 191,779 records for lapsed donors who received a mailing as part of a larger campaign sent to a total of 3.5 million donors. Lapsed donors were an important group because the longer someone goes without donating, the less likely they are to donate again. It was therefore vital for the PVA to reactivate these donors.

Competitors were tasked with developing a model that would help the PVA maximize the net revenue generated from future renewal mailings to these lapsed donors. Competitors were assessed on the predicted net donations from the campaign.

Methodology

As the dataset had already been cleansed and pre-processed, Quadstone's analysts started with data exploration using Decisionhouse's visualisation modules. This gave a good understanding of the dataset and allowed the analysts to identify correlations between significant variables. Decision trees and scorecards were then introduced for the more complex modelling tasks.

The final model was chosen by comparing a variety of modelling approaches applied to an initial set of 'best' inputs (selected based on exploratory analysis, decision trees and initial modelling) and looking at the difference in predicted net profitability (lift curve).

Figure 1

Figure 1 illustrates a regression tree predicting donation amount. The population used to build the decision tree includes only donors who responded. The red leaf node contains 5.7% of the population with a mean(donation amount) of $40.80, as opposed to $15.60 for the entire population (at the root node). The green nodes contain populations with similar means to the root node, whereas the blue nodes contain populations with smaller means than the root node.
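As a rough illustration of how a regression tree chooses a single split, the sketch below searches one variable for the threshold that minimises the summed squared error of the two resulting groups. It is a generic textbook procedure, not a description of Decisionhouse's algorithm, and the variable names and toy values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the split x <= threshold that
    minimises the combined squared error, or None if no split is possible."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value cannot split anything
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (t, total)
    return best


# Hypothetical responders: last gift amount vs. donation amount (dollars).
last_gift = [5, 10, 20, 50, 100]
donation = [5, 6, 15, 40, 42]
threshold, error = best_split(last_gift, donation)
print(threshold, round(error, 2))
```

Applying the same search recursively to each side of the chosen split would grow a full tree; leaf nodes then report the mean donation of the records they contain.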

We investigated modelling techniques including scorecards, decision trees, and linear regression. We looked both at modelling donation amount directly (e.g. aiming to predict $0.00 for non-responders) and at combining a response model with a model predicting donation conditioned on response. We also used a variety of visual techniques to explore promising derived variables and interaction effects (illustrated in Figures 2-5).

Figure 2

Figure 2 illustrates the correlation between days since last gift and days since first gift. The colour coding indicates mean(Response); the colour ranges from green to red where green corresponds to low response rate and red corresponds to high response rate. The height of each glyph refers to population count.

Figure 3

Figure 4

Figure 3 shows the entire population displayed on a map of the US, illustrating the different response rates across States. The height of the glyphs corresponds to the size of the population in each State, the colour coding being used to illustrate mean(Response). Figure 4 illustrates the different response rates across California, obtained using the drill down feature from the map displayed in Figure 3.

To ensure the model was not over-fitting, a test/training approach was used: the learning set was split into two equal halves, the model was built on the training set, and its performance was checked on the test set, as illustrated in Figure 5. The depth of the glyphs shows the population count, and the height of the glyphs and colour coding are used to illustrate mean(Response). The x-axis corresponds to the score from the scorecard model predicting response, whereby responders are assigned a higher score than non-responders. The y-axis corresponds to the populations used to build and validate the scorecard.
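The test/training split can be sketched in a few lines. This is a minimal illustration under stated assumptions: the record layout, the `responded` field name and the fixed random seed are hypothetical and are not taken from the competition data or from Decisionhouse.

```python
import random


def split_in_half(records, seed=0):
    """Shuffle a copy of the records and split it into two equal halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def response_rate(records):
    """Mean of the binary response flag."""
    return sum(r["responded"] for r in records) / len(records)


# Hypothetical toy learning set with a 5% response rate.
records = [{"id": i, "responded": int(i % 20 == 0)} for i in range(1000)]
train, test = split_in_half(records)

# A model would be built on `train` only; comparing a statistic such as the
# response rate on both halves gives a first check that they are comparable.
print(response_rate(train), response_rate(test))
```

If a model scores much better on the training half than on the test half, that gap is the usual sign of over-fitting.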

Figure 5

The final model combines a 'predicted donation given response' model (built using regression trees and direct regression) with a 'likelihood of response' model (based on an additive scorecard). Together these models give an unconditioned predicted donation amount.
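The combination of the two models amounts to multiplying the probability of response by the expected donation given response. The sketch below assumes the scorecard output has already been converted to a probability; that conversion step and the example values are assumptions, not details given in the text.

```python
def expected_donation(p_response, donation_given_response):
    """Unconditioned expected donation:
    P(response) * E[donation | response]."""
    if not 0.0 <= p_response <= 1.0:
        raise ValueError("p_response must be a probability in [0, 1]")
    return p_response * donation_given_response


# Illustrative values only: a 5% chance of responding and a predicted
# gift of $15.60 if the donor does respond.
print(round(expected_donation(0.05, 15.60), 2))
```

A donor with a large predicted gift but a very low response likelihood can therefore rank below a donor with a modest predicted gift and a high response likelihood.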

The first component of the model is based on 6 variables relating primarily to previous gift amounts, as illustrated in Figure 6. The second component is based on approximately 10 variables, including information about previous gift behaviour (time and amount) and demographic information (the majority of these fields are derived, i.e. not present in the original dataset). Each of these variables is classed into between 2 and 6 attributes.
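An additive scorecard of this kind can be pictured as a lookup table: each variable is classed into a few attributes (bins), each attribute carries a points value, and a record's score is the sum of its points. The variables, bin boundaries and points below are invented for illustration and are not the ones used in the competition model.

```python
INF = float("inf")

# Hypothetical scorecard: variable -> list of (upper bound, points),
# ordered by upper bound. Each variable has between 2 and 6 attributes.
SCORECARD = {
    "days_since_last_gift": [(365, 30), (730, 15), (INF, 0)],
    "last_gift_amount": [(10, 5), (25, 12), (INF, 20)],
}


def attribute_points(bins, value):
    """Return the points of the first bin whose upper bound covers value."""
    for upper, points in bins:
        if value <= upper:
            return points
    raise ValueError("value not covered by any bin")


def score(record, scorecard=SCORECARD):
    """Additive score: sum of the attribute points across all variables."""
    return sum(attribute_points(bins, record[var])
               for var, bins in scorecard.items())


print(score({"days_since_last_gift": 400, "last_gift_amount": 30}))
```

Because the score is a plain sum, the contribution of each variable can be read directly from the table, which is one reason scorecards are popular with business analysts.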

Figure 6

Figure 6 illustrates the variables used in the predicted donation amount model. The population illustrated is donors who have made a donation in response to the most recent mailing. The colour coding refers to mean(donation amount) and height of bars corresponds to population count. Donors with larger LastGift, AvgGift or most recent donation values can be seen to have a higher donation amount.

The output from the final model (combining predicted donation given response with likelihood of response) was used to rank the learning dataset (Figure 7). The learning set was then classed into 10 equal-population groups using this ranking, and the observed mean(Target_D) for each group was calculated. The hold-out dataset was then ranked in the same manner as the learning dataset, and every record in each decile was assigned the mean(Target_D) value from the equivalent decile in the learning set (Figure 8). We did this primarily because of the specific success criterion used in the competition, even though it clearly reduces the utility of the prediction for ranking potential donors.
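The decile step described above can be sketched as follows. This is a minimal reading of the procedure, assuming ties are broken by record order and that decile 0 holds the highest scores; the function names and toy data are hypothetical.

```python
def assign_deciles(scores):
    """Rank records by descending score and return a decile (0 = top 10%)
    for each record, in the original record order."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: -scores[i])
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n
    return deciles


def decile_means(scores, targets):
    """Observed mean target (e.g. Target_D) within each decile."""
    sums, counts = [0.0] * 10, [0] * 10
    for d, t in zip(assign_deciles(scores), targets):
        sums[d] += t
        counts[d] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def apply_decile_means(holdout_scores, learning_means):
    """Give every hold-out record the learning-set mean of its decile."""
    return [learning_means[d] for d in assign_deciles(holdout_scores)]


# Toy learning set where the observed target happens to equal the score.
learning_scores = list(range(20))
learning_targets = [float(s) for s in learning_scores]
means = decile_means(learning_scores, learning_targets)
print(apply_decile_means(list(range(10)), means))
```

Every record in a decile ends up with the same predicted value, so the ordering within a decile is lost.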

Figure 7

In Figure 7 the learning dataset was ranked by predicted donation amount. Colour and height correspond to mean(Target_D) and depth corresponds to population count.

Figure 8

Figure 8 shows the hold-out dataset ranked by predicted donation amount. The mean(Target_D) for each ranking bin in the learning dataset was applied to the hold-out records in the same ranking bin. Colour and height correspond to mean(Target_D), which has been applied from the learning dataset, and depth corresponds to population count.

Figure 9

Results

Mailing everyone predicted to donate more than $0.68 (57,836 people) resulted in actual profits on the hold-out data of $13,954. Figure 9 illustrates a lift curve, which shows the expected performance of the model relative to a random model.
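The profit calculation and the curve behind a lift chart can be sketched as below. The sketch assumes that the $0.68 threshold is also the cost of one mailed piece, which the text does not state explicitly; the function names and toy values are hypothetical.

```python
MAIL_COST = 0.68  # assumed cost per mailed piece, in dollars


def net_profit(predicted, actual, threshold=MAIL_COST, cost=MAIL_COST):
    """Sum of (actual donation - cost) over everyone whose predicted
    donation exceeds the threshold, i.e. everyone who would be mailed."""
    return sum(a - cost for p, a in zip(predicted, actual) if p > threshold)


def cumulative_profit_curve(predicted, actual, cost=MAIL_COST):
    """Cumulative net profit when mailing in descending order of predicted
    donation; plotted against mailing depth, this is the model's curve."""
    order = sorted(range(len(predicted)), key=lambda i: -predicted[i])
    curve, total = [], 0.0
    for i in order:
        total += actual[i] - cost
        curve.append(total)
    return curve


# Toy data: predicted and actual donations for four donors.
predicted = [2.0, 0.5, 1.0, 0.1]
actual = [10.0, 0.0, 0.0, 5.0]
print(round(net_profit(predicted, actual), 2))
print([round(v, 2) for v in cumulative_profit_curve(predicted, actual)])
```

A random model would mail donors in arbitrary order, so its expected curve is a straight line from zero to the profit of mailing everyone; the gap between the two curves is the lift.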

Conclusion

Quadstone's Decisionhouse has been designed to handle the entire analysis process, from data preparation through to modelling, post-processing and model operationalisation. Decisionhouse normally has two routes to giving businesses the most valuable results from their data. The first is through its power to analyse very large volumes of data. The second is through enabling business analysts, primarily marketers, to interact with the data mining process, using their business knowledge to guide the analysis. This result is clear confirmation that the Quadstone approach delivers world-class models.




Copyright © 1998 Quadstone Limited