KDD Cup 98:
Quadstone Take Bronze Miner Award
Decisionhouse
Quadstone's Decisionhouse is a complete and integrated suite of software for customer behaviour modelling.
The Decisionhouse software incorporates all the necessary elements of
database connectivity, analysis, statistics, visualisation, and data mining for
business-focused customer modelling. Decisionhouse is used to
understand customer behaviour for applications including:
- propensity and response modelling
- customer retention and churn prediction
- database marketing
- cross-selling target identification
- customer profitability analysis
- credit scoring.
For more information about Decisionhouse visit the Quadstone web pages at www.quadstone.com
Task
The dataset for this year's competition was provided by the Paralyzed Veterans of America (PVA).
The PVA provides programs and services for US veterans and is one of the largest
direct mail fund raisers in the US.
The competition dataset consisted of 191,779 records of lapsed donors who received a mailing as part of
a larger campaign sent to a total of 3.5 million donors. Lapsed donors were an important group because the longer
someone goes without donating the less likely they are to donate again. It was therefore vital for the PVA
to reactivate these donors.
Competitors were tasked with developing a model that would help the PVA maximize the net
revenue generated from future renewal mailings to these lapsed donors. Competitors were assessed on
the predicted net donations from the campaign.
Methodology
As the dataset had already been cleansed and pre-processed, Quadstone's analysts started with
data exploration using Decisionhouse's visualisation modules. This gave a good understanding of the dataset
and allowed the analysts to identify correlations between significant variables. Decision trees and scorecards were then introduced
for the more complex modelling tasks.
The final model was chosen by comparing a variety of modelling approaches applied to an initial set of 'best' inputs (selected based on
exploratory analysis, decision trees and initial modelling) and looking at the difference in
predicted net profitability (lift curve).
Figure 1
Figure 1 illustrates a regression tree predicting donation amount. The population used
to build the decision tree includes only donors who responded. The red leaf node
contains 5.7% of the population with a mean(donation amount) of $40.80 as opposed
to $15.60 for the entire population (at the root node). The green nodes contain
populations with similar means to the root node, whereas the blue nodes contain
populations with smaller means than the root node.
We investigated modelling techniques including scorecards, decision trees, and linear
regression, and looked both at modelling donation amount directly (e.g. aiming to
predict $0.00 for non-responders) and by combining a response model with a model
predicting donation conditioned on response. We also used a variety of visual
techniques to explore promising derived variables and interaction effects (illustrated in
Figures 2-5).
Figure 2
Figure 2 illustrates the correlation between days since last gift and days since first gift.
The colour coding indicates mean(Response); the colour ranges from green to red
where green corresponds to low response rate and red corresponds to high response
rate. The height of each glyph refers to population count.
Figure 3
Figure 4
Figure 3 shows the entire population displayed on a map of the US, illustrating the different
response rates across States. The height of the glyphs corresponds to the size of the
population in each State, the colour coding being used to illustrate mean(Response).
Figure 4 illustrates the different response rates across California, obtained using the
drill down feature from the map displayed in Figure 3.
To ensure the model was not over-fitting, a test/training approach was used by
splitting the learning set into two (equal) halves, building the model on the training set,
and looking at performance on the test set as illustrated in Figure 5. The depth of the
glyphs shows the population count, and the height of the glyphs and the colour coding are
used to illustrate mean(Response). The x-axis corresponds to the score from the
scorecard model predicting response, whereby responders are assigned a higher score
than non-responders. The y-axis corresponds to the populations used to build and
validate the scorecard.
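The test/training check described above can be sketched roughly as follows. This is a minimal illustration in Python, not Decisionhouse functionality: the function names, the (score, responded) record layout and the band width are all assumptions made for the example.

```python
import random
from collections import defaultdict

def split_halves(records, seed=0):
    """Randomly split records into two (near-)equal halves: training and test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def response_rate_by_band(records, band_width=10):
    """Mean response per score band; each record is a (score, responded) pair.

    The band width of 10 is an arbitrary, hypothetical choice.
    """
    totals = defaultdict(lambda: [0, 0])  # band -> [responders, count]
    for score, responded in records:
        band = int(score // band_width) * band_width
        totals[band][0] += responded
        totals[band][1] += 1
    return {band: r / n for band, (r, n) in sorted(totals.items())}
```

If the response rates per score band on the test half track those on the training half, the scorecard is unlikely to be over-fitting.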
Figure 5
The final model combines a 'predicted donation given response' model (built using
regression trees and direct regression) with a 'likelihood of response' model (based on
an additive scorecard). Together these models give an unconditioned predicted donation amount.
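As a rough sketch of how two such components can be combined, the unconditioned prediction is the likelihood of response multiplied by the predicted donation given response. The function name and the example figures below are hypothetical and serve only to illustrate the arithmetic.

```python
def expected_donation(p_response, donation_given_response):
    """Unconditioned predicted donation = P(response) * E[donation | response]."""
    if not 0.0 <= p_response <= 1.0:
        raise ValueError("p_response must be a probability in [0, 1]")
    return p_response * donation_given_response

# Hypothetical donor: 5% chance of responding, $20 predicted gift if they do.
prediction = expected_donation(0.05, 20.0)
print(f"${prediction:.2f}")
```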
The first component of the model is based on 6 variables relating primarily to previous
gift amounts as illustrated in Figure 6. The second component is based on
approximately 10 variables, including information about previous gift behaviour (time
and amount), and demographic information (the majority of these fields are derived, i.e.
not present in the original dataset). These variables are each classed into between 2 and 6
attributes.
Figure 6
Figure 6 illustrates the variables used in the predicted donation amount model. The
population illustrated is donors who have made a donation in response to the most
recent mailing. The colour coding refers to mean(donation amount) and height of bars
corresponds to population count. Donors with larger LastGift, AvgGift or most
recent donation values can be seen to have a higher donation amount.
The output from the final model (combining likelihood of response with predicted donation given response) was
used to rank the learning dataset (Figure 7). The learning set was then classed into 10
equal population groups using this ranking and the observed mean(Target_D) for
each group was calculated for the learning dataset. The holdout dataset was then
ranked in the same manner as the learning dataset and every record in each decile was
assigned the mean(Target_D) value from the equivalent decile in the learning set
(Figure 8). We did this primarily because of the specific success criterion used in the
competition; the step clearly reduces the utility of the prediction for ranking potential donors.
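The decile step described above can be sketched as follows. This is an illustrative Python approximation under stated assumptions, not the Decisionhouse implementation: the function names are hypothetical, and ties in the ranking are broken simply by record order.

```python
def decile_means(predictions, targets, groups=10):
    """Rank records by prediction (highest first), cut the ranking into
    equal-population groups, and return the observed mean target per group."""
    n = len(predictions)
    order = sorted(range(n), key=lambda i: predictions[i], reverse=True)
    sums = [0.0] * groups
    counts = [0] * groups
    for rank, i in enumerate(order):
        g = rank * groups // n  # group index for this rank position
        sums[g] += targets[i]
        counts[g] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def apply_decile_means(predictions, means):
    """Rank a second dataset the same way and give every record the mean
    observed for the equivalent group in the first dataset."""
    n = len(predictions)
    groups = len(means)
    order = sorted(range(n), key=lambda i: predictions[i], reverse=True)
    assigned = [0.0] * n
    for rank, i in enumerate(order):
        assigned[i] = means[rank * groups // n]
    return assigned
```

Here `decile_means` would be run on the learning set with the observed Target_D values, and `apply_decile_means` on the hold-out set. Every record in a group receives the same value, which is why the ordering within a group is lost.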
Figure 7
In Figure 7 the learning dataset was ranked by predicted donation amount. Colour and
height correspond to mean(Target_D) and depth corresponds to population count.
Figure 8
Figure 8 shows the hold-out dataset ranked by predicted donation amount. The
mean(Target_D) for each ranking bin was applied to the hold-out dataset with the
same ranking. Colour and height correspond to mean(Target_D), which has been
applied from the learning dataset and depth corresponds to population count.
Figure 9
Results
Mailing everyone predicted to donate more than $0.68 (57,836 people) resulted in
actual profits on the hold-out data of $13,954. Figure 9 illustrates a lift curve,
which shows the expected performance of the model with respect to a random model.
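The selection rule and the profit calculation can be sketched as follows. The sketch assumes that the $0.68 threshold is the cost of sending one mailing, so that net profit is the sum of (donation minus cost) over everyone mailed; that reading, like the function names, is an assumption made for illustration.

```python
MAIL_COST = 0.68  # assumed per-piece mailing cost, taken from the threshold above

def select_for_mailing(predicted, threshold=MAIL_COST):
    """Indices of donors whose predicted donation exceeds the threshold."""
    return [i for i, p in enumerate(predicted) if p > threshold]

def net_profit(mailed, actual_donations, cost=MAIL_COST):
    """Sum of (actual donation - mailing cost) over the mailed donors."""
    return sum(actual_donations[i] - cost for i in mailed)

# Hypothetical example with three donors.
predicted = [1.00, 0.50, 0.70]
actual = [10.00, 0.00, 0.00]
mailed = select_for_mailing(predicted)
print(mailed, round(net_profit(mailed, actual), 2))
```

Under this reading, mailing a donor who does not respond costs $0.68, so the threshold marks the point at which the predicted donation covers the cost of the mailing.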
Conclusion
Quadstone's Decisionhouse has been designed to handle the entire analysis process from data
preparation through to modelling, post-processing and model operationalisation. Decisionhouse normally has
two routes to giving businesses the most valuable results from their data. The first is through its power to analyse
very large volumes of data. The second is through enabling business analysts, primarily marketers, to interact with
the data mining process, using their business knowledge to guide the analysis.
This result is clear confirmation that the Quadstone approach delivers world class models.
Copyright © 1998 Quadstone Limited