Urban Science wins the KDD-98 Cup
A second straight victory for GainSmarts
Background
GainSmarts is a premier data-mining tool that provides solutions to database marketers,
analysts, and statisticians. GainSmarts was developed by Drs. Jacob Zahavi
and Nissan Levin of Tel Aviv University and Urban Science. GainSmarts is a fully automated
tool consisting of several suites. These suites cover all aspects of data mining, from
data import, sampling, data cleaning, preprocessing, automatic transformations,
feature selection, and model building to cross-validation, scoring, and reporting. For further
information on GainSmarts, visit our web page at http://www.urbanscience.com and select GainSmarts.
Algorithm/Model
The competition for the KDD-98 Cup was based upon actual data donated by The Paralyzed Veterans
of America (PVA). Each record in the training PVA dataset represented a previously lapsed donor
and included their response to a recent mailing campaign, including the donation amount (if
applicable). The competitors were asked to calibrate a model using their data-mining tool
to predict the donation amount. The competitors were evaluated based upon maximizing the net
donations for the campaign (total donations minus contact costs). GainSmarts applied a
two-stage regression model (similar to Heckman's model) to predict the donation amount. The
first step of the two-stage model is a classification model (we used Logistic Regression)
applied to all prospects, where each prospect is assigned a probability
of donation. The second step is an estimation model (we used Linear Regression) applied
to the responding donors. This second model produces a conditional donation amount. The
product of the probability of donation (from step 1) and the conditional donation amount
(from step 2) produces an unconditional prediction of donation amount.
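
The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not GainSmarts code: the Prospect class, the function names, and the rule "mail when the expected donation exceeds the contact cost" are assumptions made for the example, and the two input scores are taken as given rather than fitted.

```python
from dataclasses import dataclass


@dataclass
class Prospect:
    """Scores for one prospect; both values would come from fitted models."""
    p_donate: float           # stage 1: probability of donating
    amount_if_donate: float   # stage 2: predicted donation, given a donation


def expected_donation(p: Prospect) -> float:
    """Unconditional prediction = P(donate) * E[amount | donate]."""
    return p.p_donate * p.amount_if_donate


def select_for_mailing(prospects, contact_cost):
    """Assumed decision rule: keep prospects whose expected donation
    exceeds the cost of contacting them."""
    return [p for p in prospects if expected_donation(p) > contact_cost]


def net_donations(actual_amounts, contact_cost):
    """Total donations received minus contact costs for everyone mailed."""
    return sum(actual_amounts) - contact_cost * len(actual_amounts)
```

For instance, with a placeholder contact cost of 0.68 per piece, a prospect scored at a 5% donation probability and a 20.00 conditional amount has an expected donation of 1.00 and would be mailed, while one scored at 2% and 15.00 (expected donation 0.30) would not.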
Modeling Process
- Split the dataset into training (calibration) and test (validation) sets.
- Explode raw variables into predictors using transformations. A variable
such as AGE can be used to create four binary categorical variables based upon the
distribution of AGE by quartile. Several transformations are created for each variable.
For example, AGE can also be transformed into: Chi-Square categories, a LOG transform,
and a Piece-Wise Linear transform. Each type of transformation of an individual variable
is referred to as a set of predictors. GainSmarts arranges these predictors hierarchically
and then tests each set to determine the "best" transformation to represent the variable in
the subsequent modeling processes.

- Univariate analysis by individual predictor
- Correlation analysis by predictor (within the hierarchy) to eliminate highly
correlated predictors.
- GainSmarts selects the best available representation for each attribute using
an expert-system (rule-based) approach, for example choosing AGE by QUARTILES, the
Piece-Wise Linear transform for AGE, and so on.
- Select the best set of attributes using a stepwise methodology.
- Correlation analysis across all remaining attributes to remove highly correlated
attributes.
- Select the final set of predictors in the model, using a rule-based mechanism,
to eliminate overfitting. This is achieved by limiting the number of coefficients (or weights), setting parameters properly, and introducing or eliminating entire representations of variables.
- Parameter estimation and calibration
- Cross-validation and output generation (to Excel)
- Model scoring (or code generation)
Note: Steps 2 through 10 of the list above were repeated for both stages of the modeling process.
Therefore, each stage of the modeling process could contain its own unique
variables with unique transformations.
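
Two of the steps above, exploding a raw variable into predictors and removing highly correlated predictors, can be illustrated with a minimal Python sketch. It is not the GainSmarts implementation: the function names, the simple quartile cut-point rule, the use of log1p for the LOG transform, and the 0.9 correlation threshold are all assumptions chosen for the example, and the rule-based (expert system) selection among representations is not reproduced here.

```python
import math


def quartile_cutpoints(values):
    """Return three cut points that split the sorted values into quarters."""
    s = sorted(values)
    n = len(s)
    return [s[(n * k) // 4] for k in (1, 2, 3)]


def quartile_indicators(value, cuts):
    """Encode one value as four 0/1 indicators, one per quartile bin."""
    bin_index = sum(value >= c for c in cuts)  # 0..3
    return [1 if i == bin_index else 0 for i in range(4)]


def log_transform(value):
    """LOG transform; log1p keeps zero values finite (an assumed choice)."""
    return math.log1p(value)


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(vx * vy)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep columns whose absolute correlation with every
    already-kept column is at or below the threshold.
    `columns` maps a predictor name to its list of values."""
    kept = []
    for name, col in columns.items():
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Given ages 20, 30, ..., 90, the cut points fall at 40, 60, and 80, so an age of 35 is encoded as [1, 0, 0, 0]. A column that is simply twice the AGE column would be dropped by drop_correlated, since its correlation with AGE is 1.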
Results
A comparison between the projected and actual results (less than 1% error) indicates
that the model developed was very robust and reliable.
Conclusion
Urban Science attributes our KDD Cup successes to our feature selection expert system. This
expert system includes (implicitly) the many years of experience of Drs. Zahavi and Levin
in developing models and data mining systems. GainSmarts also practically automates the
entire modeling process. The manual labor consisted of running 3 types of models/algorithms
and then comparing the results. Urban Science invites data-miners to request a trial
version of our software and run it themselves on the PVA database (once it becomes public, as planned).
For further information or to comment upon the competition, please feel free to
email or call Nitin Agrawal, Data Mining Project Manager at Urban Science
(+313-259-9900 or 800-321-6900 toll free in the U.S.).