Urban Science wins the KDD-98 Cup

A second straight victory for GainSmarts

Background

    GainSmarts is a premier data-mining tool that provides solutions to database marketers, analysts, and statisticians. GainSmarts was developed by Drs. Jacob Zahavi and Nissan Levin of Tel Aviv University and Urban Science. It is a fully automated tool consisting of several suites that cover all aspects of data mining: data import, sampling, data cleaning, preprocessing, automatic transformations, feature selection, model building, cross-validation, scoring, and reporting. For further information on GainSmarts, visit our web page at http://www.urbanscience.com and select GainSmarts.

Algorithm/Model

    The competition for the KDD-98 Cup was based upon actual data donated by the Paralyzed Veterans of America (PVA). Each record in the PVA training dataset represented a previously lapsed donor and included that donor's response to a recent mailing campaign, including the donation amount (if applicable). The competitors were asked to calibrate a model using their data-mining tool to predict the donation amount, and were evaluated on the net donations for the campaign (total donations minus contact costs). GainSmarts applied a two-stage regression model (similar to Heckman's model) to predict the donation amount. The first stage is a classification model (we used Logistic Regression) applied to all prospects, which assigns each prospect a probability of donation. The second stage is an estimation model (we used Linear Regression) applied only to the responding donors, which produces a conditional donation amount. The product of the probability of donation (from stage 1) and the conditional donation amount (from stage 2) gives an unconditional prediction of the donation amount.
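
    As an illustration only, the combination of the two stages can be sketched in a few lines of Python. This is not GainSmarts code. The names Prospect, expected_donation, select_for_mailing and net_donations are hypothetical, the two stage outputs are assumed to be already computed, and the rule "mail only when the expected donation exceeds the contact cost" is an assumption that follows from the net-donation criterion rather than a documented part of the winning entry. The contact cost is left as a parameter.

```python
from dataclasses import dataclass


@dataclass
class Prospect:
    prospect_id: str
    p_donate: float         # stage 1 output: probability of donating
    amount_if_donor: float  # stage 2 output: predicted amount, given a donation


def expected_donation(p: Prospect) -> float:
    """Unconditional prediction: P(donate) * E[amount | donate]."""
    return p.p_donate * p.amount_if_donor


def select_for_mailing(prospects, contact_cost):
    """Assumed rule: mail a prospect only if the expected donation beats the cost."""
    return [p for p in prospects if expected_donation(p) > contact_cost]


def net_donations(mailed, actual_amounts, contact_cost):
    """Total donations minus contact costs over the mailed prospects.

    actual_amounts maps prospect_id to the amount actually donated;
    prospects missing from the mapping are treated as non-donors.
    """
    return sum(actual_amounts.get(p.prospect_id, 0.0) - contact_cost for p in mailed)
```

    For example, with a contact cost of 1.0, a prospect with a 0.5 probability and a conditional amount of 10.0 has an expected donation of 5.0 and would be mailed, while a prospect with a 0.05 probability and the same conditional amount (expected donation 0.5) would not.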

Modeling Process

  1. Split the dataset into train (calibration) and test (validation) sets.
  2. Explode raw variables into predictors using transformations. A variable such as AGE can be used to create four binary categorical variables based upon the distribution of AGE by quartile. Several transformations are created for each variable. For example, AGE can also be transformed into Chi-Square categories, a LOG transform, and a Piece-Wise Linear transform. Each type of transformation of an individual variable is referred to as a set of predictors. GainSmarts arranges these predictors hierarchically and then tests each set to determine the "best" transformation to represent the variable in the subsequent modeling processes.

  3. Univariate analysis by individual predictor
  4. Correlation analysis by predictor (within the hierarchy) to eliminate highly correlated predictors.
  5. GainSmarts selects the best available representation for each attribute using an expert system (rule based) approach, choosing, for example, AGE by QUARTILES, the Piece-Wise Linear transform for AGE, or another representation.
  6. Select the best set of attributes using a stepwise methodology.
  7. Correlation analysis across all remaining attributes to remove highly correlated attributes.
  8. Select the final set of predictors in the model, using a rule based mechanism, to eliminate overfitting. This is achieved by limiting the number of coefficients (or weights), proper setting of parameters and introducing/eliminating entire representations of variables.
  9. Parameter estimation and calibration
  10. Cross-validate and generate output (to EXCEL)
  11. Model scoring (or code generation)
    Note: Steps 2-10 were repeated for both stages of the modeling process. Therefore, each stage could contain its own unique variables with unique transformations.
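
    To make steps 2, 4 and 7 more concrete, the following Python sketch shows one way a raw variable such as AGE could be exploded into quartile indicators and a LOG transform, and one way highly correlated predictors could be pruned. It is a simplified illustration, not the GainSmarts implementation. The function names, the 0.9 correlation threshold and the greedy "keep the first, drop later near-duplicates" rule are all assumptions made for the example.

```python
import math
from statistics import quantiles


def quartile_indicators(values):
    """Return four 0/1 columns, one per quartile of the observed values."""
    q1, q2, q3 = quantiles(values, n=4)  # three cut points
    columns = [[0] * len(values) for _ in range(4)]
    for i, v in enumerate(values):
        bucket = 0 if v <= q1 else 1 if v <= q2 else 2 if v <= q3 else 3
        columns[bucket][i] = 1
    return columns


def log_transform(values):
    """LOG transform; assumes every value is strictly positive."""
    return [math.log(v) for v in values]


def pearson(x, y):
    """Pearson correlation of two equal-length columns (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(predictors, threshold=0.9):
    """Greedy pruning (assumed rule): keep a predictor only if its absolute
    correlation with every predictor kept so far is at most the threshold."""
    kept = {}
    for name, column in predictors.items():
        if all(abs(pearson(column, other)) <= threshold for other in kept.values()):
            kept[name] = column
    return kept
```

    Each record falls into exactly one of the four quartile columns, so the indicators form one "set of predictors" for the variable in the sense used above. In this sketch the pruning order decides which of two correlated predictors survives, a choice that a rule-based system would presumably make on other grounds.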

Results

A comparison between the projected and actual results (less than 1% error) indicates that the model developed was very robust and reliable.

Conclusion

    Urban Science attributes our KDD Cup successes to our feature selection expert system. This expert system includes (implicitly) the many years of experience of Drs. Zahavi and Levin in developing models and data mining systems. GainSmarts also automates practically the entire modeling process. The manual labor consisted of running three types of models/algorithms and then comparing the results. Urban Science invites data-miners to request a trial version of our software and run it themselves on the PVA database (once it becomes public, as planned).


    For further information or to comment upon the competition, please feel free to email or call Nitin Agrawal, Data Mining Project Manager at Urban Science (+313-259-9900 or 800-321-6900 toll free in the U.S.).