KDnuggets : News : 2009 : n16 : item27

Publications


Subject: Breiman’s Quiet Scandal: Stepwise Logistic Regression and RELR

Daniel M. Rice, Rice Analytics, St. Louis MO
Copyright © 2009 Rice Analytics. All Rights Reserved.

Introduction

Leo Breiman, one of the most influential statisticians in recent memory, referred to the model selection problem that is apparent in stepwise logistic regression as the "quiet scandal" of statistics (Breiman, 1992). One problem is that arbitrary criteria are used to arrive at the stepwise model, such as an arbitrary cutoff on the statistical significance of a variable’s regression coefficient. Additionally, there is no attempt to model and reduce error in the regression coefficients, so the coefficients and their statistical significance can be quite unreliable across independent samples unless the sample size is very large. With arbitrary and unreliable selection criteria, different modelers and different samples can select entirely different variable sets. Also, the processing time of stepwise logistic regression makes it infeasible to model interactions among a large number of variables. Hence, as hinted in Breiman’s famous chiding remark, stepwise logistic regression is notorious for giving arbitrary and unreliable models that may completely miss important interactions. Unfortunately, there has been no better alternative that overcomes these problems and still gives a parsimonious model. Thus, most businesses still use stepwise logistic regression to model probability or risk in applications such as credit scoring, insurance risk, pharmaceutical treatment outcomes, consumer attitudes, and customer satisfaction, where there is a desire to have a transparent model with few variables.
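The sensitivity of stepwise selection to the significance cutoff and to the particular sample can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the procedure of any particular statistics package and not RELR: the data are simulated, variables enter by forward selection on a Wald p-value, and the names solve, fit_logistic, forward_stepwise and simulate are hypothetical helpers written for this illustration.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logistic(rows, y, iters=25):
    """Newton-Raphson logistic fit; returns (coefficients, standard errors).

    Each row must already start with 1.0 for the intercept.
    """
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, t in zip(rows, y):
            z = sum(b * v for b, v in zip(beta, x))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (t - p) * x[i]
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    # Standard errors: square roots of the diagonal of the inverse Hessian.
    se = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        se.append(math.sqrt(solve(hess, unit)[j]))
    return beta, se


def forward_stepwise(X, y, alpha=0.05):
    """Forward selection: add the candidate with the smallest Wald p-value
    until no remaining candidate falls below the cutoff alpha."""
    selected = []
    remaining = list(range(len(X[0])))
    while remaining:
        best = None
        for j in remaining:
            cols = selected + [j]
            rows = [[1.0] + [x[c] for c in cols] for x in X]
            beta, se = fit_logistic(rows, y)
            pval = math.erfc(abs(beta[-1] / se[-1]) / math.sqrt(2.0))
            if best is None or pval < best[0]:
                best = (pval, j)
        if best[0] >= alpha:
            break
        selected.append(best[1])
        remaining.remove(best[1])
    return selected


def simulate(n, seed):
    """Simulated data (an assumption of this sketch): three weak true
    effects and three pure-noise variables, all independent normals."""
    rng = random.Random(seed)
    true_beta = [0.5, 0.4, 0.3, 0.0, 0.0, 0.0]
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_beta]
        z = sum(b * v for b, v in zip(true_beta, x))
        X.append(x)
        y.append(1 if rng.random() < 1.0 / (1.0 + math.exp(-z)) else 0)
    return X, y


if __name__ == "__main__":
    # Independent samples from the same process, at two common cutoffs.
    for seed in (1, 2):
        X, y = simulate(200, seed)
        print(seed, forward_stepwise(X, y, 0.05), forward_stepwise(X, y, 0.01))
```

Running the script prints the column indices selected for each simulated sample at the 0.05 and 0.01 cutoffs. Because the greedy path is the same for both cutoffs and only the stopping point moves, the choice of cutoff alone changes the size of the model, and independent samples of this modest size may yield different variable sets.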

Recent evidence suggests that Reduced Error Logistic Regression (RELR) may represent a better alternative. RELR models and reduces error as part of the maximum likelihood solution, so its regression coefficients are very stable across independent samples. Also, Parsed RELR variable selection involves no arbitrary criteria: it returns the parsimonious solution that is the super maximum likelihood solution across variable sets, so different modelers will generate the identical model given identical training data. Additionally, RELR allows the modeling of interactions involving a very large number of variables. For these reasons, RELR is much less susceptible to the reliability and interpretive validity problems surrounding stepwise logistic regression. This may be especially important in the United States in the increasingly regulated financial, insurance, health, pharmaceutical and automobile industries. In these industries, logistic regression models of probability and risk ultimately determine the nature of the product or service offered and who may purchase it. The large-scale failure of this probability and risk modeling in many of these same industries is now viewed as a primary cause of the global recession. Hence, arbitrary and unreliable methods like stepwise logistic regression are now even more difficult to defend. Business managers and statisticians will need to consider any better alternative.




Copyright © 2009 KDnuggets.