Data Science of Variable Selection: A Review
There are as many approaches to selecting features as there are statisticians since every statistician and their sibling has a POV or a paper on the subject. This is an overview of some of these approaches.
By Thomas Ball, Advanced Analytics Professional.
Data scientists are always stressing over the “best” approach to variable selection, particularly when faced with massive amounts of information -- a frequent occurrence these days. "Massive" by today's standards means terabytes of data and tens, if not hundreds, of millions of features or predictors. There are many reasons for this “stress” but the reality is that a single, canonical solution does not exist. There are as many approaches to selecting features as there are statisticians since every statistician and their sibling has a POV or a paper on the subject.
For years, there have been rumors that Google uses all available features in building its predictive algorithms. To date however, no disclaimers, explanations or working papers have emerged that clarify and/or dispute this rumor. Not even their published patents help in the understanding. As a result, no one external to Google knows what they are doing, to the best of my knowledge.
One of the biggest problems in predictive modeling is the conflation between classic hypothesis testing with careful model specification vis-a-vis pure data mining. The classically trained can get quite dogmatic about the need for "rigor" in model design and development. The fact is that when confronted with massive numbers of candidate predictors and multiple possible targets or dependent variables, the classic framework neither works, holds nor provides useful guidance – how does anyone develop a finite set of hypotheses with millions of predictors? Numerous recent papers delineate this dilemma from Chattopadhyay and Lipson's brilliant paper Data Smashing: Uncovering Lurking Order in Data (available here) who state, "The key bottleneck is that most data comparison algorithms today rely on a human expert to specify what ‘features’ of the data are relevant for comparison. Here, we propose a new principle for estimating the similarity between the sources of arbitrary data streams, using neither domain knowledge nor learning." To last year's AER paper on Prediction Policy Problems by Kleinberg, et al., (available here) which makes the case for data mining and prediction as useful tools in economic policy making, citing instances where "causal inference is not central, or even necessary."
The fact is that the bigger, $64,000 question is the broad shift in thinking and challenges to the classic hypothesis-testing framework implicit in, e.g., this Edge.org symposium on "obsolete" scientific thinking (available here) as well as this recent article by Eric Beinhocker on the "new economics" (available here) which presents some radical proposals for integrating widely disparate disciplines such as behavioral economics, complexity theory, network and portfolio theory into a platform for policy implementation and adoption. Needless to say, these discussions go far beyond merely statistical concerns and suggest that we are undergoing a fundamental shift in scientific paradigms. The shifting views are as fundamental as the distinctions between reductionistic, Occam's Razor like model-building vs Epicurus' expansive Principle of Plenitude or multiple explanations which roughly states that if several findings explain something, retain them all (see, e.g., here).
Of course, guys like Beinhocker are totally unencumbered with practical, in the trenches issues regarding applied, statistical solutions to this evolving paradigm. Wrt the nitty-gritty questions of ultra-high dimensional variable selection, there are many viable approaches to model building that leverage, e.g., Lasso, LAR, stepwise algorithms or "elephant models” that use all of the available information. The reality is that, even with AWS or a supercomputer, you can't use all of the available information at the same time – there simply isn’t enough RAM to load it all in. What does this mean? Workarounds have been proposed, e.g., the NSF's Discovery in Complex or Massive Datasets: Common Statistical Themes to "divide and conquer" or "bags of little jacknife" algorithms for massive data mining, e.g., Wang, et al's paper, A Survey of Statistical Methods and Computing for Big Data (available here) as well as Leskovec, et al's book Mining of Massive Datasets (available here).
There are now literally hundreds, if not thousands of papers that deal with various aspects of these challenges, all proposing widely differing analytic engines as their core from so-called “D&C" (divide and conquer) or "BLJ" (bags of little jacknife) algorithms; unsupervised, "deep learning" models; random matrix theory applied to massive covariance construction; Bayesian tensor models to classic, supervised logistic regression, and more. Fifteen years or so years ago, the debate largely focused on questions concerning the relative merits of hierarchical Bayesian solutions vs frequentist finite mixture models. In a paper addressing these issues, Ainslie, et al. (available here) came to the conclusion that, in practice, the differing theoretical approaches produced largely equivalent results with the exception of problems involving sparse and/or high dimensional data -- where HB models had the advantage. Today with the advent of D&C-type workarounds, any arbitrage HB models may have historically enjoyed are rapidly being eliminated.
The basic logic of these D&C-type workarounds are, by and large, extensions of Breiman's famous random forest technique which relied on bootstrapped resampling of observations and features. Breiman did his work in the late 90s on a single CPU when massive data meant a few dozen gigs and a couple of thousand features processed over a couple of thousand iterations. On today's massively parallel, multi-core platforms, it is possible to run algorithms analyzing terabytes of data containing tens of millions of features that build millions of "RF" mini-models in a few hours. Theoretically, it’s possible to build models using petabyes of data with these workarounds but the present IT platforms and systems won’t execute that yet – to the best of my knowledge (if any knows where this is being done and how, please feel free to share that information).
There are any number of important questions coming out of all of this. One has to do with a concern over a possible loss of precision due to the approximating nature of these workarounds. This issue has been addressed by Chen and Xie in their paper, A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data (available here) where they conclude that these approximations are indistinguishably different from "full information" models.
A second concern which, to the best of my knowledge hasn't been adequately addressed by the literature, has to do with what is done with the results (i.e., the "parameters") from potentially millions of predictive mini-models once the workarounds have been rolled up and summarized. In other words, how does one execute something as simple as "scoring" new data with these results? Are the mini-model coefficients to be saved and stored or does one simply rerun the D&C algorithm(s) on new data?
In his book, Numbers Rule Your World (available here), Kaiser Fung describes the dilemma Netflix faced when presented with an ensemble of only 104 models handed over by the winners of their competition. The winners had, indeed, minimized the MSE vs all other competitors but this translated into only a several decimal place improvement in accuracy on the 5-point, Likert-type rating scale used by their movie recommender system. In addition, the IT maintenance required for this small ensemble of models cost much more than any savings seen from the "improvement" in model accuracy.
Then there's the whole question of whether "optimization" is even possible with information of this magnitude. For instance, Emmanuel Derman, the physicist and financial engineer, in his autobiography My Life as a Quant suggests that optimization is an unsustainable myth, at least in financial engineering.
Finally, questions concerning relative feature importance with massive numbers of features have yet to be addressed.
There are no easy answers wrt questions concerning the need for variable selection and the new challenges opened up by the current, Epicurean workarounds remain to be resolved. The bottom line is that we are all data scientists now.
Bio: Thomas Ball is an advanced analytics leader with Fortune 500 and start-up experience. He has led teams in management consulting, digital media, financial and health care industries.
Source: Originally posted anonymously by the author to a thread on Stack Exchange's statistical Q&A site, Cross Validated. Reposted with permission.
- Datasets Over Algorithms
- Why Implement Machine Learning Algorithms From Scratch?
- Beyond One-Hot: an exploration of categorical variables