By Tim Graettinger, Ph.D.
Before diving into a data mining project, an organization must consider whether it has the quantity and quality of data necessary to produce a successful predictive model. Sometimes, however, a finished model does not perform as well on test data or new, real-world data as it did on the data used to create it. Did you really have enough data? If so, what went wrong, and how do you fix it?
The question of data quantity does not stand apart from data quality, sampling, and a host of other related issues. Entire books describe these aspects in great detail. My goal is more modest: to discuss the symptoms of, and the treatments for, one of the most common misconceptions about data quantity in data mining, namely believing (or hoping) that plenty of data is available for model building when it really isn't.
This article tackles that scenario and, in the process, shows you how to avoid and how to fix the classic data mining pitfall of overfitting.