Grab Bag 2: More Frequently-Asked Questions (and Answers) about Data Mining
By Tim Graettinger, PhD.
While helping present a monthly webinar on data mining (DM), I'm asked some challenging and really pivotal questions about DM and predictive analytics:
- How much data do I need for data mining?
- Why doesn't my predictive model perform as well on new data as it does on the training data?
- Are new data mining/predictive analytics/modeling algorithms needed to produce better results?
Question 1: How much data do I need for data mining?
This is by far the most common question people have about data mining (DM), and it's worth asking why this question gets so much attention. I think it's almost a knee-jerk response when you first encounter data mining. You have data, and you want to know if you have enough to do anything useful with it from a DM perspective. But despite the apparent simplicity of the question, it is unwise to try to answer it without digging deeper and asking yet more questions. My goal here is to provide you with the guiding principles you need to understand so you can ask those next questions. You'll even get a rule of thumb so you can produce your own estimate of the data you'll need for DM.
One guiding principle is based on relationship complexity, that is, the complexity of the relationship you want to model. The more complex the relationship, the more data you need to model it accurately. Duh, right? But ask yourself, "What's the problem with this guiding principle?" Did you say, "I don't know how complex the relationship is"? Good.
From a practical perspective, it's useful to think of complexity in terms of the number of factors that might play a role in the relationship. Let's say that you want to predict customer churn. Think about the probable factors that might impact churn, such as: tenure, age of the customer, number of complaints, and total lifetime value of purchases, among others. Are there 4 probable factors or 14? Don't be concerned about fine precision here. You just want to get in the right ballpark.