Training Sets, Test Sets, and 10-fold Cross-validation
Editor's note: This is an excerpt from Ron Zacharski's freely available online book titled A Programmer's Guide to Data Mining: The Ancient Art of the Numerati.
At the end of the previous chapter we worked with three different datasets: the women athlete dataset, the iris dataset, and the auto miles-per-gallon one. We divided each of these datasets in turn into two subsets. One subset we used to construct the classifier. This data set is called the training set. The other set was used to evaluate the classifier. That data is called the test set. Training set and test set are common terms in data mining.
People in data mining never test with the data they used to train the system.
You can see why we don't use the training data for testing if we consider the nearest neighbor algorithm. If Marissa Coleman, the basketball player from the above example, were in our training data, she (at 6 foot 1 and 160 pounds) would be her own nearest neighbor. So when evaluating a nearest neighbor algorithm, if our test set is a subset of our training data, we would always be close to 100% accurate. More generally, in evaluating any data mining algorithm, if our test set is a subset of our training data, the results will be optimistic, often overly so. So that doesn't seem like a great idea.
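A minimal sketch of this effect, assuming a plain 1-nearest-neighbor classifier on (height in inches, weight in pounds). The function names are hypothetical, and only the Coleman, Black, and Petrova figures come from the text; the remaining rows are invented for illustration.

```python
import math

# (height in inches, weight in pounds) -> sport
training = [
    ((73, 160), "basketball"),  # Marissa Coleman, 6 foot 1, 160 pounds
    ((63, 124), "basketball"),  # Debbie Black, 5 foot 3, 124 pounds
    ((63, 108), "track"),       # Tatyana Petrova, 5 foot 3, 108 pounds
    ((54, 66), "gymnastics"),   # invented row
    ((65, 110), "track"),       # invented row
]

def nearest_neighbor(point, data):
    """Return the label of the record in data closest to point."""
    _, label = min(data, key=lambda rec: math.dist(rec[0], point))
    return label

def accuracy(test, train):
    """Fraction of test records whose predicted label matches the true one."""
    hits = sum(1 for features, label in test
               if nearest_neighbor(features, train) == label)
    return hits / len(test)

# Testing on the training data: every record is at distance 0 from itself,
# so it is its own nearest neighbor and is always "classified" correctly.
resubstitution_accuracy = accuracy(training, training)
print(resubstitution_accuracy)
```

Because every record here has a distinct height and weight, the score is perfect, which says nothing about how the classifier would handle an athlete it has never seen.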
How about the idea we used in the last chapter? We divide our data into two parts. The larger part we use for training and the smaller part we use for evaluation. As it turns out, that has its problems too. We could be extremely unlucky in how we divide up our data. For example, all the basketball players in our test set might be short (like Debbie Black, who is only 5 foot 3 and weighs 124 pounds) and get classified as marathoners. And all the track people in the test set might be short and lightweight for that sport, like Tatyana Petrova (5 foot 3 and 108 pounds), and get classified as gymnasts. With a test set like this, our accuracy will be poor. On the other hand, we could be very lucky in our selection of a test set. Every person in the test set might be the prototypical height and weight for their respective sport, and our accuracy would be near 100%. In either case, the accuracy based on a single test set may not reflect the true accuracy when our classifier is used with new data.
A solution to this problem might be to repeat the process a number of times and average the results. For example, we might divide the data in half. Let’s call the parts Part 1 and Part 2.
We can use the data in Part 1 to train our classifier and the data in Part 2 to test it. Then we will repeat the process, this time training with Part 2 and testing with Part 1. Finally we average the results. One problem with this, though, is that we are only using 1/2 the data for training during each iteration. But we can fix this by increasing the number of parts. For example, we can have three parts, and for each iteration we will train on 2/3 of the data and test on 1/3, holding out a different third each time.
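The rotation described above can be sketched as follows. This is an illustrative sketch only: `rotate_parts` is a hypothetical helper, and the twelve integers stand in for real records.

```python
def rotate_parts(data, k):
    """Yield (train, test) pairs, using each of k parts once for testing."""
    # Deal the items round-robin into k parts of (nearly) equal size.
    parts = [data[i::k] for i in range(k)]
    for i in range(k):
        test = parts[i]
        # Training data is everything that is not in the held-out part.
        train = [item for j, part in enumerate(parts) if j != i
                 for item in part]
        yield train, test

data = list(range(12))  # stand-in for 12 records

for k in (2, 3):
    for train, test in rotate_parts(data, k):
        print(f"{k} parts: train on {len(train)}, test on {len(test)}")
```

With two parts each iteration trains on 6 of the 12 records; with three parts it trains on 8, and every record still gets tested exactly once.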
In data mining, the most common number of parts is 10, and this method is called ...
10-Fold Cross-Validation
With this method we have one data set, which we divide randomly into 10 parts. We use 9 of those parts for training and reserve one tenth for testing. We repeat this procedure 10 times, each time reserving a different tenth for testing.
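One way the procedure might look in code, as a sketch rather than the book's implementation. The names `cross_validate` and `majority_class` are hypothetical, the baseline classifier is deliberately trivial, and the 60/40 data is invented. Correct counts are pooled over all folds, which equals the average of the per-fold accuracies when the folds are the same size.

```python
import random

def cross_validate(records, count_correct, k=10, seed=0):
    """Shuffle records, split into k folds, and test on each fold in turn.

    count_correct(train, test) must build a classifier from train and
    return how many records in test it labels correctly.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)        # random division into parts
    folds = [shuffled[i::k] for i in range(k)]
    correct = 0
    for i, test in enumerate(folds):
        # Train on the other k - 1 folds, test on the reserved one.
        train = [rec for j, fold in enumerate(folds) if j != i
                 for rec in fold]
        correct += count_correct(train, test)
    return correct / len(shuffled)

def majority_class(train, test):
    """Hypothetical baseline: always predict the most common training label."""
    labels = [label for _, label in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, label in test if label == guess)

# Invented data: 60 "yes" records and 40 "no" records.
records = [(i, "yes") for i in range(60)] + [(i, "no") for i in range(40)]
print(cross_validate(records, majority_class))
```

Any classifier can be plugged in through `count_correct`; the cross-validation loop itself never looks inside the records.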
Let’s look at an example. Suppose I want to build a classifier that just answers yes or no to the question "Is this person a professional basketball player?" My data consists of information about 500 basketball players and 500 non-basketball players.
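Often we will put the final results in a table like the one below. Rows give each person's actual class and columns give the class the classifier assigned; the off-diagonal counts follow from the totals of 500 people per class.

                              Classified as player    Classified as non-player
Actual basketball players             372                      128
Actual non-players                    220                      280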
So of the 500 basketball players, 372 were classified correctly. One thing we could do is add things up and say that of the 1,000 people, we classified 652 (372 + 280) correctly. So our accuracy is 65.2%. The measures we obtain using ten-fold cross-validation are more likely to be truly representative of the classifier's performance than those from two-fold or three-fold cross-validation. This is because each time we train the classifier we are using 90% of our data, compared with only 50% for two-fold cross-validation.
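The arithmetic behind the 65.2% figure, written out as a short sketch. The variable names are illustrative; the counts of 500, 372, and 280 come from the text, and the misclassified counts are derived from them.

```python
players_total = 500
non_players_total = 500
players_correct = 372        # basketball players classified as players
non_players_correct = 280    # non-players classified as non-players

# Misclassified counts follow from the class totals.
players_missed = players_total - players_correct              # 128
non_players_missed = non_players_total - non_players_correct  # 220

correct = players_correct + non_players_correct               # 652
total = players_total + non_players_total                     # 1000
accuracy = correct / total

print(f"{correct} of {total} correct, accuracy {accuracy:.1%}")
# prints "652 of 1000 correct, accuracy 65.2%"
```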
To read more from this particular discussion, see chapter 5 of Ron Zacharski's A Programmer's Guide to Data Mining: The Ancient Art of the Numerati.
Bio: Ron Zacharski is a Zen Buddhist monk and computational linguist living in Las Cruces, New Mexico, and the author of "A Programmer's Guide to Data Mining: The Ancient Art of the Numerati." His Erdős number is 3. He has a scientific productivity h-index of 14 and g-index of 41.