KDnuggets : News : 2007 : n23 : item6
Subject: Ronny Kohavi’s advice for students and young researchers in DM
Gregory PS: What advice would you give for students and young researchers studying data mining?
Work with real data. There are fundamental concepts in data mining that everyone should be aware of, yet I sometimes see researchers who are oblivious to them. Make sure to internalize them. The easiest one is: don’t train on the test set. It is amazing how much better you can do if you use the test set. It seems obvious, yet you see papers where researchers discretize real-valued attributes into discrete ranges using the whole data set, and only then split it into a training and test set. If the discretization algorithm used any part of the test set (e.g., the test set influenced the min/max of a range, or its labels influenced the thresholds), the evaluation is invalid. Other researchers have used the "wrapper" approach to feature or parameter selection and reported the best result without keeping an independent test set.

What are some fundamental concepts that are not always taught in introductory books and classes? Here are three:
The no-free-lunch theorems by Wolpert, and similar observations about version spaces, point out something obvious: you can’t generalize without assumptions. When developing an algorithm, can you identify the circumstances under which it will outperform others? For example, feature selection algorithms (of which decision tree induction is one instance) assume that some features are less important, or irrelevant. We know that learning is impossible in all cases, just as compression is impossible without assumptions. The good news is that gzip does save space in practice, and that learning algorithms do work in real life.

Finally, Simpson’s paradox (GPS: this paradox occurs when the successes of the individual groups seem reversed once the groups are combined. It can happen when a hidden variable, irrelevant within each group, must be accounted for in the combined assessment.) is an observation that should always cause you to think twice about hidden factors. I gave a talk at the Bay Area ACM Data Mining SIG in June 2006 (http://exp-platform.com/acmDMSig.aspx) where I shared some examples, and I had a flood of e-mails afterward. If you haven’t seen examples, the math is trivial, but it’s called a paradox because it’s so unintuitive.

My third recommendation is to view research papers with healthy skepticism and to try to replicate their results. Jerry Friedman at Stanford (co-author of CART, and one of my advisors) used to say that when you read a paper in which the author shows quantitative comparisons of their algorithm with several others, you should ignore the author’s result (usually at the top) and instead gather insights about how the other algorithms perform. As researchers, we fall in love with our algorithms and keep tweaking them against the same datasets, thus inadvertently overfitting the data.

My fourth recommendation is to constantly drill down into the details and raw data.
There’s a nice quotation by Toby Harrah: “Statistics are like bikinis; they show a lot, but not everything.” Over and over again, I have learned that one has to drill down to get fundamental insights. Most of the time, an amazing result will be traced to an error in the collection, the process, or the data manipulations (Twyman’s law says that any statistic that appears interesting is almost certainly a mistake). Sadly, this is true in many cases, but after nine drill-downs that lead to nothing, you will find a true gem.
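The discretization leakage described earlier is easy to reproduce. A minimal sketch, assuming hypothetical equal-width binning on synthetic data (the function names and bin count are illustrative, not from the interview): the bin edges must come from the training set alone, because letting the test set’s extremes shape the edges lets test information leak into preprocessing.

```python
import random

random.seed(0)
values = [random.gauss(0, 1) for _ in range(1000)]
train, test = values[:800], values[800:]

def edges(xs, bins=4):
    """Equal-width bin edges spanning the min/max of xs."""
    lo, hi = min(xs), max(xs)
    return [lo + (hi - lo) * i / bins for i in range(bins + 1)]

# WRONG: edges computed from the full data set -- the test set's
# extremes can shift the thresholds, which is exactly the leak
# Kohavi warns about.
leaky = edges(train + test)

# RIGHT: edges computed from the training set only; test values that
# fall outside the training range land in the extreme bins.
clean = edges(train)

def discretize(x, e):
    """Map x to a bin index given edge list e (clips to the outer bins)."""
    for i in range(1, len(e) - 1):
        if x < e[i]:
            return i - 1
    return len(e) - 2
```

The same discipline applies to any fitted preprocessing (scaling, imputation, feature selection): fit on the training split, then apply to the test split.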
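The wrapper-selection pitfall can also be demonstrated directly. In this hypothetical sketch, every "model" is pure noise, yet picking the best of 1000 by validation accuracy produces a score well above chance; only an independent test set reveals the truth. The setup (random labels, random predictors) is an illustration, not Kohavi’s experiment.

```python
import random

random.seed(1)
n = 100
y_val = [random.randint(0, 1) for _ in range(n)]   # validation labels (coin flips)
y_test = [random.randint(0, 1) for _ in range(n)]  # held-out test labels

def accuracy(preds, y):
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# Wrapper-style search: try 1000 random predictors and keep the one
# with the best validation score. Reporting that score without an
# independent test set inflates it via the selection itself.
best_val, best = -1.0, None
for _ in range(1000):
    preds = [random.randint(0, 1) for _ in range(2 * n)]
    score = accuracy(preds[:n], y_val)
    if score > best_val:
        best_val, best = score, preds

# The selected model's score on labels it was never selected against.
test_acc = accuracy(best[n:], y_test)
print(f"selected validation accuracy: {best_val:.2f}")  # inflated by selection
print(f"independent test accuracy:    {test_acc:.2f}")  # near chance
```

The gap between the two numbers is the overfitting introduced by model selection alone.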
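Simpson’s paradox itself takes only a few lines to demonstrate. A sketch using the textbook kidney-stone numbers (Charig et al., 1986) — a standard illustration, not one of the examples from Kohavi’s talk: treatment A wins within every group, yet B looks better once the groups are combined, because A was disproportionately assigned the harder (large-stone) cases.

```python
# (successes, trials) per treatment, within each stone-size group.
groups = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

# Within each group, A has the higher success rate.
for name, g in groups.items():
    print(f"{name}: A={rate(*g['A']):.0%}  B={rate(*g['B']):.0%}")

# Aggregate over groups: the ordering flips, because stone size
# (a hidden factor) is correlated with the treatment assignment.
tot = {t: [sum(g[t][i] for g in groups.values()) for i in (0, 1)]
       for t in ("A", "B")}
a_all, b_all = rate(*tot["A"]), rate(*tot["B"])
print(f"combined: A={a_all:.0%}  B={b_all:.0%}")
```

As the interview says, the math is trivial; the surprise is that aggregation alone can reverse every within-group conclusion.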
Copyright © 2007 KDnuggets.