Ronny Kohavi advice for students and young researchers in DM (KDnuggets News 07:23, item 6, Features)

Features

Subject: Ronny Kohavi advice for students and young researchers in DM

Gregory PS: What advice would you give for students and young researchers studying data mining?

Ronny Kohavi:
I’ll make four recommendations: work with real data, spend time learning and thinking about the fundamental concepts, read papers with healthy skepticism, and always drill down (or peel the onion).

Work with real data.
The UCI machine learning repository, which I consider semi-real data, is significantly better than artificial distributions. KDD Cups contain good data, as does the Netflix dataset. Data sets that are real have noise and errors, and they are a much closer representation of real-life problems than artificial datasets. What you can’t get in the lab is the ability to deploy your algorithm or idea in real life. What I like about my current group at Microsoft, the Experimentation Platform (http://exp-platform.com), is that we’re not doing data mining in some back room hoping someone will use it, but we’re involved in live experiments, and can quickly measure the impact on the business.

There are fundamental concepts in data mining that everyone should be aware of, yet at times I see researchers who overlook them, probably because they were never made aware of them. Make sure to internalize them. The easiest one is

don’t train on the test set.

It is amazing how much better you can do if you use the test set. It seems obvious, yet you see papers where researchers discretize real-valued attributes into discrete ranges using the whole data set, then split it into a training and test set. If the discretization algorithm used a piece of the test set (e.g., the min/max of the range may be impacted, or labels may have impacted the thresholds) it’s invalid. Some researchers used the "wrapper" approach to feature or parameter selection, and reported the best result without keeping an independent test set.
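As a minimal sketch of the discretization point, the snippet below learns equal-width cut points from the training split only and then applies them to the test split. The helper names (`fit_equal_width_bins`, `discretize`) and the toy values are assumptions made for illustration; they are not from the interview.

```python
def fit_equal_width_bins(values, n_bins):
    """Return the interior cut points of n_bins equal-width bins over values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def discretize(value, cut_points):
    """Map a value to a bin index; out-of-range values land in the end bins."""
    return sum(value > c for c in cut_points)


# Toy data (hypothetical): the test split contains values outside the training range.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [0.0, 9.0]

# Valid: cut points are learned from the training split only.
cut_points = fit_equal_width_bins(train, n_bins=4)
train_bins = [discretize(v, cut_points) for v in train]
test_bins = [discretize(v, cut_points) for v in test]

# Invalid: fitting on train + test lets the test set's min/max move the cut points.
leaky_cut_points = fit_equal_width_bins(train + test, n_bins=4)

print("train-only cut points:", cut_points)
print("leaky cut points:     ", leaky_cut_points)
```

The two sets of cut points differ only because the test values were allowed to influence the second one, which is the kind of leak described above.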

What are some fundamental concepts that are not always taught in introductory books and classes? Here are three:

the bias-variance tradeoff,
no-free-lunch, and
Simpson’s paradox.

The bias-variance tradeoff explains why sticking a powerful piece of hardware in the closet for six months and doing a three-level lookahead to find the optimal structure doesn’t work. Too many papers have reported "surprising results" that are simply explained by the bias-variance decomposition of errors.
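For reference, the decomposition is easiest to state for squared-error loss. That choice of loss is an assumption here, since the interview does not name one, and versions for zero-one loss also exist. The notation is introduced only for this sketch: y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the model fitted on a random training set, with expectations taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

Read this way, a more exhaustive search over structures can lower the bias term while raising the variance term, so the total error need not improve.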

The no-free-lunch theorems by Wolpert, and similar observations about version spaces, point out something obvious: you can’t generalize without assumptions. When developing an algorithm, can you identify the circumstances under which the algorithm will outperform others? For example, feature selection algorithms (of which decision tree induction is one instance) are assuming that some features are less important (or irrelevant). We know that learning is impossible in all cases, just like compression is impossible without assumptions. The good news is that gzip does save space in practice, and that learning algorithms do work in real life.
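The compression analogy can be checked with a small experiment. This is a rough sketch using Python's standard `zlib` module as a stand-in for gzip; the two inputs are made up for illustration. Input with regular structure shrinks a lot, while input with no exploitable structure does not shrink at all.

```python
import random
import zlib


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (below 1.0 means space saved)."""
    return len(zlib.compress(payload, 9)) / len(payload)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Highly regular input: the compressor's assumptions (repetition) hold.
structured = b"the quick brown fox " * 500

# Pseudo-random bytes of the same length: nothing for the compressor to exploit.
noise = bytes(rng.getrandbits(8) for _ in range(len(structured)))

structured_ratio = compression_ratio(structured)
noise_ratio = compression_ratio(noise)

print(f"structured input ratio: {structured_ratio:.3f}")
print(f"random input ratio:     {noise_ratio:.3f}")
```

The compressor pays off only where its built-in assumptions match the data, which is the same position a learning algorithm is in.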

Finally, Simpson’s paradox

(GPS: this paradox occurs when the successes of groups seem reversed when the groups are combined. This may happen when a hidden variable, irrelevant to the individual group assessment, must be used in the combined assessment.)

Simpson's paradox is an observation that should always cause you to think twice about hidden factors. I gave a talk at the Bay Area ACM Data Mining SIG in June 2006 (http://exp-platform.com/acmDMSig.aspx) and shared some examples, and I had a flood of e-mails afterward. If you haven't seen examples before: the math is trivial, but it's called a paradox because it's unintuitive.
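A small worked example shows how trivial the arithmetic is. The counts below are illustrative only and are not taken from the interview or the talk; the variant and segment names are placeholders. Variant A has the higher success rate inside each segment, yet variant B has the higher rate once the segments are pooled, because the two variants were applied to very different mixes of segments.

```python
# (successes, trials) for each variant within each segment; illustrative counts.
results = {
    "A": {"segment 1": (81, 87), "segment 2": (192, 263)},
    "B": {"segment 1": (234, 270), "segment 2": (55, 80)},
}


def rate(successes, trials):
    """Success rate as a fraction."""
    return successes / trials


def pooled_rate(variant):
    """Success rate for a variant after combining all of its segments."""
    successes = sum(s for s, _ in results[variant].values())
    trials = sum(n for _, n in results[variant].values())
    return rate(successes, trials)


# Within each segment, A beats B.
for segment in ("segment 1", "segment 2"):
    a = rate(*results["A"][segment])
    b = rate(*results["B"][segment])
    print(f"{segment}: A={a:.3f}  B={b:.3f}")

# Pooled over segments, the ordering flips.
pooled_a = pooled_rate("A")
pooled_b = pooled_rate("B")
print(f"pooled:    A={pooled_a:.3f}  B={pooled_b:.3f}")
```

The hidden factor here is the segment mix: most of A's trials fall in the harder segment 2, while most of B's fall in the easier segment 1.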

My third recommendation is to view research papers with healthy skepticism, and try to replicate the results. Jerry Friedman at Stanford (co-author of CART, and one of my advisors) used to say that

when you read a paper where the author shows quantitative comparisons of their algorithm with several others, ignore the author’s result (usually at the top) and gather insights about how the other algorithms perform.

As researchers, we fall in love with our algorithms and keep tweaking them against the same datasets, thus inadvertently overfitting the data.
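One way to see this effect is a simulation in which nothing is actually predictive. In the sketch below, every feature and every label is an independent coin flip, so no rule can truly beat 50% accuracy. The setup (200 candidate rules, 100 tuning rows, 2000 fresh rows) is an assumption chosen for illustration. Picking the best of many candidates on one dataset still yields an impressive-looking score that does not survive on fresh data.

```python
import random


def make_dataset(rng, n_rows, n_features):
    """Random binary features with coin-flip labels: no feature is truly predictive."""
    rows = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    return rows, labels


def accuracy(rows, labels, feature):
    """Accuracy of the rule 'predict that the label equals this feature'."""
    hits = sum(row[feature] == label for row, label in zip(rows, labels))
    return hits / len(labels)


rng = random.Random(0)  # fixed seed so the run is repeatable
N_FEATURES = 200

# The dataset we keep "tweaking against", and an independent one we never touched.
tuning_rows, tuning_labels = make_dataset(rng, 100, N_FEATURES)
fresh_rows, fresh_labels = make_dataset(rng, 2000, N_FEATURES)

# Select the candidate rule that looks best on the tuning data.
best = max(range(N_FEATURES), key=lambda f: accuracy(tuning_rows, tuning_labels, f))

tuned_accuracy = accuracy(tuning_rows, tuning_labels, best)
fresh_accuracy = accuracy(fresh_rows, fresh_labels, best)

print(f"accuracy on the data used for selection: {tuned_accuracy:.3f}")
print(f"accuracy on fresh data:                  {fresh_accuracy:.3f}")
```

The gap between the two numbers comes entirely from selection, which is why a result reported without an independent test set deserves skepticism.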

My fourth recommendation is to constantly be drilling down into the details and raw data. There’s a nice quotation by Toby Harrah that says:

Statistics are like bikinis; they show a lot, but not everything.

Over and over again, I learned that one has to drill down to get fundamental insights. Most of the time, an amazing result will be traced to an error in the collection, the process, or the data manipulations (Twyman’s law says that any statistic that appears interesting is almost certainly a mistake). Sadly, this is true in many cases, but after nine drill-downs that lead to nothing, you will find a true gem.
