News
From: Arnold Goodman
Date: 23 Nov 2001
Subject: Commentary on KDD-2001 or what is Data Mining and Statistics
STATISTICS IS THE ROAD FROM DATA MINING TO KNOWLEDGE DISCOVERY
by Arnold Goodman, Associate Director,
UCI Center for Statistical Consulting, agoodman@uci.edu
I have spent forty years as a statistician within information
technology and I founded the Annual Symposia on the Interface of
Computing Science and Statistics. When I discovered data mining in
1997, I hoped that data mining and statistics would contribute to each
other and benefit from each other in solving client problems, by the
cross-fertilization of their approaches and results.
Although Interface '01 featured both data mining and
bioinformatics, KDD-2001 not only did not feature statistics, but also
seemed arrogant in its mistreatment of statistics. Still, most data
miners remain ignorant of statistics, most statisticians remain
ignorant of data mining, and they continue to sarcastically criticize
each other: arrogance and ignorance are a self-destructive
combination.
My comments need to be negative enough to penetrate the
atmosphere of ignorance and arrogance, yet positive enough to
motivate data miners to approach statistics productively. I offer
suggestions for data miners to improve and for program planners to
improve KDD-2002.
Unfortunately, the anti-statistical attitude will keep data mining
from reaching its actual potential. Such an attitude also
increases the probability that data mining will follow artificial intelligence and
expert systems through the typical computer technology stages of hype,
then hope, and finally has-been.
Data mining becomes knowledge discovery when it interprets and
assesses the mined patterns and relationships, according to "Principles
of Data Mining" by David Hand, Heikki Mannila and Padhraic Smyth. This
analytic journey, from the questions discovered in data through
answers developed from data to reasons provided by data, depends upon
statistical methods and thinking.
Randomness in data can be handled only by statistics, not by any
amount of database technology: making sense of data has always been
statistics, whether admitted or not. Data miners must stop relying
mostly on computational algorithms and denying a requirement for
statistical modeling.
My suggestion to data miners is a shift in attitude and a new appreciation of the requirements for:
- Broad statistical thinking to achieve what technology alone will not be capable of achieving
- A broader perspective on problems and a conscious openness to ideas from other disciplines
I challenge key data miners to start a constructive dialogue with
Interface statisticians and others.
KDD-2001 had twice the attendance and cost, but half the breadth and
quality, of Interface '01: just ask any data miner or statistician who
happened to attend both Interface '01 and KDD-2001. Everyone I
spoke to who was competent in statistics was underwhelmed by the
KDD program.
In the three-and-a-half days, there was one "pearl of wisdom" and
it involved data mining with statistics: Russ Altman characterized the
new biology as going from an idea, through collected data, to a
suggested hypothesis, and then to more data for testing whether this
hypothesis is true or false.
The panel on sampling had no one who was knowledgeable in sampling:
why did statisticians not want to participate? Of those presentations
I attended, around 1/3 were technically excellent, 1/3 were only good,
and 1/3 were either fair or poor. That leaves much room for
improvement in 2002.
My corresponding suggestion to KDD-2002 program planners is a focused increase in effort:
- To improve breadth, invite more presenters and relevant tutorials, with prior guidance
- For higher quality, provide stronger guidance to both your keynoters and your panelists
[Note from the Editor: While I don't think that data miners and KDD-2001 have
an anti-statistical attitude, I agree with Arnie that data miners should pay more
attention to statistics.
Why are many data miners ignorant of statistics?
If you have an opinion on this, please email the editor and
I will include selected comments in the next KDnuggets News. GPS]