KDnuggets : News : 2001 : n24 : item1

From: Arnold Goodman
Date: 23 Nov 2001
Subject: Commentary on KDD-2001 or what is Data Mining and Statistics

STATISTICS IS THE ROAD FROM
DATA MINING TO KNOWLEDGE DISCOVERY

by Arnold Goodman, Associate Director,
UCI Center for Statistical Consulting, agoodman@uci.edu

I have spent forty years as a statistician within information technology and I founded the Annual Symposia on the Interface of Computing Science and Statistics. When I discovered data mining in 1997, I hoped that data mining and statistics would contribute to each other and benefit from each other in solving client problems, by the cross-fertilization of their approaches and results.

Although Interface '01 featured both data mining and bioinformatics, KDD-2001 not only failed to feature statistics, but also seemed arrogant in its mistreatment of statistics. Still, most data miners remain ignorant of statistics, most statisticians remain ignorant of data mining, and they continue to criticize each other sarcastically: arrogance and ignorance are a self-destructive combination.

My comments need to be negative enough to penetrate the atmosphere of ignorance and arrogance, yet positive enough to motivate data miners to approach statistics productively. I offer suggestions for data miners to improve their practice and for program planners to improve KDD-2002.

Unfortunately, the anti-statistical attitude will keep data mining from reaching its actual potential. Such an attitude is also increasing the probability of its following artificial intelligence and expert systems through the typical computer technology stages of hype, then hope, and finally has-been.

Data mining becomes knowledge discovery when it interprets and assesses the mined patterns and relationships, according to "Principles of Data Mining" by David Hand, Heikki Mannila and Padhraic Smyth. This analytic journey, from the questions discovered in data through answers developed from data to reasons provided by data, depends upon statistical methods and thinking.

Randomness in data can be handled only by statistics, not by any amount of database technology: making sense of data has always been statistics, whether admitted or not. Data miners must stop relying mostly on computational algorithms and denying a requirement for statistical modeling.

My suggestion to data miners is a shift in attitude and a new appreciation of the requirements for:

  • Broad statistical thinking to achieve what technology alone will not be capable of achieving
  • A broader perspective on problems and a conscious openness to ideas from other disciplines

I challenge key data miners to start a constructive dialogue with Interface statisticians and others.

KDD-2001 had twice the attendance and cost, but half the breadth and quality, of Interface '01: just ask any data miner or statistician who happened to attend both Interface '01 and KDD-2001. Everyone I spoke to who was competent in statistics was underwhelmed by the KDD program.

In the three-and-a-half days, there was one "pearl of wisdom" and it involved data mining with statistics: Russ Altman characterized the new biology as going from an idea, through collected data that suggest a hypothesis, to more data for testing whether that hypothesis is true or false.

The panel on sampling had no one who was knowledgeable in sampling: why did statisticians not want to participate? Of those presentations I attended, around 1/3 were technically excellent, 1/3 were only good, and 1/3 were either fair or poor. That leaves much room for improvement in 2002.

My corresponding suggestion to KDD-2002 program planners is a focused increase in effort:

  • To improve breadth, invite more presenters and relevant tutorials with prior guidance
  • For higher quality, provide stronger guidance to both your keynoters and your panelists

[Note from the Editor: While I don't think that data miners and KDD-2001 have an anti-statistical attitude, I agree with Arnie that data miners should pay more attention to statistics. Why are many data miners ignorant of statistics? If you have an opinion on this, please email the editor and I will include selected comments in the next KDnuggets News. GPS]

Copyright © 2001 KDnuggets.