Sponsored by the American Association for Artificial Intelligence, in cooperation with the Committee on Data Engineering (TCDE). Cosponsored by
|Time||Westside Ballroom South, 5th Floor||Westside Ballroom North, 5th Floor|
|2:00-3:00||Session Th1. Chair: Alex Tuzhilin. Invited Talk: Don Haderle||T1. Heikki Mannila, Data Mining from a Database Perspective|
|3:00-3:30||Report from the Interface Conference, John Elder||T1 - continued|
|3:30-4:00||Report from the VLDB Conference, Rakesh Agrawal||T1 - continued|
|4:30-6:00||Paper Session #1, "Classification". Chair: Ernest Chan||T2. Jagadish and Faloutsos, Data Reduction Techniques and Spatial Data Mining|
|6:00-6:30|| ||T2 - continued|
|Time||Tutorial Track 1, Westside Ballroom South, 5th Floor||Tutorial Track 2, Westside Ballroom North, 5th Floor||Tutorial Track 3, Marquis Ballroom, 9th Floor|
|8:00-10:00||T3. Grossman and Bailey, High-performance DM: How to Scale ML Methods to DBs||T4. Donoho and Bennett, Fraud Detection Applications||T5. Banks and Levenson, Multivariate regression methods, high-dimensional problems|
|10:30-12:30||T8. Elder and Abbott, Software Tools: Comparisons of Vendors' Tools||T7. Jensen and Provost, Evaluation Methodologies||T6. Marron, Smoothing methods, structure from noise|
4:30-5:15, Exhibits Talk (Duffy/Columbia, 7th Floor)
Gordon Linoff (Data Miners, firstname.lastname@example.org), Data Mining in the Real World,
This presentation will discuss the issues of data mining in the real world. It will touch on the relationship of data mining with a data warehouse (is it really easier?) and on issues related to managing data and choosing particular techniques.
Session Fr2, Chair: Kyuseok Shim
We will first segment the data mining tools marketplace. We will then learn how to differentiate between the many data mining tools for the best return on investment. Finally, we will review the summary results of a real-life data mining tool evaluation case study. Come see what all the hype is about!
10:30-12:00, Session Sat2, "Clustering". Chair: Jiawei Han
4:30-6:00, Session Sat4. Chair: Sam Uthurusamy
8:30-10:00, Session Sun1: "Discovery in Time". Chair: Pedro Domingos
10:30-12:30, Session Sun2. Chair: Jan Zytkow
2:00-4:00, Session Sun3. Chair: Gregory Piatetsky-Shapiro
Workshop room assignments will be made later.
Moderator: Rakesh Agrawal, IBM Almaden Research Center
Panelists: Umeshwar Dayal (H.P. Research), Surajit Chaudhuri (Microsoft Research), Tomasz Imielinski (Rutgers University), Heikki Mannila (University of Helsinki, Finland), and Jiawei Han (Simon Fraser University, Canada)
Early research on data mining concentrated on identifying mining operations and developing algorithms for them. Most early data mining systems were built directly on file systems, with specialized data structures and buffer management strategies devised for each algorithm. Coupling with database systems was loose at best: access to data in a DBMS was provided through an ODBC or SQL cursor interface. An important recent trend in information processing is the installation, in an ever increasing number of enterprises, of large data warehouses built around relational database technology. The current interest in data mining is driven largely by the desire to mine nuggets of knowledge from these data warehouses. This panel will address architectural issues in coupling data mining to database systems. Specifically, the panel will explore the following questions:
Moderator: Ellen Spertus, Mills College and Computer Professionals for Social Responsibility
Panelists: Jason Catlett, Junkbusters Corp.; Dan Jaye, Engage Technologies; and Daryl Pregibon, AT&T Labs
Data mining allows unprecedented opportunities for targeted marketing, which can be seen either as a boon for advertisers and consumers or as an enormous invasion of privacy. This panel considers ownership of personal information and explores how maximum benefits can be obtained from data mining while respecting individuals' privacy.
Panelists: Graham Spencer, Excite; Gerald Fahner, Fair, Isaac & Co.; and Paul DuBose, Analytika, Inc.
Truly successful technologies become invisible. Data mining has a long way to go - or does it? Most stories of data mining applications focus on an expert's use of raw technology to solve one particular problem, but millions of people use or are affected by data mining technology every day, without even being aware of it. The panelists will discuss examples and characteristics of this "behind-the-scenes" variety of data mining.
Database Methods for Data Mining
Heikki Mannila, University of Helsinki, Finland
The tutorial explains the basic database techniques that can be used in data mining. We start by discussing the basics of database methods and the ideas behind data warehousing. Then we describe OLAP (on-line analytical processing) and the similarities and differences between OLAP and data mining.
The main part of the tutorial shows how simple descriptive patterns such as association rules, sequential patterns, and integrity constraints can be found efficiently in large databases using relatively straightforward methods. After that we discuss methods and data structures that can be used to store large amounts of multidimensional data, and how learning algorithms can be scaled to handle large datasets. The final part of the tutorial concerns emerging trends, such as discovery from semistructured data like web pages.
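The association rules mentioned above can be sketched with a minimal Apriori-style pass over the data. Everything here is illustrative: the transactions and the support/confidence thresholds are invented, and the search is limited to item pairs for brevity.

```python
from itertools import combinations

# Hypothetical market-basket transactions (each a set of items).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
min_support = 0.6     # fraction of transactions containing the itemset
min_confidence = 0.7  # support(X ∪ Y) / support(X)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent pairs, then rules X -> Y meeting both thresholds.
items = sorted(set().union(*transactions))
for a, b in combinations(items, 2):
    s = support({a, b})
    if s >= min_support:
        for x, y in ((a, b), (b, a)):
            conf = s / support({x})
            if conf >= min_confidence:
                print(f"{x} -> {y}  support={s:.2f} confidence={conf:.2f}")
```

A real miner would prune candidate itemsets level by level rather than enumerate all pairs, but the support/confidence bookkeeping is the same.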
Heikki Mannila is a professor of computer science at the University of Helsinki in Finland. He currently works on data mining on event sequences and semistructured data, the foundations of data mining, and the problem of fitting large statistical models to big datasets. He is Editor-in-Chief of the Data Mining and Knowledge Discovery journal.
H. V. Jagadish, AT&T Laboratories and Christos Faloutsos,
Carnegie Mellon University and University of Maryland, College Park
Given a warehouse with a very large data set, we describe how to reduce the data set size used for analysis, through appropriate approximate representations. In the data mining process, there is a critical data representation step, typically including data reduction, after data acquisition and cleaning, but before data analysis. In this tutorial we will learn about the choices one can make in this step, and the tradeoffs involved.
The central issue in data reduction is to control the error introduced by the approximation and to trade this off against the savings in storage and processing cost. Given the rich diversity of data analysis techniques, a matching diversity of corresponding data reduction techniques is required. We will describe several of these techniques, including some recent database tools for data mining in massive datasets: feature extraction for multimedia indexing by content, singular value decomposition for lossy compression, and methods for information reconstruction.
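The singular value decomposition mentioned above illustrates the storage/error tradeoff concretely. This is a minimal NumPy sketch under invented assumptions (a random low-rank matrix plus noise; rank k chosen by hand):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 100x20 data matrix with underlying rank-3 structure plus noise.
A = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20)) \
    + 0.01 * rng.normal(size=(100, 20))

# Keep only the k largest singular values/vectors: the best rank-k
# approximation of A in the least-squares sense (Eckart-Young theorem).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Storage drops from 100*20 values to k*(100 + 20 + 1), and the
# approximation error is governed by the discarded singular values.
rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"relative error at rank {k}: {rel_error:.4f}")
```

The controllable quantity is k: larger k means more storage and less error, which is exactly the tradeoff the tutorial discusses.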
Prerequisites: familiarity with B-trees and with the basic concepts of relational database systems (tables, attributes, joins).
Christos Faloutsos is working on physical database design, text and spatial access methods, multimedia databases, and data mining. He has received two "best paper" awards (SIGMOD 94, VLDB 97). He has filed for three patents, and he has published over 70 refereed articles and one monograph. He is currently on leave at Carnegie Mellon University.
H. V. Jagadish obtained his PhD from Stanford University in 1985, and has since been with AT&T, where he currently heads the Database Research Department. He has published over 75 articles, and has filed over 30 patents. He was the co-organizer of an expert workshop that resulted in the "New Jersey Data Reduction Report".
A Tutorial Introduction to High Performance Data Mining
Robert Grossman, Magnify, Inc. and National Center for Data Mining,
University of Illinois at Chicago and
Stuart Bailey, National Center for Data Mining, University of Illinois at Chicago
Data mining is the automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data. Scaling data mining to large data sets is a fundamental problem with important practical applications. The goal of the tutorial is to provide researchers, practitioners, and advanced students with an introduction to mining large data sets by exploiting techniques from high performance computing and high performance data management. We will describe several architectural frameworks for high performance data mining systems and discuss their advantages and disadvantages. We will use several case studies involving mining large data sets, from 10 to 1000 gigabytes in size, as running examples.
Robert Grossman is the President of Magnify, Inc. and Director of the National Center for Data Mining at the University of Illinois at Chicago. He has been a leader in the development of high performance and wide area data mining systems for over ten years. He has published widely on data mining and related areas and speaks frequently on the subject.
Stuart Bailey is a member of the technical staff at the National Center for Data Mining at the University of Illinois at Chicago. He has led the software development effort for the Terabyte Challenge Data Mining Demonstrations during the last three Supercomputing Conferences.
Fraud Detection and Discovery
Steven K. Donoho and Scott W. Bennett, SRA International, Inc.
This tutorial covers automated techniques for detecting fraud in areas such as health care, insurance, banking, telecommunications, and finance. Particular emphasis is given to how knowledge discovery techniques can be used to discover new fraud scenarios. Fraud detection is ripe for the application of KDD techniques because of the large volume of data involved and the amount of money lost each year to fraud. The U.S. General Accounting Office estimates that fraud accounts for 3% to 10% of the more than $1 trillion spent on health care each year, and cellular phone fraud in the US is estimated at $5 billion per year.
We cover typical fraud detection workflow issues such as realtime vs. offline detection, alert investigation, gathering further information, alert explainability, and actions taken when fraud is still suspected after investigation. KDD issues covered include profiling, supervised and unsupervised approaches, analyzing sequences of events, practices vs. isolated incidents, and concept drift. Participants will go away with an understanding of fraud detection work to date, how discovery fits into the larger scheme of fraud detection, and challenges, pitfalls, and open research issues related to fraud detection.
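The profiling idea above can be illustrated with a toy unsupervised detector: build a per-account profile of past behavior and flag new activity that deviates sharply from it. The account names, call counts, and 3-sigma threshold are all invented for illustration.

```python
import statistics

# Hypothetical daily call counts per account: a behavioral profile
# from past days, plus one new observation to score against it.
history = {
    "acct-1": [3, 4, 2, 5, 3, 4, 3],
    "acct-2": [10, 12, 11, 9, 10, 11, 10],
}
today = {"acct-1": 4, "acct-2": 55}

def is_anomalous(past, value, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the profile mean."""
    mu = statistics.mean(past)
    sigma = statistics.stdev(past)
    return abs(value - mu) > threshold * sigma

alerts = [acct for acct, v in today.items() if is_anomalous(history[acct], v)]
print(alerts)  # acct-2's jump from ~10 calls to 55 triggers an alert
```

Real systems replace the single statistic with richer profiles (call destinations, times of day, sequences of events) and must cope with concept drift, but the "compare to the account's own past" structure is the same.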
Steve Donoho is a research scientist at SRA International. He received his PhD in 1996 from the University of Illinois focusing on automated change of representation techniques. His work at SRA has included customizing KDD techniques to detect fraud on the NASDAQ Market, devising scalable techniques for very large data sets, and developing data visualization tools for analyzing suspected fraud cases.
Scott Bennett is Technology Director of the Intelligent Information Systems Division at SRA International. His data mining experience includes work with very large structured and unstructured data sets, parallel implementations, discovery and detection algorithms, and graphical tools for analysis of mining results. He received his PhD in 1993 from the University of Illinois in machine learning and planning.
New-Wave Nonparametric Regression Methods for KDD
David Banks and Mark S. Levenson, National Institute of Standards and Technology
In the last decade, statisticians have developed a group of modeling procedures designed to handle large, high-dimensional datasets. These procedures, which we call new-wave nonparametric regression methods, tend to be very flexible and interpretable. They offer competitive alternatives both in applicability and feasibility to traditional statistical and machine learning methods. The first half of this tutorial will review the current front-running methods, including MARS (Multivariate Adaptive Regression Splines), PPR (Projection Pursuit Regression), AM (Additive Models), recursive partitioning regression (known commercially as CART), ACE (Alternating Conditional Expectations), AVAS (an approach based on variance stabilization), neural nets, and LOESS (locally weighted regression), as well as classical multiple linear regression and stepwise regression. The second half of the tutorial will address practical considerations in the choice and use of the methods. This will be based on a large-scale simulation experiment and examples of the methods applied to various datasets. Freeware software will be used to demonstrate the procedures.
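Of the methods listed above, locally weighted regression (LOESS) is simple enough to sketch in a few lines of NumPy. The data, bandwidth, and Gaussian weighting kernel here are illustrative assumptions, not the tutorial's own implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical noisy sample from a smooth curve.
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.2 * rng.normal(size=x.size)

def loess(x, y, x0, bandwidth=0.5):
    """Locally weighted linear fit evaluated at a single point x0."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)  # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])
    # Weighted least squares: solve (X'WX) beta = X'Wy.
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta[0]  # intercept = fitted value at x0

fit = np.array([loess(x, y, x0) for x0 in x])
```

The bandwidth plays the same role as the smoothing parameters discussed in the second half of the tutorial: too small and the fit chases noise, too large and it flattens real structure.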
David Banks is a mathematical statistician at the National Institute of Standards and Technology (and formerly Associate Professor of Statistics at Carnegie Mellon University). He is Co-editor of the Encyclopedia of Statistical Sciences, and Chair-Elect of the Classification Society of North America. An applied statistician, he pursues forays into mathematics and information technology.
Mark Levenson is a mathematical statistician at the National Institute of Standards and Technology. He received a Ph.D. in 1993 from the Department of Statistics at The University of Chicago. His research interests are in the areas of image processing, data mining, and statistical problems in the physical and engineering sciences.
Smoothing Methods for Learning from Data
J. S. Marron, University of North Carolina
Real data examples are used to illustrate how smoothing methods provide a powerful tool for gaining insights from data. The three crucial issues for the application of smoothing methods in massive data contexts are computational speed, choice of the smoothing window width, and assessment of the statistical significance of discovered features.
The fast smoothing method that scales up well, called "binning" or "WARPing", will be explained together with some enhancements. The large literature on data-based choice of window width for smoothing methods will not be reviewed in detail, since the "family approach" seems more useful for data mining applications. Determination of the statistical significance of features will be addressed from the viewpoints of formal mode testing and of SiZer (based on scale-space ideas).
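The binning idea can be sketched as follows: instead of summing a kernel over all n data points at every evaluation point, first count the data into bins in one pass, then smooth the much smaller array of bin counts. The sample, grid, and bandwidth below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=100_000)  # hypothetical large sample
h = 0.3                          # Gaussian kernel bandwidth

# Step 1: bin the data onto a coarse grid -- a single pass over n points.
counts, edges = np.histogram(data, bins=400, range=(-4, 4))
centers = (edges[:-1] + edges[1:]) / 2

# Step 2: smooth the bin counts with the kernel -- work is now
# proportional to the number of bins, not the number of data points.
density = np.array([
    np.sum(counts * np.exp(-0.5 * ((c - centers) / h) ** 2))
    for c in centers
]) / (data.size * h * np.sqrt(2 * np.pi))
```

Because the second step is a discrete convolution on a regular grid, implementations can accelerate it further with an FFT; that is one of the enhancements the tutorial alludes to.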
J.S. Marron is a professor of Statistics at the University of North Carolina, Chapel Hill. In 1982, he earned a PhD in Mathematics from the University of California, Los Angeles. His research interests include: statistical smoothing methods, smoothing parameter selection, fast implementations of smoothers and curves and surfaces as data.
Evaluating Knowledge Discovery and Data Mining
Foster Provost, Bell Atlantic Science and Technology and
David Jensen, University of Massachusetts, Amherst
Both the science and practice of KDD stand to benefit from a common understanding of the strengths and limitations of the many frameworks for evaluating results. We will explain and criticize a wide variety of evaluation techniques, illustrating the similarities, but focusing on the important small differences. We first discuss the difference between evaluating models and evaluating model-building algorithms, which leads into a description of the traditional scientific frameworks for comparing KDD results. We then show where these frameworks are weak statistically and recommend techniques for strengthening them. Next, we discuss weaknesses of these frameworks when it comes to the practical application of data mining results. We show how to make evaluations more robust for a wide variety of real-world data mining scenarios, comparing and contrasting metrics such as sensitivity, specificity, positive predictive value, precision, and recall, and frameworks such as lift and ROC curves. Finally, expanding our view, we consider the general problem of searching for interesting patterns. We describe a diverse collection of techniques, including Bayesian and Bonferroni adjustments, blindfold trials, interestingness criteria, and the use of prior domain knowledge.
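The confusion-matrix metrics the tutorial compares and contrasts can be written down directly. The counts below are invented; the point is how the same four numbers yield each metric, a lift figure, and a point on an ROC curve.

```python
# Hypothetical confusion-matrix counts from a classifier at one threshold.
tp, fp, fn, tn = 80, 30, 20, 870

sensitivity = tp / (tp + fn)   # a.k.a. recall, true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision   = tp / (tp + fp)   # a.k.a. positive predictive value
recall      = sensitivity      # same quantity, different community's name

# Lift: how much more common positives are among flagged cases than overall.
base_rate = (tp + fn) / (tp + fp + fn + tn)
lift = precision / base_rate

# One point on the ROC curve for this operating threshold.
roc_point = (1 - specificity, sensitivity)  # (false positive rate, TPR)
print(sensitivity, specificity, precision, lift, roc_point)
```

Sweeping the classifier's threshold and recomputing `roc_point` at each setting traces out the full ROC curve; lift charts come from the same counts ordered by model score.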
Foster Provost's research concentrates on weakening the simplifying assumptions that prevent inductive algorithms from being applied successfully. He received his Ph.D. in Computer Science from the University of Pittsburgh in 1992. He has worked on automated knowledge discovery in science, and is currently with Bell Atlantic Science and Technology.
David Jensen is research assistant professor of computer science at the University of Massachusetts, Amherst. His research focuses on learning and KDD, particularly the statistical properties of KDD algorithms. He is managing editor of Evaluation of Intelligent Systems, a web-accessible resource about empirical methods for studying AI systems.
A Comparison of Leading Data Mining Tools
John F. Elder IV and Dean W. Abbott, Elder Research
Several high-performance, but costly, software products for Knowledge Discovery and Data Mining have recently been introduced. Most feature multiple modeling and classification algorithms and/or increased support for key data-handling and interpretation stages of the KDD process. Still, they compete with a healthy (and growing) lineup of desktop products vying for survival - some of which are focused on particular vertical markets. Natural questions to ask include: "Which product is best?", "Will a general-purpose tool suffice for my application?", and "Are the high-end ones worth it?" This tutorial will address such questions by providing an overview of the current field of Data Mining software tools. The instructors will highlight the distinctive properties and relative strengths of several major products, and share practical insights and observations from their use.
Though technical in parts, the tutorial should benefit both "enterprise technologists" and "line-of-business executives". All participants with a practical application in mind should gain insight into the tools most likely to add near-term value.
Outline: (brief) descriptions of algorithms (classical statistical, neural network, decision tree, polynomial network, density-based, rule induction); survey of high-end products (ease of use, comparative strengths, distinctive properties, cost); lower-end products with complementary abilities [as time allows].
John Elder heads a small, but growing, Data Mining research and consulting firm in Charlottesville, Virginia. He chairs the Adaptive and Learning Systems Group of the IEEE-SMC, is an Adjunct at the University of Virginia, has created influential DM algorithms and short courses, and writes and speaks often on KDD.
Dean Abbott is a Senior Research Scientist at Elder Research, in San Diego, California. He has Engineering and Mathematics degrees from the University of Virginia and Rensselaer Polytechnic Institute, and experience at three consulting firms. An expert in pattern discovery, Mr. Abbott has designed and implemented algorithms for commercial DM software.