KDD Nuggets is a newsletter for the Knowledge
Discovery in Databases (KDD) community, focusing on the latest research and
applications.
Submissions are most welcome and should be emailed, with a DESCRIPTIVE
subject line (and a URL, when available), to kdd@gte.com.
To subscribe, email kdd-request@gte.com with
subscribe kdd-nuggets
in the first line of the message (the rest of the message and subject are
ignored). See http://info.gte.com/~kdd/subscribe.html for details.
Nuggets appears approximately 3 times a month.
Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
and a wealth of other information on Data Mining and Knowledge Discovery
are available at the Knowledge Discovery Mine site http://info.gte.com/~kdd
-- Gregory Piatetsky-Shapiro (editor)
********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories) *
*****************************************************************************
January 20, 1997, Issue: 614
Section: InformationWeek Labs
Debunking Data-Mining Myths --
Don't let contradictory claims about
data mining keep you from improving
your business
By Robert D. Small
A great deal of what is said about data mining is
incomplete, exaggerated, or wrong. Data mining has
taken the business world by storm, but as with many
new technologies, there seems to be a direct
relationship between its potential benefits and the
quantity of often-contradictory claims, or myths,
about its capabilities and weaknesses. It's difficult to
fight these myths, which are based on
misunderstandings, hopes, and fears. The new
technology cycle typically goes like this: Enthusiasm
for an innovation leads to spectacular assertions.
Ignorant of the technology's true capabilities, users
jump in without adequate preparation or training.
Then, sobering reality sets in. Finally, frustrated and
unhappy, users complain about the new technology
and urge a return to 'business as usual.' When you
undertake a data-mining project, avoid a cycle of
unrealistic expectations followed by disappointment.
Understand the facts instead, and your data-mining
efforts will be successful. Simply put, data mining
is used to discover patterns and relationships in your
data in order to help you make better business
decisions.
Myth: Data mining produces surprising results that
will utterly transform your business.
Fact: Most often, the results of data mining yield
steady improvement to an already successful
organization, often contributing important incremental
changes rather than revolutionary ones.
Nevertheless, data mining can lead to significant
change in several ways. First, it may give the talented
business manager a small advantage each year, on
each project, with each customer. Compounded over
a period of time, these small advantages turn into a
large competitive edge. For example, a catalog retailer
that can better target its mailing list can increase
profits by reducing the cost of mailings while
increasing the number of orders. Over time, this can
result in a substantially more profitable business.
Second, data mining occasionally does uncover one
of those rare 'breakthrough' facts, such as scientists'
noticing the association between the fatal Reye's
syndrome and aspirin use in children.
In short, data mining is a powerful search tool for
forward-looking companies.
Myth: Data-mining techniques are so sophisticated
that they can substitute for domain knowledge or for
experience in analysis and model building.
Fact: No analysis technique can replace experience
and knowledge of the business and its markets. On
the contrary, data mining makes education and
experience in many areas more important than
ever. While experts may need to learn new analytical
techniques to stay current and make leading-edge
contributions, someone who's an expert only in
analytical techniques, without having knowledge of
the business, is of no help.
Experience in building models, however, can ensure
more profitable use of data mining, since data
mining is simply the newest tool for building models.
The less domain knowledge a data mining expert
brings to a problem, the more important it is to
perform the data mining in close cooperation with
people who understand the business.
Similarly, the less skill and experience that business
experts have in modeling and using the associated
tools, the more help they need from data-mining
experts in leveraging their business knowledge.
For example, financial analysts seeking to increase the
return on their clients' investments may ask an expert
data miner to analyze a large, complex database on
previous clients. The data miner may discover that
certain variables predict success in investing, but it
takes a financier to know whether it's legal to influence
those variables.
Myth: Data-mining tools automatically find the
patterns you're looking for, without being told what to
do.
Fact: Data mining is most cost-effective when used
to solve a particular problem. Although a data-mining
tool can indeed explore your data and uncover
relationships, it still needs to be directed toward a
specific goal. Simply giving a data-mining tool a
mailing list and expecting it to find customer profiles
that improve the efficiency of a direct-mail campaign
is not particularly effective. You need to be more
specific in your goals. For example, to improve the
value of mailing-list responses, your model might
emphasize customers who have previously bought
expensive items; to increase the number of
responses, your model might emphasize customers
who have responded to previous mailings.
Myth: Data mining is useful only in certain areas, such
as marketing, sales, and fraud detection.
Fact: Virtually any process from pharmacology to
customer service can be studied, understood, and
improved using data mining. These techniques are
being applied to such diverse applications as
manufacturing process control, human resources, and
food-service management.
Data mining is useful wherever data can be collected.
Of course, in some instances, cost/benefit
calculations might show that the time and effort of the
analysis is not worth the likely return. For example,
suppose you suspect that if you collect just one more
piece of information about your customers, you could
double the number of orders you receive. But you
also know that mailing to twice as many people will
also double the number of orders. If gathering the
data is more expensive than sending the extra
mailings, then it makes sense to increase the mailings
rather than mine the data.
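The comparison is simple arithmetic. A minimal sketch of the calculation,
with every figure hypothetical:

  # Back-of-envelope check: is it cheaper to double orders by gathering
  # new data or by doubling the mailing? All numbers are invented.
  cost_per_piece = 0.50                # assumed cost of one mail piece
  list_size = 100_000                  # assumed current list size
  extra_mailing_cost = cost_per_piece * list_size    # mail twice as many
  data_gathering_cost = 75_000.0       # assumed cost of the new data field

  if data_gathering_cost > extra_mailing_cost:
      print("Cheaper to mail more:", extra_mailing_cost)
  else:
      print("Cheaper to gather data:", data_gathering_cost)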
Myth: The methods used in data mining are
fundamentally different from the older quantitative
model-building techniques.
Fact: All methods now used in data mining are natural
extensions and generalizations of analytical methods
known for decades. Neural nets, a special case of
projection pursuit regression, were developed in the
1940s. CART (classification and regression trees)
methods were used by social scientists in the 1960s.
K-nearest neighbor, a form of density estimation, has
been used for a half-century.
All these methods, just like regression
techniques, model relationships between a set of
profile variables and an outcome.
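To make that common structure concrete, here is a toy k-nearest-neighbor
classifier in Python (all data invented); note that it has exactly the
same shape as a regression model: profile variables in, predicted
outcome out.

  # Toy k-nearest-neighbor: maps profile variables (x) to an outcome.
  from collections import Counter

  def knn_predict(train, x, k=3):
      """train: list of (profile_vector, outcome) pairs."""
      dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
      nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
      return Counter(y for _, y in nearest).most_common(1)[0][0]

  # (age, prior_buyer) -> response; invented training examples
  customers = [((25, 1), "respond"), ((52, 0), "ignore"),
               ((31, 1), "respond"), ((60, 0), "ignore"),
               ((45, 1), "respond")]
  print(knn_predict(customers, (33, 1)))   # -> "respond"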
What's new in data mining is that we're now applying
these techniques to more general business problems,
thanks to the increased availability of data and
inexpensive processing power.
Furthermore, because communication between the
business community and methodologists, who are
mainly academics, has often been poor, there was,
until recently, no user-friendly software for
implementing these methods. The recent interest in
data mining is in part due to the improved user
interfaces that make these techniques more available
to business experts.
The rise of these powerful methods is a great step
forward, but the old tools are still valuable. Varieties
of regression techniques, discriminant analysis, and
even simple graphs can help reveal hidden patterns.
No single method solves all or even a majority of
problems. Successful data mining requires a portfolio
of tools, both old and new.
Myth: Data mining is an extremely complex process.
Fact: The algorithms of data mining may be complex,
but new tools have made those algorithms easier to
apply. Often, just the correct application of relatively
simple analyses, graphs, and tables can reveal a great
deal about your business. Much of the difficulty in
applying data mining comes from the same
data-organization issues that arise when using any
modeling technique. These include data preparation
tasks, such as deciding which variables to include and
how to encode them, and deciding how to interpret
and take advantage of the results.
Myth: Only massive databases are worth mining.
Fact: It's true that many methods used in data mining
were specifically developed for analyzing very large
data sets, and that many data-mining applications
involve massive data sets. But a moderately sized or
small data set can also yield valuable information. For
example, buying patterns may depend most strongly
on the day of the week or the time of the year. A
modest database consisting of only 'day' and 'sales'
could show this pattern, give the retailer some idea of
its magnitude, and allow for planning of inventory and
staffing.
Even when building a massive database, try out some
simple analysis on the data while the database is still
moderate in size. You may decide to collect the data
differently or to collect different data altogether.
Myth: Data mining is more effective with more data,
so all existing data should be brought into any
data-mining effort.
Fact: More data items are useful only if they
contribute more information about the issues or goals
at hand. Otherwise, they can be worse than
worthless. A database may have a great deal of
information about an item (or about the relationship
between items) but nothing about other items that are
actually closely related. For example, a company may
have information about how customers use one credit
card, but nothing about how those customers use
their other credit cards.
Moreover, adding data with little information content
can actually lower the predictive power of the
database. Including irrelevant data or multiple
measurements of the same item reduces the utility
of the data-mining results. For example,
if you include age as well as birth date, the analysis
tool will discover that both factors are equally relevant
and will therefore assign a lower weight to both
measures as predictors.
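That dilution is easy to demonstrate. A toy sketch (ridge regression on
invented data; not whatever tool the author has in mind):

  # Duplicating a predictor (age and birth date carry the same
  # information) splits its weight, diluting each copy's importance.
  import numpy as np

  rng = np.random.default_rng(0)
  age = rng.uniform(20, 70, size=200)
  spend = 3.0 * age + rng.normal(0, 5, size=200)  # outcome driven by age

  def ridge_fit(X, y, lam=1.0):
      d = X.shape[1]
      return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

  print(ridge_fit(age[:, None], spend))   # one copy:  weight ~ 3.0
  X2 = np.column_stack([age, age])        # same information twice
  print(ridge_fit(X2, spend))             # two copies: weights ~ 1.5 each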
Myth: Building a data-mining model on a sample of a
database is ineffective, because sampling loses the
information in the unused data.
Fact: The thrust of almost all developments in the
study of sampling is to maximize the amount of
information gained per unit of effort expended.
Keep in mind that your data probably already
represents a sample of a larger population. When you
analyze your customer database to help acquire new
customers, you're basing your model on a sample of
the total population.
Under some circumstances, you may be forced to
sample. Not all your data may be relevant to the
problem at hand or reflect the population you're trying
to model. Many data warehouses include historical
data that reflects conditions (such as unexpired
patents) that no longer apply, rendering it
inappropriate for building a model to guide future
decisions.
Sometimes full-scale data-gathering is not practical.
For example, if you'd like to learn about customers'
satisfaction with your new product or service, but it
takes an hour to administer a customer satisfaction
survey, you'll most likely decide to limit your analysis
to a sample.
In fact, a relatively small random probability sample,
correctly taken, can yield excellent results. Although
there are 60 million or more voters in a presidential
race, the final poll before the election, which is based
on two-thousandths of 1% of those voters, is seldom
off by more than 2%. If we had a database of all 60
million voters and hundreds of measurements on each
one, we couldn't build a better model for predicting
the winner.
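The arithmetic behind that claim is standard sampling theory. A quick
check (95% confidence, simple random sampling assumed):

  # Two-thousandths of 1% of 60 million voters, and the margin of error
  # for a proportion near 50% in a simple random sample of that size.
  import math

  population = 60_000_000
  sample = population * 0.00002        # two-thousandths of 1% = 1,200
  moe = 1.96 * math.sqrt(0.25 / sample)
  print(int(sample), f"{moe:.1%}")     # -> 1200, about 2.8%

A sample of roughly 1,200 yields a margin of error of about three
points, the same order as the 2% figure cited, and growing the sample
further buys very little additional precision.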
Even when it's possible to build the model on the
entire database, you may choose not to. It's often a
better use of resources to build and evaluate many
models using samples of the data, rather than rely on a
single model using all the data.
Myth: Data mining is another fad that will soon fade,
allowing us to return to standard business practice.
Fact: Although the name may change, data mining as
a vital application will not go away. Companies have
been using related quantitative techniques in many
parts of their businesses for a long time. Data mining
is just one more advance in a research process that
has been ongoing since the beginning of the 20th
century. A recent increase in the power of computers,
coupled with cheap electronic methods for capturing
large amounts of data, brings us to this step now.
Data mining can't be ignored: the data is there, the
methods are numerous, and the advantages that
knowledge discovery brings to a business are
tremendous. Companies whose data-mining efforts
are guided by 'mythology' will find themselves at a
serious competitive disadvantage to those
organizations taking a measured, rational approach
based on facts.
Robert D. Small is VP of Research of Two Crows
Corp. in Potomac, Md. He can be reached at
bob@twocrows.com.
SIDEBAR: Six Steps For Successful Data Mining
- Identify the goal
- Assemble the relevant data
- Choose your analysis methods
- Decide which software tool is best for implementing
the method
- Run the analysis
- Decide how to implement the results
Data: Two Crows Corp.
Copyright © 1997 CMP Media Inc.
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: uffenheimer11@home.com
Date: Thu, 16 Jan 1997 22:15:57 -0800
Subject: EDS inroads into the data warehouse, data mining, DSS areas
EDS, the largest computer service provider in the world, has established
a focused consulting practice in the area of data warehousing, data
mining and decision support systems. EDS built a world-class integration
lab (in the domain of the insurance industry) to demonstrate
applications, test tools, integrate solution components, and build proofs
of concept. For a free white paper and additional information, please
contact Nathan Uffenheimer at (972) 604-8915.
>~~~Publications:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: 'jpbrown' (jpbrown@hal-pc.org)
Organization: Ultimate Resources
Date: Thu, 16 Jan 1997 13:57:24 -0006
Subject: What Needs To Be Done, And Why.
Descriptive Introduction: The Databases that are the core of
Data Warehousing are not just repositories. Together, they
form an interactive machine that makes it possible to learn
much more about the constituent population or populations.
This expands on: http://www.hal-pc.org/~jpbrown
Text: Most data collections are hybrid in one way or another.
I have spent several years studying many actual cases. Over and
over again, I ran into the apples and oranges problem, where
there are sub-populations that are very different, one from
another. I do not need to tell you how confusing the results
of analysis can be, if these situations are ignored.
I have continued to devise ways to detect the anomalies of the
hybrid database, always assuming that some aspects of this problem
may be present, or may develop with the passage of time. If they
do develop as time goes on, there needs to be a method for
detecting the onset of Change. I have developed, and expect
to continue to develop, new methods to make effective, reliable
analyses in cases where hybrid sub-populations are recognized.
In using these techniques you can:
* take an unfamiliar population and diagnose potential problems.
* identify the causes of the problems.
* apply different methods that will measure the analyzability of
naturally occurring hybrid populations.
* suggest ways to increase the utility of data, or to point out
that some types of data are incurably unhelpful.
* use different techniques (Autoclassification) to separate out
sub-populations, based on predictability or other sources
of coherence.
* make reliable predictions.
* detect and remedy Changes in causal systems that would
otherwise reduce reliability.
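As a generic illustration of the apples-and-oranges problem (not the
author's SuperInduction methods), a hybrid population of two sub-groups
makes a single overall average misleading, while a crude two-means split
recovers the structure:

  # Invented data: two very different sub-populations mixed together.
  values = [9.8, 10.1, 10.3, 9.9, 30.2, 29.7, 30.5, 29.9]

  mean = lambda xs: sum(xs) / len(xs)
  lo, hi = min(values), max(values)
  for _ in range(10):                  # simple one-dimensional 2-means
      a = [v for v in values if abs(v - lo) <= abs(v - hi)]
      b = [v for v in values if abs(v - lo) > abs(v - hi)]
      lo, hi = mean(a), mean(b)

  print("overall mean:", mean(values))       # ~20: describes nobody
  print("sub-population means:", lo, hi)     # ~10.0 and ~30.1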
So far, the great strides that have been taken in Databases, Data
Marts and Data Warehouses, have been advances in Data Manipulation.
The next great strides will be taken in SuperInduction, and they
will be applied before, during, and after the various steps of
manipulation.
The resulting Output:
* will be based, without prejudice (objectively), on the Input.
* will also have had the benefit of many kinds of new knowledge,
developed during the analytical process.
* and will be ideally presented to produce the best possible
results for the corporate user (Decision Support).
If you have gone through the Web Site http://www.hal-pc.org/~jpbrown
and you want to see some of the extra complex links, let me know at
jpbrown@hal-pc.org
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 17 Jan 1997 08:46:53 -0500
From: fazel@ai.iit.nrc.ca
(Fazel Famili)
Subject: Intelligent Data Analysis Journal - First Issue is live
Intelligent Data Analysis - An International Journal (New)
An electronic, Web-based journal
Published by Elsevier Science
The first issue of the Intelligent Data Analysis journal is now live. This
is a quarterly journal published by Elsevier Science Inc. The journal is
planning to offer a number of new features that are not currently available
in paper journals: (i) an alerting service notifying subscribers of new
papers in the journal, (ii) links to large-scale data collections, (iii)
links to secondary collection of data related to material presented in the
journal, (iv) the ability to test new search mechanisms on the collection
of journal articles, (v) links to related bibliographic material, and (vi)
inclusion of 3-D objects and multiple color graphs.
Please refer to one of the above sites for the articles in the first issue
and the journal home page (e.g. Aims and Scope, Author Submission
Guidelines, and more).
Best wishes,
A. Famili
Editor-in-Chief
>~~~Siftware:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 23 Jan 1997 10:02:10 -0500
From: binli@shasha.cs.nyu.edu
(Bin Li)
Subject: new siftware entry for PC4.5
Could you add an entry in the Siftware page for our parallel
C4.5 classification tool? Thanks,
_______
Bin Li
*Description: If you have C4.5 and a network of workstations that are
accessible to you, PC4.5 will help you better use C4.5. PC4.5 offers you
these advantages:
1. It is faster. In an N-trial C4.5 run, a single process builds N
classification trees one by one and then picks the best one. In
PC4.5, the N trials are each handled by a separate process and each
process runs on a different machine (if N or more machines are available).
2. It is fault-tolerant. PC4.5 automatically assigns a process to
a machine if the machine is idle (i.e. no activity by the machine's
owner). If the owner of a machine comes back, or the machine fails during
a PC4.5 computation, the PC4.5 process automatically retreats and
resumes on a different idle machine.
3. It supports multiple platforms. PC4.5 runs on SunOS, Solaris and
Linux machines (for HPUX, IRIX, and ALPHA, please contact author).
Networked multi-platform workstations can run PC4.5 processes of a
single PC4.5 program at the same time.
PC4.5 is built with the Persistent Linda (PLinda) system, a software system
for robust distributed parallel computing developed at New York University.
To get more information on PLinda, please visit our web site at http://merv.cs.nyu.edu:8001/~binli/plinda/
or send email to
plinda@cs.nyu.edu.
Both PC4.5 and PLinda are research efforts led by Professor Dennis Shasha.
Important: You must have the original C4.5 package in order to use PC4.5.
To get C4.5, please contact Dr. J. R. Quinlan (quinlan@cs.su.oz.au).
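As a rough illustration of the parallel-trials idea in point 1 above
(a generic sketch, not PC4.5's actual PLinda-based mechanism, and with a
dummy stand-in for the C4.5 run):

  # Farm N independent trials out to a pool of workers, keep the best.
  from multiprocessing import Pool

  def run_trial(seed):
      """Stand-in for one C4.5 trial; returns (error_rate, model)."""
      import random
      random.seed(seed)
      return random.random(), "tree-%d" % seed   # dummy score and model

  if __name__ == "__main__":
      with Pool() as pool:                       # one worker per CPU
          results = pool.map(run_trial, range(10))
      best_error, best_tree = min(results)
      print("best:", best_tree, "error:", best_error)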
*Discovery tasks: Classification
*Platform(s): Unix (SunOS, Solaris, Linux; please contact author for HPUX,
IRIX, and ALPHA)
*Contact: Bin Li
715 Broadway, Rm 715
New York, NY 10003
(212) 998-3485
email: binli@cs.nyu.edu
(preferred)
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>~~~Positions:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: 17 Jan 1997 12:04:57 +0000
From: 'Ed Babb' (Ed_Babb@parsys.co.uk)
Subject: kdd- job in data mining
OPPORTUNITY IN DATA MINING!
PARSYS is a leading European supplier of parallel systems and technology. They
are currently the lead partner in a large multinational ESPRIT project aimed at
building a parallel data mining file server. Consequently, they are looking for
people interested in data mining systems and with experience of parallel
computers, database technology and machine learning.
The positions involve adapting learning techniques such as rule induction,
neural networks, and genetic algorithms to run on a parallel computer, as
well as helping to adapt an existing database system to run on a parallel
machine. Enthusiasm for producing fast algorithms in C is essential.
At least a 2.1 degree in Computing, Artificial Intelligence or equivalent is
needed. In addition, several years' relevant experience is desirable. Salary
will depend on age and experience.
Please post your CV stating current salary to: Ed Babb, PARSYS LTD, Boundary
House, Boston Road, Hanwell, London, W7 2QE, UK. Alternatively email him on
ed@parsys.co.uk
if you wish to make any brief informal enquiries.
Please see http://www.parsys.com/dafs.htm
for summary of the DAFS project.
*********************************************
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: djb@engr.uark.edu
(BERLEANT DANIEL J)
Date: Tue, 21 Jan 1997 08:20:37 -0600
Subject: POSITION: Tenure Track, Teaching and Research
This is an informal request for inquiries from people interested in
the tenure track position offered by our dept. starting next
September. Feel free to spread the word.
If you are interested in teaching two software related courses per
semester (typically one undergrad, one grad) and in doing research in
empirical NLP, text processing, information retrieval from full text,
data/knowledge mining from full text, etc., AND you have or are completing
a Ph.D. and a formal qualification in engineering (a Bachelor's, Master's,
or Ph.D. degree with the word 'engineering' in it or issued by a
dept., college, campus, or university with the word 'engineering' in
its name, etc.), please email me to discuss applying.
If you don't think you have an engineering degree, check - maybe
you'll be surprised.
I am very interested in promoting applications from people in the
above mentioned areas and look forward to responding forthrightly to
your inquiry.
Best Regards,
Daniel Berleant
Dept. of Computer Systems Engineering
University of Arkansas, Fayetteville
Phone: (501) 575-5590
Fax: (501) 575-5339
Email: djb@engr.uark.edu
>~~~Meetings:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Mon, 20 Jan 97 12:54:56 PST
From: 'Dave Stodder' (dstodder@mfi.com)
To: kdd@gte.com
Subject: Data Mining Summit program
As you know, the 1997 Data Mining Summit is coming up Feb. 18-21 in
San Francisco. The conference is sponsored by Miller Freeman Inc.'s
Database Programming & Design and DBMS magazines.
We have a great lineup of speakers: Usama Fayyad, Evangelos
Simoudis, Kamran Parsaye, Larry Kerschberg, Steven Vere, Gene Feruzza,
and others, including case studies. The complete program is located
at www.dbsummit.com.
I am attaching files of the complete program, if it would be
possible to include it with KDD Nuggets.
Thanks very much,
David
David Stodder
Conference Chair, Data Mining Summit
Editor-in-Chief, Database Programming & Design
411 Borel Ave., Suite 100
San Mateo, CA 94402
(415) 655-4290, Fax (415) 655-4350
Internet: dstodder@mfi.com
Tuesday, February 18
Data Mining and the Internet:
New Dimensions in Knowledge Discovery
Chaired by David Stodder
Editor-in-Chief
Database Programming & Design
Successful application of data mining tools and knowledge discovery
methods can have a tremendous effect on an organization. Combined with the
Internet, data mining explodes into a new world of possibility. Electronic
commerce and other activity will create huge new resources of data that
businesses can mine for greater efficiency and customer service. But perhaps
more importantly, data mining combined with Internet-based applications has
the potential to deliver whole new areas of profitable decision support
services.
This special seminar will focus on the dynamic combination of data mining,
advanced databases, and the Internet. Bringing a series of experts together,
this all-day session will cover key topics, including:
-- Development and use of intelligent software agents
-- How data mining fits with the technology advances made by commercial
search engines and browsers
-- Case studies of organizations that have created effective data mining
applications for Internet customers
-- Developments in heterogeneous database access to enable wider use of
data mining
-- Data mining and knowledge discovery methods that work best for
creating Internet-aware applications
-- Advances in graphics and data visualization that will impact Internet
data mining applications
For the latest news about this seminar, including the scheduled speakers,
please check back with this Web site. The complete program will be in place
in early December.
Wednesday, February 19
8:30 - 9:35
OLAP and Data Mining: Bridging the Gap
Part I
Kamran Parsaye
CEO
Information Discovery Inc.
To date, most observers have viewed data mining and online analytical
processing (OLAP) as separate components of decision support. It has been
difficult to link the two largely because no coherent theory exists upon
which to build a relationship. In this keynote speech, Parsaye will
introduce a unified theory and methodology for OLAP and data mining. He will
describe in detail how the two activities can reinforce each other.
Parsaye will begin by describing the 'dimensions' of decision support and
how data mining activity fits into one of the dimensions. Data mining within
a single dimension is a rough approximation of multidimensional mining.
Parsaye will describe how a lack of attention to dimensionality in data
mining can produce unexpected results reminiscent of the 'lossless join'
problem in the early days of relational databases.
In the second part of his presentation, Parsaye will present a formal
framework for mining OLAP data and will introduce a new set of
multidimensional normalization constructs that allow us to understand OLAP
discovery.
In this session you will learn:
- How OLAP, data mining, and other activities fit together in the four
'spaces,' or dimensions, of decision support
- Limitations of normalization and star schemas for data mining activities
- New structures that go beyond star schemas
- A methodology for applying OLAP data mining, with three distinct processes
of episodic, strategic, and continuous mining for specific user groups
within corporate environments.
Kamran Parsaye is CEO of Information Discovery Inc. He has developed
commercial data mining applications since the mid-1980s. Parsaye has a range
of experience in the software industry both in research and in business, and
has provided guidance to top-level management of leading industrial,
financial, and government organizations. He is coauthor of Intelligent
Database Tools & Applications (John Wiley & Sons, 1993).
9:45 - 10:50
OLAP and Data Mining: Bridging the Gap
Part II
Kamran Parsaye
CEO
Information Discovery Inc.
(For description, see above)
Break 10:50 - 11:10
11:10 - 12:15
Institutionalizing Knowledge Discovery: Creating a New Business Process
Tej Anand
Director of Knowledge Discovery
Human Interface Technology Center
NCR Corp.
Practitioners are slowly beginning to accept that knowledge discovery is
much more than just the application of machine learning or statistical
algorithms to a dataset. Researchers understand that a knowledge discovery
process exists, and they even agree on what basic tasks make up that
process. However, for knowledge discovery to move beyond finding
'interesting trivia' to become a business process akin to marketing, the
details behind the knowledge discovery process must be expounded. Anand will
take the process apart to reveal its details; he will offer practical ideas
for accomplishing business goals through a new understanding of the process.
In this session you will learn:
- Why knowledge discovery is so difficult (contrary to what you might have
heard)
- Why you cannot buy a tool to 'do' knowledge discovery for you
- How process templates can remind the practitioner of tasks he or she must
complete and can provide a framework for making, recording, and auditing
decisions during the knowledge discovery process
- How process guides help the practitioner select data transformation
techniques, interpret data visualizations, select the correct machine
learning or statistical algorithm, and interpret results
- How embedding templates and guides into tools will allow knowledge
discovery to become an institutionalized business process.
Tej Anand is director of the knowledge discovery team at NCR Corp.'s Human
Interface Technology Center. In 1993, he established this business and
technical consulting team to help retail, insurance, consumer packaged
goods, and other commercial enterprises realize business insights hidden in
their operational data. Team members also conduct research and development
to create knowledge discovery processes and data mining tools. Prior to
joining NCR, Anand developed data mining tools for A.C. Nielsen Co. He has
also been a member of the research staff at Philips Laboratories, where he
did research in the area of artificial intelligence software systems.
12:15 - 1:30
Lunch
Track A: Algorithms and Methods
1:30 - 2:35
Data Mining and the KDD Process: Algorithms and Limitations
Part I
Usama Fayyad
Senior Researcher
Microsoft Research
This two-part talk will provide an overview of the rapidly growing area of
knowledge discovery in databases (KDD). Fayyad will define KDD goals,
present motivations guiding the KDD process, and discuss how KDD relates to
data mining. He will then focus on the core data mining methods. These
methods have their origins in statistics, pattern recognition, artificial
intelligence (machine learning), databases, and parallel computing. Fayyad
will explore the limitations and challenges of each major data mining
method. He will break these methods down into classes and will cover a
sampling of algorithms for each class, outlining its advantages and
limitations.
The goal of this two-part presentation is to provide a detailed snapshot of
the current state of data mining methods, how they fit into the KDD process,
and what key challenges developers should be aware of when applying them.
Fayyad will focus primarily on the technical aspects of the algorithms
rather than their use in particular implementations.
In this session you will learn:
- Definitions of KDD and data mining and how the two areas fit together
- Dominant data mining methods used in the field and the specific problems
they address
- Critical limitations and challenges of each method
- How to avoid pitfalls when applying data mining methods.
Usama Fayyad is a senior researcher at Microsoft Research. His interests
include knowledge discovery in large databases, data mining, machine
learning theory and applications, statistical pattern recognition, and
clustering. Before joining Microsoft in 1996, he headed the Machine Learning
Systems Group at the Jet Propulsion Laboratory (JPL), California Institute
of Technology, where he developed data mining systems for automated science
data analysis. He remains affiliated with JPL as a distinguished visiting
scientist. Fayyad received the JPL 1993 Lew Allen Award for Excellence in
Research and the 1994 NASA Exceptional Achievement Medal. He was program
cochair of KDD-94 and KDD-95 (the First International Conference on
Knowledge Discovery and Data Mining). He is general chair of KDD-96, an
editor-in-chief of the journal Data Mining and Knowledge Discovery, and
coeditor of Advances in Knowledge Discovery and Data Mining (MIT Press,
1996).
2:45 - 3:50
Data Mining and the KDD Process: Algorithms and Limitations
Part II
Usama Fayyad
Microsoft Research
(For description, see above)
3:50 - 4:15
Break
4:15 - 5:00
Data Mining: The View from IBM
5:00 - 5:45
Data Mining: The View from Tandem Computers
Track B: Case Studies in Data Mining
1:30 - 2:35
Leveraging Customer Information for Competitive Advantage
Lisa Modisette
Director of Wireless Intelligent Solutions
Lightbridge Inc.
The cellular phone industry today looks much like the credit-card industry
of a few years ago. The market is growing at nearly 50 percent a year but
will reach a saturation point soon- just as the credit card industry has.
'Churn,' or customer attrition, is a growing problem for the maturing
cellular phone industry. In this case study, Modisette will describe how
data mining techniques that worked so well in the credit card industry to
prevent and reverse customer attrition may be applied to the wireless
telecommunications industry.
Modisette will describe how Lightbridge Inc., a wireless communications
provider, has used data mining tools to retain good customers at minimal
cost. Data mining tools make use of existing customer transactional and
demographic data, allowing companies to quickly and easily discover customer
needs. Detailed customer knowledge will enable carriers to prepare for a
more saturated market and to offer new businesses based on that knowledge.
In this session you will learn:
- How Lightbridge uses data mining and churn modeling techniques to combat
customer attrition
- Specific predictive modeling techniques and their effectiveness
- How to get the most out of existing data and acquire a deeper knowledge
of customer behavior.
Lisa Modisette is responsible for the development and marketing of
Lightbridge Inc.'s Wireless Intelligence line of products and services,
designed to provide decision support and database marketing to wireless
carriers. She joined Lightbridge in 1994 and has driven the development of
the new decision-support product line since its inception. Modisette has
experience in identifying customer needs and in creating and maximizing the
use of decision-support systems, database marketing, and customer
segmentation. Modisette also has expertise in OLAP, business intelligence,
database marketing, product management, sales training, and a variety of
information technology. Before joining Lightbridge, she was director of the
telecommunications industry practice at Metaphor Inc., an IBM subsidiary.
She has a B.A. in marketing from the University of Colorado.
2:45 - 3:50
Business Experiences with Data Mining
Evangelos Simoudis
Director of Data Mining Solutions
IBM Corp.
Health care and insurance are two industries that offer interesting
opportunities for data mining applications. In this presentation, Simoudis
will describe how two businesses have developed production data mining
systems. The Health Insurance Commission (HIC), an agency of the Australian
government, processes claims for Australia's Medicare, Medibank Private,
Pharmaceutical Benefits, and Child Care programs. HIC uses data mining to
help reduce costs by ensuring that all medical tests and services are
appropriately prescribed and accurately billed.
John Hancock, an insurance and financial services provider, has a marketing
and services database to support the company's cross-selling efforts and
to accurately identify future customer service requirements. Hancock developed
a survey of 55,000 targeted users; it uses data mining to provide profiles
based on survey results.
In this session you will learn:
-- Case study examples of data mining methods used for reducing costs and
profiling customers
-- The technology/business integration important for data mining success
-- Important processes to ensure accurate results from data mining
Evangelos Simoudis is IBM's director of Data Mining Solutions. Before
joining IBM, Simoudis led Lockheed Corp.'s data mining research, and was
responsible for the commercial introduction and marketing of Lockheed's
Recon data mining system for financial and retail markets. Simoudis also
spent six years as a member of the principal research staff at Digital
Equipment Corp.'s Artificial Intelligence Center. He conducted research on
machine learning, pattern recognition, knowledge-based systems, and
distributed artificial intelligence; Digital has incorporated his research
work in products for engineering design and diagnostics. Simoudis has
written extensively on data mining and machine learning, and is the North
American editor of the Artificial Intelligence Review.
3:50 - 4:15
Break
4:15 - 5:00
Data Mining: The View from Angoss Software
Thursday, February 20
8:30 - 9:35
Keynote Speech
Speaker TBA
9:45 - 10:50
Weaving Detail into the Big Picture
Denise M. Barnhart
Chief, Corporate Analysis Division
Army and Air Force Exchange Service
'There's too much data ... but it's just not enough.' With the continued
growth of very large databases (VLDBs) and the mushrooming need for quick
access to progressively smaller details of the retail business, corporations
risk losing sight of the larger view, the brighter opportunity, or the
insidious trend. The Army and Air Force Exchange Service (AAFES), which
provides $6 billion in goods and services to military servicemen and
servicewomen around the world, has taken on this challenge. In a case study
presentation, Barnhart will describe AAFES's extensive use of massively
parallel analytical processing and data mining. The organization uses this
advanced technology for retail research and integrating analysis results
with operational and strategic processes.
In this session you will learn:
- How AAFES uses neural nets to understand demographics and project market
potential
- Neural net applications that let an organization view data both at the
total business level and at the detailed level of specific items in a retail
store
- How AAFES calculates relationships between retail items and categories
and links these categories to demographic characteristics
- Techniques for the cross-utilization of multiple databases for configuring
retail stores to maximize corporate earnings per square foot
- How to overcome challenges in integrating database patterns into the
corporate strategic vision.
Denise Barnhart is chief of the Corporate Analysis Division, part of the
Army and Air Force Exchange Service's (AAFES's) Strategic Planning
Directorate. AAFES is a profit-generating agency of the Defense Department.
Barnhart joined AAFES in 1976 as a CPA and has since specialized in the
strategic optimization of stores for the benefit of both customer
satisfaction and the bottom line. She was an early proponent of the
day-to-day use of neural nets in planning store construction in the late
'80s. Today, AAFES plans mall sales and earnings levels, store mix, sizing,
and parking requirements entirely with neural net analyses. With the
refinement of retail point-of-sale data in the '90s, Barnhart has extended
corporate strengths in local markets.
10:50 - 11:10
Break
11:10 - 12:15
The Visualization of Large, Complex Datasets
Georges Grinstein
Professor, Institute for Visualization and Perception Research
University of Massachusetts Lowell
Visualization is the translation of data, sampled or generated, into some
perceptual presentation, most typically visual, to provide insights into the
data. It represents the mapping of data into a symbolic representation
useful for researchers, analysts, scientists, and business managers. This
'mapping,' or interaction, can occur at several stages of the visualization
presentation pipeline; it directs the transformations or alters the
presentation of data.
Visualization is no longer simply an application of computer graphics. While
computer graphics remain the underpinning technology of this discipline,
visualization now includes, and must support, databases, real-time
interaction, networking, supercomputing, multimedia, visual programming,
systems theory, and human perception. This development has provided some
very fertile ground for integrating knowledge discovery, statistics, and
visualization.
In this talk Grinstein will highlight key research issues in the
visualization of large, complex informational spaces.
In this session you will learn:
- A brief history of visualization, from initial efforts to extend data
presentation beyond the classic pixel-driven techniques to the current
challenge of encompassing domain knowledge
- How visualization and data mining can work together to provide rich
user-exploration and analysis environments
- How to make astute use of visualization techniques.
Georges Grinstein is a professor of computer science at the University of
Massachusetts Lowell. He also serves as director of the
university's Institute for Visualization and Perception Research and is
principal engineer with MITRE Corp.'s Center for Air Force C3I Systems.
Track A: Algorithms and Methods
1:30 - 2:35
Improving Prediction Performance with Genetic Algorithms
Steven Vere
President
Ultragem Data Mining Co.
Data mining with genetic algorithms is a new technology aimed at improving
prediction performance. However, many of today's commercial data mining
products actually incorporate older machine learning algorithms, such as ID3
and CART. These systems use heuristic algorithms to generate decision rules.
Being heuristic, they do not guarantee the best prediction performance;
in most cases, we now know they do not. Ten years ago, these technologies
represented a good trade-off between prediction performance and training
speed. But in today's high-speed computing environment, it is possible to
use the controlled, brute computational force of genetic algorithms to find
the higher performing prediction rules that heuristic algorithms overlook.
In this presentation Vere will describe techniques for efficiently applying
the genetic algorithm paradigm to large data mining problems.
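As a rough sketch of the paradigm (invented data and a deliberately tiny
search space, not Vere's system): evolve a one-number threshold rule to
maximize training accuracy.

  # Toy genetic algorithm: selection, crossover, and mutation over
  # candidate thresholds for the rule "predict 1 if x > t".
  import random

  random.seed(1)
  data = [(x, 1 if x > 42 else 0) for x in range(100)]  # hidden rule

  def fitness(t):
      return sum((x > t) == bool(y) for x, y in data)

  pop = [random.uniform(0, 100) for _ in range(20)]
  for generation in range(30):
      pop.sort(key=fitness, reverse=True)
      parents = pop[:10]                        # selection
      children = [(random.choice(parents) + random.choice(parents)) / 2
                  + random.gauss(0, 2)          # crossover + mutation
                  for _ in range(10)]
      pop = parents + children
  print(round(max(pop, key=fitness), 1))        # converges near 42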
In this session you will learn:
- The definition and description of genetic algorithms
- Applications of genetic algorithms to data mining and numerical
prediction problems
- How specific techniques, such as averaging the predictions of sets of
genetically generated classifiers, can significantly enhance performance.
Steven Vere is president and founder of Ultragem Data Mining Co., a data
mining consulting company specializing in the commercial application of
evolutionary algorithms. He has over 20 years of experience in machine
learning and artificial intelligence. Vere has served as a member of the
computer science faculty at the University of Illinois, Chicago and has also
held senior technical and management positions at the NASA Jet Propulsion
Laboratory, Lockheed R&D Division, and Bank of America. His work has
appeared in research journals, the AI Encyclopedia, and Scientific American; he
will be featured on a future episode of Beyond 2000, a television
documentary series. Vere holds a Ph.D. in computer science from the
University of California at Los Angeles.
2:45 - 3:50
Data Mining: Finding the Total Business Solution
Gene Feruzza
President, Customer Management Services
Too often, we view data mining as only data visualization, predictive
modeling, or some other specific technique. Although these components are
important, supporting the total business solution requires that we take a
much broader view. In this talk, Feruzza will focus on data mining processes in
real-world applications developed in telecommunications, financial services,
utilities, and online services. He will describe the cyclical nature of
successful data mining, first focusing on the data infrastructure (data mart
or warehouse) and data access and manipulation. Feruzza will then describe
the role, and integration, of modeling processes and technologies, including
rule-based techniques, traditional statistics, neural networks, and genetic
approaches. He will discuss experiences with delivering the knowledge
obtained from the technology to the business user, and how to promote the
strategic integration of technology and business applications.
In this session you will learn:
-- How broad the scope of data mining needs to be in order to be successful
-- Why it's important to embrace and support all modeling technologies, not
just one
-- Solutions to common pitfalls based on data mining experiences
-- Best practices for delivering knowledge gained to the business user
-- Why data mining should be a cyclical, 'living' process.
Gene Feruzza has extensive experience with advanced segmentation techniques
utilizing basic statistics and regression modeling, rule-based segmentation,
neural network modeling, and evolutionary and hybrid modeling architectures.
For 12 years he has provided integrated marketing and
business solutions for clients in telecommunications, electric utilities,
financial services, aerospace, manufacturing, and retail. He has worked for
two leading neural network hardware and software providers (HNC and Neural
Ware) as an instructor and consultant. He has also developed and marketed
his own database management and segmentation software. Feruzza graduated
from the University of Pittsburgh with a BS in computer science and
mathematics.
4:15 - 5:00
Data Mining: The View from NeoVista
7:30 - 9:00
1:30 - 3:00
Birds of a Feather
Breakout Sessions
Success with data mining depends on an intimate knowledge of specific
industry application requirements. After the first Data Mining Summit last
April, we received many requests to include in the program organized
'networking' sessions for attendees to discuss specific industry challenges.
To close out the Second Annual Data Mining Summit, we invite attendees to
join in our special Birds of a Feather sessions, which will focus on data
mining issues faced by specific industries. A vertical industry expert will
lead each discussion group.
Come and share your questions and experiences with other like-minded data
mining practitioners! Depending on popularity, we plan to offer Birds of a
Feather sessions about data mining in the following industries:
- Retailing
- Health care
- Financial services
- Telecommunications
To help us organize the Birds of a Feather sessions ahead of the conference,
please use the registration form to choose which vertical industry session
you would like to attend.
Track B: Case Studies in Data Mining
1:30 - 2:35
Artificial Intelligence and Process-Delay Analysis: A Decision-Tree Case Study
Bob Evans
Member, Advanced Technology Staff
RR Donnelley & Sons Co.
Cylinder wear (called 'banding') causes serious delays in the rotogravure
printing process and has plagued the industry for decades. A process-delay
printing process and has plagued the industry for decades. A process-delay
analysis initiative at RR Donnelley & Sons' Gallatin, Tennessee plant has
reduced the incidence of cylinder banding to near negligible levels. In this
presentation, Evans will describe the Evans-Fisher Process Analysis Model, a
solution driven by decision-tree induction. Through case study examples, he
will describe the use of this powerful artificial intelligence method for
data mining. Evans will also address some of the business and social issues
associated with data collection and analysis.
At RR Donnelley, database technology is the vehicle for solving process
problems. Evans will show how decision-tree induction may be viewed as
automated query generation. Attendees will see examples of queries generated
by this tool. Evans will explain how decision-tree induction guides users
away from the 'blind alleys' that can frustrate data mining efforts.
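As a generic illustration of that view (not the Evans-Fisher model itself;
table and column names are invented), each root-to-leaf path of an induced
tree reads directly as a database query:

  # Turn every root-to-leaf path of a small decision tree into a SQL
  # WHERE clause. A node is (test, subtree_if_true, subtree_if_false);
  # a plain string is a leaf label.
  tree = ("ink_viscosity <= 1.4",
          ("press_speed <= 2200", "LEAF: no banding", "LEAF: banding"),
          "LEAF: banding")

  def paths_to_queries(node, conditions=()):
      if isinstance(node, str):
          print("SELECT * FROM runs WHERE " + " AND ".join(conditions),
                "--", node)
          return
      test, yes, no = node
      paths_to_queries(yes, conditions + (test,))
      paths_to_queries(no, conditions + ("NOT (" + test + ")",))

  paths_to_queries(tree)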
In this session you will learn:
- How to astutely define and collect data for decision-tree induction
- Case study examples of how the Evans-Fisher Process Analysis Model was
developed and applied
- How to use artificial intelligence and data mining to solve complex
industrial problems.
Bob Evans is on the advanced technology staff of RR Donnelley & Sons Co. in
Gallatin, Tennessee. He is also an adjunct assistant professor of computer
science at Volunteer State Community College in Tennessee. A 33-year
employee of RR Donnelley, he is responsible for implementing and upgrading
process-delay analysis using current data mining technology. He has
published several articles and has given presentations on shop-floor
applications of artificial intelligence. Computer scientists frequently cite
his application of decision-tree induction to cylinder bands as a successful
example of the transfer of data mining technology from the research
laboratory to an industrial environment. Evans holds an A.B. degree in
mathematics from Indiana University and a Master of Engineering degree in
computer science from Vanderbilt University.
2:45 - 3:50
Fraud Detection Systems: Combining Data Mining and Machine Learning
Tom Fawcett, Foster Provost
Members of the Technical Staff
Machine Learning Project
NYNEX Science and Technology
In this presentation, Fawcett and Provost will describe a framework that
combines data mining and machine learning techniques to design fraud
detection methods. Fraud detection is based on profiling customer behavior
and checking for anomalies. The domain of this case study is cloning fraud
in cellular telephony, but the methods involved are more widely applicable:
any domain in which fraudulent usage is superimposed upon legitimate usage
(as in credit card fraud) is a candidate. Fawcett and Provost use a
rule-learning program to uncover indicators of fraudulent behavior from a
large database of cellular calls. They will show how they use these
indicators to construct profilers and how their system combines evidence
from multiple profilers to generate high-confidence alarms.
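As a hedged sketch of the evidence-combination idea (invented profilers,
weights, and threshold; not the NYNEX system itself):

  # Several profilers each check one account-day for anomalous usage;
  # a weighted sum of the fired indicators triggers an alarm.
  profilers = {
      "night_minutes": lambda day: day["night_min"] > 90,
      "new_cell_sites": lambda day: day["new_sites"] >= 3,
      "intl_calls": lambda day: day["intl"] > 5,
  }
  weights = {"night_minutes": 0.5, "new_cell_sites": 0.3, "intl_calls": 0.4}
  THRESHOLD = 0.6   # set by the false-alarm vs. missed-fraud trade-off

  def score(day):
      return sum(w for name, w in weights.items() if profilers[name](day))

  day = {"night_min": 120, "new_sites": 4, "intl": 1}   # invented usage
  print(score(day), "ALARM" if score(day) >= THRESHOLD else "ok")  # 0.8 ALARM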
In this session you will learn:
- How to create a profitable synergy of data mining and machine learning
- How to address the intricacies of building data mining systems under
real-world constraints
- Complications that arise when trying to assign cost/benefit trade-offs
(the cost of handling a false alarm differs from the cost of missing
fraudulent usage, which varies among fraud cases).
Tom Fawcett works in machine learning, data mining, and knowledge-based
systems. He has worked at NYNEX Science & Technology, GTE Laboratories, and
MITRE Corp. Fawcett holds a Ph.D. from the University of Massachusetts at
Amherst. While at GTE, his machine-learning system was used for automated
adaptation in telecommunications network management. He developed and
maintained a large knowledge-based mission planning system for MITRE.
Fawcett has published articles addressing the representation problem in
machine learning and has done research in case-based reasoning.
Foster Provost works on machine learning and data mining at NYNEX Science
and Technology, where, in addition to developing methods for the automated
design of fraud detection systems, he has also made advances by combining
data mining techniques with decision-analytic techniques for cost-effective
technician dispatch. Prior to joining NYNEX, Provost worked on data mining
in scientific domains, including botanical toxicology, high-energy physics,
and infant mortality. His work produced advances in rule learning, scaling
up machine learning methods to large databases, using background knowledge
to guide learning, and selecting inductive bias. Provost holds a Ph.D. from
the University of Pittsburgh, where he held IBM and Mellon graduate
fellowships. He received a B.S. in physics and mathematics from Duquesne
University. He is a recent recipient of NYNEX's President's Award.
4:15 - 5:00
Data Mining: The View from DataMind
7:30 - 9:00
Birds of a Feather
1:30 - 3:00
(For description, see above)
Friday, February 21
8:30 - 9:35
Data Mining 1997/98: Key Trends & Market Perspectives
Aaron Zornes
Executive Vice President and ADS Service Director
Meta Group
Although the data mining market garnered less than $100 million in 1996,
industry analysts at Meta Group forecast the market will explode to more
than $800 million by the year 2000. During 2Q96, Meta Group surveyed 250+
Global 2000-size business users of data mining products and services in
retailing, healthcare, financial services, and telecommunications. This
presentation will highlight key survey findings regarding adoption criteria,
timelines, technical parameters, and leading business applications. Meta
Group's study investigated not only the traditional uses of data mining
technology, such as fraud prevention and credit card authorization within
the financial services industry, but also the rapidly emerging
requirements stemming from data warehouse implementations and Web-enabled
commerce and marketing.
In this session you will learn:
- How to interpret early user adoption rates by industry segments
- What will be the impact of emerging systems integrators and data bureaus
- What's behind current data quality, data warehouse, and data
visualization trends
Aaron Zornes is executive vice president and ADS service director for Meta
Group. He is a leading authority on the software industry as it relates to
applications development and delivery, especially data warehousing and
second-generation multitier client/server applications. Zornes has devoted
more than 20 years to line and strategic management roles in leading vendor
and user organizations, including executive and managerial positions at
Ingres Corp., Wang Laboratories Inc., Software AG of North America, and
Cincom Systems Inc. He is a frequent author and keynote speaker on data
warehousing, data mining, advanced client/server tools, and customer-centric
application architectures. Since 1992, he has been conference chair of DCI's
Data Warehouse World conference series.
9:45 - 10:50
Knowledge Rovers: Configurable Agents to Support Enterprise Information
Infrastructures
Larry Kerschberg
Professor and Chair, Information and Software Systems Engineering
School of Information Technology and Engineering
George Mason University
Knowledge rovers represent a family of cooperating intelligent agents that
can support a collection of scenarios, decision-makers, and tasks. These
rovers play specific roles within the enterprise information infrastructure
to support users, maintain complex views, and mine and refine data into
knowledge. Rovers can roam the Internet, seeking, locating, negotiating for,
and retrieving data and knowledge specific to their mission.
For decision-makers to make appropriate use of information, the current
flood of data must be filtered and transformed. In this presentation,
Kerschberg will describe knowledge rovers and the data mining and software
agent technology that creates them. He will highlight important rovers and
how they fit into data warehouse, data mine, and data mart architectures.
Kerschberg will describe Field Agent rovers that discover new resources,
collect data, and bring back information; Information Curator rovers that
refine data into knowledge and place it in an information repository; and
Domain Servers that, from within the repository, facilitate access to
multiple data types, such as images, text, formatted data, and simulation
data related to a particular domain. Finally, Kerschberg will discuss
Sentinel rovers that monitor Domain Servers for interesting events,
patterns, and specified conditions to alert decision-makers and take
actions on their behalf.
In this session you will learn:
- The role of intelligent agents in supporting enterprise information
architectures
- How to integrate a family of configurable rovers for discovery,
integration, and evolution of information
- The interrelationship among concepts such as data warehouses, data mines,
and information repositories in the enterprise information infrastructure
- The concept of virtual data mines and data mining over multiple
heterogeneous data sources.
Larry Kerschberg is professor and chair of the Department of Information and
Software Systems Engineering in the School of Information Technology and
Engineering at George Mason University in Virginia. He is also director of
the university's Center for Information Systems Integration and Evolution.
His research focuses on intelligent agents, intelligent information
integration, data mining and knowledge discovery in databases, and expert
database systems. His research is funded in part by DARPA. Kerschberg is
also President of KRM Inc., which pursues research and development in
knowledge rovers and mediators in intelligent information systems. He is
editor-in-chief of the International Journal of Intelligent Information
Systems, published by Kluwer Academic Publishing Co. Kerschberg organized
and has served as program chair of the First and Second International
Conferences on Expert Database Systems. He holds a Ph.D. in engineering from
Case Western Reserve University.
10:50 - 11:10
Break
11:10 - 12:15
Privacy Issues and Data Mining
Panel Session Chaired by
David Stodder,
Editor-in-Chief,
Database Programming & Design
Data mining tools, when combined with large, sophisticated databases,
already offer businesses and other organizations powerful new abilities to
learn more about clients, customers, citizens, and taxpayers. The Internet
and Web-enabled commerce will create vast sources of data and new ways to
package information databases as products and services. Privacy and security
specialists are becoming increasingly concerned that basic privacy rights
could be trampled in the race to provide modern, intelligent information
services. Businesses must take new security measures to protect proprietary
data, and learn how to resolve the tug-of-war with competitors and service
contractors over just who owns the data.
This panel session will feature a selection of experienced users, security
experts, and data mining professionals, who will focus on privacy and
security concerns that broadly affect the practice of data mining. The panel
will discuss what measures governments and business are taking, and should
take, with regard to data mining and the development of new information
services.
David Stodder is editor-in-chief of Database Programming & Design. He has
been with the publication since its inception in 1987. He has served on the
advisory board of several industry conferences, including IDUG North
America, DCI's Database and Client/Server World, and Blenheim/NDN's DB/Expo.
He is also chair of Miller Freeman Inc.'s VLDB Summit, Object/Relational
Summit, and Business Rules Summit conferences.