--
Data Mining and Knowledge Discovery community, focusing on the
latest research and applications.
Submissions are most welcome and should be emailed, with a
DESCRIPTIVE subject line (and a URL) to gps.
Please keep CFP and meetings announcements short and provide
a URL for details.
KD Nuggets frequency is 3-4 times a month.
Back issues of KD Nuggets, a catalog of data mining tools
('Siftware'), pointers to Data Mining Companies, Relevant Websites,
Meetings, and more are available at the Knowledge Discovery Mine site
at
********************* Official disclaimer ***************************
All opinions expressed herein are those of the contributors and not
necessarily of their respective employers (or of KD Nuggets)
*********************************************************************
~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
'hey man of science with your perfect rules of measure,
can you improve this place with the data that you gather?'
from the song 'I Want To Conquer The World' (Gurewitz)
in the album 'No Control' by Bad Religion.
(thanks to Pier Luca Lanzi (lanzi@elet.polimi.it))
Date: Thu, 7 Aug 1997
From: Gregory Piatetsky-Shapiro (gps@kstream.com)
Subject: ComputerWorld on Data Mining in a vicious circle
In a June 30, 1997 ComputerWorld article, Michael Schrage
argues that Data Mining runs the risk of not generating anything
useful, except for demand for more data mining.
Here is an excerpt:
Michael Schrage:
...
Think about it. Data mining virtually guarantees full employment, gainful or
otherwise, for MIS folks and MBAnalysts. Consider this perfectly plausible
scenario: A global company commits to using the latest and greatest
data mining algorithms to identify significant correlations in the
customer service and profitability arenas. Ninety days later, the
miners unearth no fewer than 23 statistically significant patterns. An
even dozen of them are potentially actionable.
What do you think the rational, well-managed organization will do? The answer
is obvious: It will try to identify the underlying dynamics of those
correlates. That means the organization has to go out and gather more
data and then process it into information. To mix metaphors, the
fruits of data mining literally become the grist for the information
mills of the enterprise. It's a vicious circle: Data
mining insights demand more data that must be mined for confirmation that
becomes part of future databases to be quarried. Such a deal!
...
I agree that there is such a danger, but we, the data miners, should
strive to produce some tangible results, or the well-managed
organization will notice quickly that the king is naked (to quote
from a well-known children's story).
Date: Thu, 31 Jul 1997 15:58:15 +0200 (MET DST)
From: Rob Engels (ren@aifb.uni-karlsruhe.de)
Subject: ICML97 'Workshop on ML applications' summary available
Hello everybody,
If you attended our workshop, we once again thank you for your
presence; otherwise we hope that you will find the proceedings and
workshop summary helpful for your research. We look back on a
very stimulating and interesting workshop with good discussions.
Date: Tue, 5 Aug 1997 14:43:22 -0700 (PDT)
From: 'John R. Koza' (koza@CS.Stanford.EDU)
Subject: Evolvable Hardware and GP
PAPER NOW AVAILABLE IN POST SCRIPT...
'Rapidly reconfigurable field-programmable gate arrays for
accelerating fitness evaluation in genetic programming'
A late-breaking paper from the GP-97 conference.
ABSTRACT:
The dominant component of the computational burden of
solving non-trivial problems with evolutionary algorithms is the
task of measuring the fitness of each individual in each
generation of the evolving population. The advent of rapidly
reconfigurable field-programmable gate arrays (FPGAs) and the
idea of evolvable hardware open the possibility of embodying
each individual of the evolving population into hardware for the
purpose of accelerating the time-consuming fitness evaluation
task. This paper demonstrates how the massive parallelism of the
rapidly reconfigurable Xilinx XC6216 FPGA can be exploited to
accelerate the computationally burdensome fitness evaluation
task of genetic programming. The work was done on Virtual
Computing Corporation's low-cost HOTS expansion board for
PC type computers. A 16-step 7-sorter was evolved that has two
fewer steps than the sorting network described in the 1962
O'Connor and Nelson patent on sorting networks and that has
the same number of steps as the minimal 7-sorter that was
devised by Floyd and Knuth subsequent to the patent.
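
To make the fitness-evaluation burden concrete, here is a minimal
software sketch in Python. It is not the authors' FPGA implementation.
It scores a candidate comparator network using the zero-one principle:
a network sorts every input if and only if it sorts all 2^n binary
inputs, so a 7-sorter needs 128 test cases per individual. The function
names, the 4-input example network and the bubble-style network are
illustrative assumptions; the evolved 16-step 7-sorter itself is not
reproduced here.

```python
from itertools import product

def apply_network(network, values):
    """Run a comparator network; each (i, j) leaves the smaller value at index i."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

def fitness(network, n):
    """Count how many of the 2**n binary inputs the network sorts correctly."""
    hits = 0
    for bits in product((0, 1), repeat=n):
        if apply_network(network, bits) == sorted(bits):
            hits += 1
    return hits

def is_sorter(network, n):
    """Zero-one principle: sorting every binary input implies sorting every input."""
    return fitness(network, n) == 2 ** n

# A well-known 5-comparator network for 4 inputs (illustrative only).
FOUR_SORTER = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def bubble_network(n):
    """A simple, non-minimal n-input sorter with n*(n-1)/2 comparators."""
    return [(j, j + 1) for end in range(n - 1, 0, -1) for j in range(end)]
```

For 7 inputs, bubble_network(7) uses 21 comparators, which gives a sense
of why a 16-step network is a notable result. In an evolutionary run,
fitness() would be called once per individual per generation, which is
the cost the paper moves into hardware.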
John R. Koza
Forrest H Bennett III
Jeffrey L. Hutchings
Stephen L. Bade
Martin A. Keane
David Andre
Published in
Koza, John R. (editor). Late Breaking Papers at the Genetic
Programming 1997 Conference, Stanford University, July 13-16,
1997. Stanford, CA: Stanford University Bookstore. Pages 121-131.
John R. Koza
Computer Science Department
258 Gates Building
Mail Code 9020
Stanford University
Stanford, California 94305 USA
E-MAIL: Koza@CS.Stanford.Edu
Office Phone: 650-723-1517 (Note new area code of 650)
Home Phone: 650-941-0336
Fax: 650-941-9430
WWW:
Date: Mon, 21 Jul 97 16:45:12 EDT
From: 'Se June Hong (8-862-2265)' (HONG@watson.ibm.com)
Subject: Special issue of FGCS on Data Mining
GUEST EDITORIAL: Data Mining
Se June Hong
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
The ever-increasing quantity of data in every computing environment
presents both an opportunity to extract useful information and a
challenge to process the massive volume of data effectively.
Analyzing and generating models from data used to be in the domain of
classical statistics. During the past few decades the
pattern-recognition and machine-learning communities have greatly
expanded their areas of application and the kind of information to be
extracted, as well as the variety of models. The database community
joined the endeavor in the early 90s and a new multi-disciplinary
field began, which we now call data mining. The term KDD (Knowledge
Discovery in Databases) refers to a broader process of collecting and
cleansing the data, extracting the useful information (data mining),
and presenting and embedding the information in a decision support
application.
This new field is growing vigorously, due in large part to the
increasing awareness of the potential competitive business advantage
of using such information. Important knowledge has been extracted
from massive scientific data as well. Numerous conferences and
journals are addressing data mining issues or specializing in them.
And since data mining emphasizes the ability to deal with massive
data, high performance algorithms, parallel computation and effective
access to disk resident data (a concern of large database systems) all
become more relevant and essential: it is timely to introduce this
field to the readership of FGCS.
What is useful information depends on the application. Of course,
each record in a data warehouse full of data is useful for daily
operations, as in on-line transaction business, and for traditional
database queries. Data mining is concerned with extracting more
global information that is generally the property of the data as a
whole. Thus the diverse goals of data mining algorithms include
clustering the data items into groups of 'similar' items, finding an
explanatory or predictive model for a target attribute in terms of
other attributes, finding frequent patterns and sub-patterns that
co-occur with an associated sub-pattern, and finding trends,
deviations, and 'interesting' relations between the attributes. In
this special issue, we address the three most common data mining
tasks: Clustering, modelling, and finding frequent association
patterns of items. These are also the areas that are most readily
used in decision support applications.
The first paper on the promise and challenges by Fayyad and Stolorz is
a perspective introduction to this special issue based on their
pioneering personal experience. The last paper by Uthurusamy,
Soparkar, Szaro and Dunkel on the systems aspects of data mining
concludes this special issue with a reality check based on
considerations for the practical use of data mining techniques.
In the second paper, Hosking, Pednault and Sudan discuss the evolution
of statistical insights on modelling from the classical approaches
(mostly parameter fitting to a given model family) to the new
statistical learning theory (based on VC dimension) and computational
learning theory. Statistical learning theory deals with the trade-off
between the complexity of the model and the defined loss function of
the prediction such that selection of an appropriate model family can
be an integral part of model construction. Computational learning
theory identifies the learning tasks that can be PAC (probably
approximately correct) learnable with given computational complexity.
These new insights are beginning to be adapted to common model
families such as rules, trees and neural networks.
The next two papers deal with clustering problems, also known as
unsupervised learning. Customer segmentation is a widely recognized
application area in business. Since grouping 'similar' data elements
together begs the question of the purpose for which they are similar,
there are many clustering approaches, which depend on various notions
of the similarity between data elements expressed in terms of the
attribute values associated with the data elements. Michaud discusses
these techniques and argues for a relatively new approach based on the
theory of voting (i.e. each attribute votes that same-valued elements
belong in the same cluster). Evaluating the resultant clusters is
difficult in the absence of a formally defined purpose of the
application. Zait and Messatfa present a benchmark-style comparison
of some major clustering techniques using artificially generated data,
which gives some idea as to what computing resources they require and
how they behave in different clustering situations.
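
The voting idea can be sketched in a few lines of Python. This is a
simplified illustration under stated assumptions, not the method of the
paper: each attribute casts one vote per pair of records ("together" if
the values match, "apart" otherwise), a partition is scored by the
number of votes it satisfies, and a single-pass greedy heuristic builds
the clusters. The function names and the toy records are hypothetical.

```python
def agreement(a, b):
    """Number of attributes on which two records take the same value."""
    return sum(x == y for x, y in zip(a, b))

def greedy_vote_clusters(records):
    """Single pass: join the cluster with the best positive net vote, else start a new one."""
    m = len(records[0])
    clusters = []  # each cluster is a list of record indices
    for idx, rec in enumerate(records):
        best, best_net = None, 0
        for cluster in clusters:
            # Net vote = votes for "together" minus votes for "apart", summed over members.
            net = sum(2 * agreement(rec, records[k]) - m for k in cluster)
            if net > best_net:
                best, best_net = cluster, net
        if best is None:
            clusters.append([idx])
        else:
            best.append(idx)
    return clusters

def vote_score(records, clusters):
    """Total number of attribute votes satisfied by the partition, over all record pairs."""
    m = len(records[0])
    label = {k: c for c, cluster in enumerate(clusters) for k in cluster}
    score = 0
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            agree = agreement(records[i], records[j])
            score += agree if label[i] == label[j] else m - agree
    return score

# Made-up categorical records, for illustration only.
RECORDS = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
]
```

Note that the number of clusters is not fixed in advance; it falls out
of the votes. The greedy pass depends on record order and is only a
heuristic for maximizing vote_score().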
In the next paper, Srikant and Agrawal introduce association rules and
present a new technique for finding generalized association rules.
Given a large quantity of point-of-sale (POS) data, for instance, one
would naturally like to know what items are frequently sold together
(frequent item set) and, among them, what subset implies the remaining
subset with high confidence level (association rules). This kind of
information is even more useful when the POS data is augmented with a
taxonomy hierarchy (e.g., jackets and ski pants are outerwear;
outerwear and shirts are clothes; shoes, sneakers and boots are
footwear). One can then automatically find relations between classes
of items, e.g. that most of the time that clothes are sold, some
footwear is also sold. This is an example of a generalized association
rule. In practice, the problem of finding such frequently occurring
patterns requires that gigabytes of data be processed efficiently.
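
As a rough illustration of the quantities involved, the Python sketch
below computes support and confidence over baskets that have been
extended with their taxonomy ancestors. It only shows what a
generalized rule such as "clothes => footwear" measures; it is not the
mining technique of the paper, which is concerned with finding such
rules efficiently. The taxonomy follows the example in the text, while
the baskets and function names are made up for illustration.

```python
# Taxonomy from the example in the text: item -> parent class.
PARENT = {
    "jacket": "outerwear", "ski pants": "outerwear",
    "outerwear": "clothes", "shirt": "clothes",
    "shoes": "footwear", "sneakers": "footwear", "boots": "footwear",
}

def extend(transaction):
    """Add every ancestor of every item, so rules can relate classes of items."""
    items = set(transaction)
    for item in transaction:
        while item in PARENT:
            item = PARENT[item]
            items.add(item)
    return items

def support(itemset, transactions):
    """Fraction of extended transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= extend(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Made-up point-of-sale baskets, for illustration only.
BASKETS = [
    ["jacket", "boots"],
    ["shirt", "sneakers"],
    ["ski pants", "boots"],
    ["shirt"],
    ["shoes"],
]
```

With these baskets, no single leaf-level pair such as jacket and boots
appears in more than one basket, yet the class-level pair clothes and
footwear appears in three of five. That is the kind of pattern a
taxonomy makes visible.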
The next set of three papers addresses classification and regression
problems. Kononenko and Hong discuss the need for selecting essential
attributes, various measures of the strength of an attribute for
modelling purposes, and ways to select attributes for a given
modelling situation. Apte and Weiss discuss key ideas for generating
rules and trees, perhaps the most popular model family for
classification and regression. Craven and Shavlik present another
popular model family, neural networks, with particular emphasis on
generating 'understandable' rules using a neural-network approach.
Neural networks are well established for predictive modelling in many
areas of application, but lack of human comprehensibility of the
networks made them unsuitable in some, and these rules may complement
and give an insight into the underlying neural network model.
Although the impetus for the birth of data mining came mainly from the
rapidly increasing size of current databases and commercial interest
in utilizing the information hidden in them, data mining is concerned
with issues broader than just dealing with large volumes of data. In
the short history of data mining, it has already been shown that
synergy between different disciplines has been fruitful in advancing
the art of extracting useful information from data. Data mining
algorithms must scale up to handle large quantities of data, but that
is not the same as insisting that we throw away sampling techniques
and algorithms that are not linear in the number of examples, if the
utility of the results can be improved by using them. There are some
applications where the comprehensibility of the extracted model is of
prime importance, but this does not preclude the usefulness of more
accurate models that may not be easily 'understandable' by an end
user. Data mining is all-of-the-above in these respects as well. It
is an important core area within the larger framework of the KDD
process, and it in turn challenges the future generation of computing
techniques and computer systems. Further background on data mining
can be found in ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING,
U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy Editors,
AAAI Press / The MIT Press, 1996. The KDNUGGETS web site
is an excellent source of news on data
mining and KDD.
My editorial goal for this special issue was to cover and introduce
the key ideas of data mining in a balanced perspective. These are not
review papers; accordingly, authors were strongly urged to restrict
the references to those that are essential for the ideas conveyed and
also those that point to further references. I solicited authors who
have contributed new techniques in their respective areas, and asked
for papers that offer new insights rather than new techniques. I
underestimated the time needed to prepare such a paper by very active
and busy authors by more than six months. I thank the authors for the
quality they delivered. And I thank the FGCS editor-in-chief, Prof.
L.O. Hertzberger, and the editorial staff for their encouragement for
the idea of a special issue on data mining and for their patience.
Finally, I would like to express my gratitude to Dr. J. Hosking for
the cheerful help he provided with many editorial tasks.
--- also one of the abstracts for the FGCS special issue (nuggets 97:23)
was missing some lines. Here is the correct version:
FGCS (Future Generation Computer Systems), volume 13, Number 2, Oct 1997
Special Issue on Data Mining
Attribute Selection for Modelling
I. Kononenko, S.J. Hong
Modelling a target attribute by other attributes in the data is
perhaps the most traditional data mining task. When there are many
attributes in the data, one needs to know which of the attribute(s)
are relevant for modelling the target, either as a group or the one
feature that is most appropriate to select within the model
construction process in progress. There are many approaches for
selecting the attribute(s) in machine learning. We examine various
important concepts and approaches that are used for this purpose and
contrast their strengths. Discretization of numeric attributes is
also discussed, for its use is prevalent in many modelling techniques.
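
For readers new to the topic, the Python sketch below shows one common
attribute-strength measure, information gain, together with a simple
binary discretization of a numeric attribute. It is offered only as an
illustration of the kind of measure the abstract refers to; the paper
surveys several measures and this is not claimed to be the one it
recommends. The function names, the cut point and the toy data are
assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in target entropy obtained by splitting on a discrete attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def discretize(values, threshold):
    """Binary discretization of a numeric attribute at a chosen cut point."""
    return ["low" if v <= threshold else "high" for v in values]

# Toy data, for illustration only.
TARGET = ["yes", "yes", "no", "no"]
RELEVANT = ["a", "a", "b", "b"]      # perfectly predicts the target
IRRELEVANT = ["x", "y", "x", "y"]    # carries no information about it
NUMERIC = [1.0, 2.0, 8.0, 9.0]       # useful once cut at, say, 5.0
```

Ranking attributes by information_gain() scores each attribute on its
own. It does not capture attributes that are only informative as a
group, which is one reason more than one measure is in use.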
MineSet 2.0 was announced and demonstrated at DCI on July 29 1997.
It will also be shown at the KDD conference Aug 14-17.
MineSet is a fully integrated, comprehensive suite of easy-to-use
analytic and visual data mining tools. To further revolutionize the
corporate decision support process, MineSet 2.0 provides powerful new
visual, analytical, statistical and Web launching tools for a wide
range of business analysis, reporting and planning applications across
the enterprise.
The new MineSet version seamlessly integrates data access, data
transformation, and analytic and 3D visual data mining. It also
supports direct access to Oracle(R), Informix(R) and Sybase(R)
databases as well as flat files.
New major features in MineSet 2.0 include:
* A new visual tool that enables users to display scatterplots for
data sets with a large number of points
* A new drill-through function that allows direct data selection
from visualization as input for further analysis
* New analytic capabilities that considerably expand the scope of
applicability of MineSet such as record scoring and weighting, loss
matrices, lift curves and learning curves
* A new classification model, Option Trees, that can dramatically
improve the accuracy and understanding of resulting models
* New data handling facilities for improved performance such as
record sampling and binary file management
* A new statistical reporting tool that can be used for any data
subset
* A new utility that allows for exchange of files between MineSet
and SAS
MineSet 2.0 will be available in August from Silicon Graphics and
its distribution channels worldwide. MineSet is available on all
Silicon Graphics systems running the IRIX(TM) 6.2 operating system or
higher.
MineSet functionality can also be made available on PCs running
Hummingbird Communication's Exceed 3D and other UNIX(R) X servers
supporting the industry-standard OpenGL(R). Evaluation copies
of Hummingbird's Exceed 3D will be shipping with MineSet 2.0. See
Date: Thu, 31 Jul 1997 20:17:06 -0500
From: 'J.P.Brown' (jpbrown@hal-pc.org)
Subject: Some results from KDD Database on Miles per Gallon.
Struggling with Windows NT, I forgot to let people know that I had put
some results of SuperInduction analysis of a KDD Database on my website
. There have been quite a few hits in the
interim, so it has not been ignored. Please note that I have included a
URL for the Complete Version in the Basic Version.
Some of the concepts presented have just emerged from the chrysalis. To
me, a big part of the charm of KDD is that new ideas can be submitted
without having to go through the meat-grinder of peer review.
From: 'Brian W. Bush' (bwb@lanl.gov)
Subject: JOB: Graduate Research Assistant at Los Alamos Nat'l Lab
Date: Tue, 5 Aug 1997 10:29:53 -0600
The TRANSIMS (TRansportation ANalysis SIMulation System) project
at Los Alamos National Laboratory (LANL) is looking for a graduate
research assistant familiar with one or more of the following
fields: information theory, symbol dynamics, data sampling,
statistics, data mining, or pattern recognition/classification
techniques. We have an ongoing research effort to develop methods
for tracing the flow of information in simulations, for evaluating/
comparing simulation sampling procedures, and for extracting
traffic features (such as jams, incidents, flows, phase failures,
etc.) from simulation output data. Experience with C++ or object-
oriented programming would also be valuable.
This is a one-year appointment, with the possibility of renewal for
additional years. Applicants must meet the LANL GRA program
eligibility requirements (see
or inquire at mailto:progsinfo@lanl.gov).
Los Alamos National
Laboratory, an equal opportunity employer, is operated by the
University of California for the U.S. Department of Energy. Please
send resumes to:
Brian W. Bush
Energy and Environmental Analysis Group
TSA-4, Mail Stop F604
Los Alamos National Laboratory
Los Alamos, NM 87545 USA
505-667-6485 (voice)
505-665-5125 (fax)
mailto:bwb@lanl.gov
(email)
Wednesday 25th March - Friday 27th March 1998, London, UK
PADD98 - The Second International Conference and Exhibition on the
Practical Application of Knowledge Discovery and Data Mining is a new
conference that aims to demonstrate the use of this key technology for
solving real-world problems in business, industry, and commerce.
PADD98 will provide a rich blend of tutorials, invited talks, refereed
papers, panel discussions, a poster session, social agenda and a full
industrial exhibition. The result is an ideal forum for the exchange
of ideas and knowledge, between experts from a broad spectrum of
industries and technologies.
Call for Participation
Vast amounts of data are being collected by organisations. KDD
techniques are used to extract and transform hidden information into
valuable knowledge through the discovery of relationships and
patterns. Business processes are improved and solutions found to
problems.
The latest research suggests that firms who invest in setting up a
data warehouse and the software to mine it can expect a high return on
their investment. As databases begin to permeate virtually all
aspects of information storage, from e-mail systems to web servers, we
expect this return to increase dramatically.
KDD is quickly being recognized as an essential business intelligence
tool, a necessary ingredient in discovering the information
needed to improve a company's market presence and differentiate
its products and services in today's global marketplace.
With the rapid advance in data capture, transmission and storage,
large systems users will increasingly need to implement new and
innovative ways to use the knowledge hidden in their data. A wealth
of potential business opportunities are available through the use of
this technology.
PA EXPO98
PADD will form part of a five day Practical Application Expo which
will also include:
* PAP/PACT98 - Incorporating The Practical Application of Prolog
and The Practical Application of Constraint Technology
* PAAM98 - The Practical Application of Intelligent Agent and
Multi Agent Technology
* PAKeM98 - The Practical Application of Knowledge Management
You are invited to register your interest for PADD98 by completing the
reply form (see the web site)
Call for Papers
We invite you to submit a paper or industrial report describing fielded
applications which exploit KDD technology and which emphasize the following
aspects:
* Actual business benefits and business problems addressed
* Either innovative KD and DM techniques applied to standard domains
or significant new applications of standard techniques
* Issues and methods of resolution to get the application
implemented and deployed
* Why KDD was appropriate
* How benefits are measured
Papers can be of any length, up to a maximum of twenty pages, and on
virtually any KDD related topic.
Call for Exhibitors
The conference also provides an opportunity for software vendors and
developers to demonstrate KDD systems. You are invited to contact the
organiser to arrange for your application to be exhibited at the event.
Dates:
Submission Deadline: December 5th, 1997
Notification: January 12th, 1998
Final Papers due: February 13th, 1998
Invited Speakers:
Jiawei Han (ACSys Keynote Speaker, Simon Fraser University)
Chris Wallace (Monash University)
The Second Pacific-Asia Conference on Knowledge Discovery and Data
Mining (PAKDD-98) will provide an international forum for the sharing
of original research results and practical development experiences
among researchers and application developers from different KDD
related areas such as machine learning, databases, statistics,
knowledge acquisition, data visualization, software re-engineering,
and knowledge-based systems. It will follow the success of PAKDD-97
held in Singapore in 1997 by bringing together participants from
universities, industry and government.
*************** I m p o r t a n t D a t e s ***************
* 4 copies of full papers received by: October 16, 1997 *
* acceptance notices: December 22, 1997 *
* final camera-readies due by: January 30, 1998 *
*************************************************************
------------------------------------------------------------------------------
FINAL PROGRAM
MONDAY, 15 SEP 1997
9:50 Welcome
10:00 - 13:00
Saso Dzeroski and Nada Lavrac:
Introduction to ILP and its applications
(Coffee break 11:20 - 11:40)
13:00 - 14:30 Lunch
14:30 - 18:00
Stefan Wrobel and Tamasz Horvath:
ILP for KDD
Hands-on exercises with KEPLER, an integrated KDD tool
(Coffee break 16:00 - 16:30)
18:15 Meeting of the End-user-club of the
Inductive Logic Programming II Project
TUESDAY, 16 SEP 1997
09:00 - 12:30
Stephen Muggleton and Ashwin Srinivasan:
Explanatory ILP and its applications
Hands-on exercises with PROGOL
(Coffee break 10:30 - 11:00)
12:30 - 14:00 Lunch
14:30 - 18:00
Luc De Raedt, Hendrik Blockeel, Luc Dehaspe, and Wim Van Laer:
Descriptive ILP and its applications
Hands-on exercises with CLAUDIEN, ICL, and TILDE
(Coffee break 16:00 - 16:30)
WEDNESDAY, 17 SEP 1997
09:00 - 12:30
Usama Fayyad: KDD and Data Mining - Overview and Methods
(Coffee break 10:30 - 11:00)