
Data Mining and Knowledge Discovery Nuggets 96:28, e-mailed 96-09-05

News:
* Y. Kodratoff, Report on KDD-96
* A. Tickle, Occam's Razor and data mining
* L. Mazlack, Berkeley Special Interest Group In Database Mining
* M. Smyth, Intensive Tutorial: Learning Methods for Prediction,
  Classification,
  http://www.ai.mit.edu/projects/cbcl/web-pis/jordan/course/index.html
Positions:
* D. Wolpert, Job openings at IBM net.Mining (Almaden)
Meetings:
* B. Gaines, CFP: AAAI Spring Symposium, AI in Knowledge Management,
Stanford University, March 24-26, 1997,
http://ksi.cpsc.ucalgary.ca/AIKM97/


--
Nuggets is a newsletter of the Data Mining and Knowledge
Discovery community, focusing on the latest research and applications.

Contributions are most welcome and should be emailed,
with a DESCRIPTIVE subject line (and a URL, when available) to (kdd@gte.com).
E-mail add/delete requests to (kdd-request@gte.com).

Nuggets frequency is approximately weekly.
Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
and a wealth of other information on Data Mining and Knowledge Discovery
are available at the Knowledge Discovery Mine site, URL http://info.gte.com/~kdd.

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories) *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I am not young enough to know everything.
--James Matthew Barrie
(today is the beginning of the school year in the US)

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Mon, 26 Aug 1996 12:46:17 +0200
From: Yves.Kodratoff@lri.fr (Yves Kodratoff)
Subject: report on KDD96

[see http://www-aig.jpl.nasa.gov/kdd96/ page for KDD conference,
and http://www.aaai.org/Publications/Press/Catalog/KDD/han.html page
for proceedings. GPS]

A report on the KDD'96 conference
Portland, OR, Aug. 1996

Yves Kodratoff

Let us begin with some figures. With over 500 attendees, KDD'96 confirms
the growing interest in the field. Out of the 215 submitted papers, the PC
selected 42 for presentation (plus 30 posters), which gives an acceptance
rate of around 20%. Many 'highly respected' conferences show a higher
acceptance rate. Some 60% of the submitted papers came from outside the
USA, as did 55% of the accepted ones (50% of the posters are
non-US-authored). This shows the fairness of a PC that could easily have
increased the rate of US acceptances, since 80% of its members come from
the US. Among the 14 accepted European papers, 6 come from Germany and 4
from Finland. There might be good European research that was not
represented. Nevertheless, these figures show clearly where the Americans
think good KDD research comes from.
Among the four invited talks (all from the US), Grinstein's underlined
the importance of good visualization, Ullman's of good DBMS, and Vapnik's
of good statistics. Vapnik showed that the guaranteed risk is bounded by
the sum of the empirical risk and a function of the complexity of the
class of functions describing the examples. If this complexity is too
high, apparently good experimental results may in reality yield a bad
risk.
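
[For reference, a standard form of this bound from the VC-theory
literature (my gloss, not a quote from the talk): with probability at
least 1 - \eta, a class of functions of VC dimension h fit to n examples
satisfies

    R(f) \;\le\; R_{\mathrm{emp}}(f)
        + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right)
                      - \ln\frac{\eta}{4}}{n}}

so that a large capacity h can swamp a small empirical risk.]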

The organizers put the accepted papers in 10 subfields that can be seen
naturally as KDD subtopics: Combining DM and ML, DM applications,
Decision-Trees and rule induction, Learning, Probability and Graphical
models, Mining with noise and missing data, Pattern-oriented DM, Prediction
and Deviation, Scalability and Extensibility of DM systems, Spatial, Text,
and Multimedia DM, Systems for mining large DB.

What comes out clearly from this conference is the wrongness of the idea
that KDD is 'just' plugging your favorite DM system (be it statistical or
ML) into a DB: the originality of the field is starting to emerge, as the
papers that compose it show.
In order to illustrate this point, let us classify the papers in a way
different from that of the KDD'96 PC.
There are 3 papers explaining generalities about KDD, 3 papers describing
large systems dealing with large DB, 8 papers describing scientific
problems specific to KDD, 12 papers describing a specific application
together with the technique developed for it, and, finally, 16 papers
that report scientific work on improving DM algorithms. These last 16
could be said to be presentable at another conference, but only 5 of them
treat extensions of ML algorithms, while 10 treat statistical problems,
and 1 an extension of DB methods. I do not see any other conference that
could bring together people of such different cultures so well.

Generalities about KDD

Fayyad, Piatetsky-Shapiro & Smyth present a unifying framework for
KDD and DM. They give the following definitions:
- 'KDD is the non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in the data.'
- 'DM is the step of KDD that applies the discovery algorithms to the
data.' (YK comment: ML is thus one of the DM techniques.)
They identify 7 DM tasks in KDD, as against the classical 2 of ML:
classification and clustering (as in ML), and then regression,
summarization, dependency modeling, and change and deviation detection.

[That paper also emphasized that KDD is a *process* of discovery,
with many steps, including feasibility analysis, data warehousing,
pre-processing, data mining, post-processing, interpretation, and integration
with other systems. The data mining step, which is the main focus of machine
learning research, frequently takes no more than 10-20% of the overall effort.
-- GPS]

Fayyad, Haussler and Stolorz present the problems specific to applying
KDD techniques in science (as opposed to the industrial applications,
which are treated by Piatetsky-Shapiro & al.).

Systems dealing with large DB

This year, 3 particularly interesting systems were chosen for
presentation:
- QUEST, from the IBM KDD project in Almaden, presented by Agrawal & al.
- DBMiner, from Simon Fraser's KDD project, presented by Han & al.
- DataMine, which approaches KDD from the point of view of extending SQL
queries, by Imielinski & al.

Scientific problems specific to KDD

Fulton & al. treat the problem of focusing for decision trees.
Mannila & Toivonen deal with the discovery of frequently occurring
episodes.
Zytkow & Zembowicz work on varying the granularity of the search in KDD.
Arning & al. revisit deviation detection.
Engels presents user guidance for using KDD tools.
Mannila & Toivonen (bis) give a definition of 'frequent sets' (see the
sketch after this list).
Wrobel & al. discuss principles and techniques of the multistrategy
approach to KDD.
Lagus & al. work on KDD in texts, with an application to the WWW. They
call their technique 'self-organizing maps'.
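
[To make 'frequent sets' concrete: a set of items is frequent if it
occurs in at least a given fraction of the transactions. Below is a
minimal Python sketch of the levelwise search such a definition
suggests; the toy baskets, names, and threshold are illustrative
assumptions, not Mannila & Toivonen's implementation:

    from itertools import combinations

    def frequent_sets(transactions, min_support):
        """Levelwise search for all itemsets whose support (the
        fraction of transactions containing them) >= min_support."""
        transactions = [frozenset(t) for t in transactions]
        n = len(transactions)
        # Level 1 candidates: every single item seen in the data.
        current = {frozenset([i]) for t in transactions for i in t}
        result, k = {}, 1
        while current:
            counts = {c: sum(c <= t for t in transactions)
                      for c in current}
            level = {c for c in current if counts[c] / n >= min_support}
            result.update({c: counts[c] / n for c in level})
            # Level k+1 candidates: unions of two frequent k-sets, kept
            # only if every k-subset is itself frequent (pruning).
            current = {a | b for a, b in combinations(level, 2)
                       if len(a | b) == k + 1
                       and all(frozenset(s) in level
                               for s in combinations(a | b, k))}
            k += 1
        return result

    # Hypothetical market-basket data: three frequent pairs emerge.
    baskets = [{'beer', 'chips'}, {'beer', 'chips', 'salsa'},
               {'chips', 'salsa'}, {'beer', 'salsa'}]
    print(frequent_sets(baskets, min_support=0.5))
]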

Specific applications

Some papers focus so much on their application that they present
algorithms that are hardly generalizable to other problems. Such are the
papers of Stolorz & Dean (application to earthquake detection), Czyzewski
(application to audio data), Hofacker & al. and Wang & al. (applications
to macro-molecules), Pfitzner & Salmon (interpretation of simulations,
with an application to astronomy), and Shek & al. (interpretation of
simulations, with an application to geosciences).
The paper of Fawcett & Provost is an application to fraud detection. They
combine various ML systems to solve their problem.
The paper of Ciesielsky & Palstra is an application to marketing. They
work on time series with a combination of neural nets (NN) and
knowledge-based systems (KBS).
The paper of de la Iglesia & al. is an application to financial data. They
use genetic algorithms (GA).
The paper of Provan & Singh is a medical application using Bayesian learning.
The paper of Wirth & Reinartz is an application to the automotive
industry. They show the importance of focusing on characteristic parts
of the data.
The paper of Masand & Piatetsky-Shapiro is a marketing application in which
they try to maximize business pay-off. Their interesting general conclusion
is that pay-off is not usually linked to maximum accuracy.

Improving DM algorithms

+ improving ML algorithms
Chan & Stolfo, Kaufman & Michalski, and Tsumoto & Tanaka take into account
background knowledge (BK) during the inductive process.
Ittner works on the problem of feature creation for tree induction, and
Lakshminarayan & al. deal with missing data using a combination of
Autoclass and C4.5.

+ improving statistical algorithms
Two papers deal with clustering algorithms: Smyth's, and that of Ester &
al., who introduce a new density-based 'distance' measure (see the
sketch below).
Domingos performs linear-time rule induction without loss of accuracy.
Feelders learns with missing data.
Kohavi discretizes continuous features.
Musick builds belief networks with NN.
Stolorz & Chew work with hidden Markov models.
Kontkanen & al. present a Bayesian approach to estimating the expected
predictive performance of a model.
Lange deals with problems of independence/dependence of data and uses NN.
Kohavi & al. show that naive Bayes on large DB does not scale as well as
decision trees do. They propose a new method: a hybrid of decision tree
and naive Bayes.
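
[A minimal sketch of the density-based idea in the spirit of Ester &
al.'s clustering paper (the parameters, helper names, and toy data are
assumptions for illustration, not their code): a point with enough
neighbors within a small radius is 'dense', clusters grow by absorbing
the neighborhoods of dense points, and points reachable from no dense
point are noise.

    def density_clusters(points, eps, min_pts):
        """Label each point with a cluster id, or -1 for noise."""
        def neighbors(i):
            return [j for j in range(len(points))
                    if sum((a - b) ** 2 for a, b in
                           zip(points[i], points[j])) <= eps ** 2]

        labels = [None] * len(points)      # None = not yet visited
        cluster = -1
        for i in range(len(points)):
            if labels[i] is not None:
                continue
            nbrs = neighbors(i)
            if len(nbrs) < min_pts:
                labels[i] = -1             # provisionally noise
                continue
            cluster += 1                   # i is dense: new cluster
            labels[i] = cluster
            queue = [j for j in nbrs if j != i]
            while queue:
                j = queue.pop()
                if labels[j] == -1:
                    labels[j] = cluster    # border point, not expanded
                if labels[j] is not None:
                    continue
                labels[j] = cluster
                jn = neighbors(j)
                if len(jn) >= min_pts:     # j is dense: keep growing
                    queue.extend(jn)
        return labels

    # Hypothetical 1-D data: two dense groups and one outlier.
    pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,), (9.9,)]
    print(density_clusters(pts, eps=0.3, min_pts=2))  # [0,0,0,1,1,1,-1]
]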
+ improving DB algorithms
Shen & Leng propose a generator of metaqueries.


Frankly, I think that many interesting results were presented during
the poster session as well. We did not have enough time to talk with all
the poster presenters, and fairness prevents me from describing only the
few I had time to understand better. Look at the proceedings!
[see http://www.aaai.org/Publications/Press/Catalog/KDD/han.html for the
proceedings. GPS]


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Date: Fri, 23 Aug 1996 09:03:08 +1000 (EST)
From: Alan Tickle (tickle@fit.qut.edu.au)
To: kdd@gte.com
Subject: Occam's Razor and data mining

The following press release appeared in the newsgroup comp.risks. Are
you aware of any comments?

Date: Sat, 3 Aug 96 09:44 EDT
From: 'Peter M. Weiss +1 814 863 1843' (PMW1@PSUVM.PSU.EDU)
Subject: Occam's Razor debunked

QUADNET, 2 AUG 1996
RAZOR THEORY LEFT IN SHREDS

A Deakin University academic has cast new light on a basic philosophical and
scientific problem that has been subject to debate for more than 2000 years.
And this has thrown doubt on the accuracy of many data analysis techniques
commonly used in business computing.

'Occam's razor, a principle dating back at least as far as Aristotle,
suggests that we should accept the simplest explanation consistent with all
the known facts. This previously untested principle is widely used in
current scientific practice,' said Dr Geoff Webb, of Deakin's School of
Computing and Mathematics in Australia.

Dr Webb has put this principle to the test, and found it wanting.

Occam's razor is a guiding principle in computer programs known as
machine learning systems, a form of artificial intelligence. Tasks
commonly tackled by machine learning cover such diverse areas as medical
diagnosis and identification of glass fragments collected at the scene
of an accident.

Dr Webb's research has found that when put into practice Occam's razor
doesn't work. 'The results are clear cut: Occam's razor is worse than blunt,
it is truly disposable,' he said.

To test the theory, Dr Webb modified a widely used machine learning system
that uses Occam's razor combined with a principle based on the assumption of
similarity. The modified version of the system abandoned Occam's razor and
relied solely on the principle of similarity. The theories developed by the
modified system were more accurate than the version that used Occam's razor.
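
[As a toy illustration of this kind of experiment (a sketch under
assumed data and parameters, not Dr Webb's actual system): compare a
complexity-limited learner (an Occam-style bias) with a less restricted
one under cross-validation and see which generalizes better.

    # Assumes scikit-learn is installed; the dataset and depth limit
    # are arbitrary illustrative choices.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        'simple (Occam)': DecisionTreeClassifier(max_depth=3,
                                                 random_state=0),
        'complex':        DecisionTreeClassifier(random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=10)
        print(f'{name:15s} mean accuracy: {scores.mean():.3f}')

On some datasets the simpler tree wins, on others the unrestricted one
does; the point at issue is whether the preference for simplicity is a
guarantee or merely an empirical bias.]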

Among the many users of machine learning systems is a new wave of computer
scientists calling themselves 'data miners'. These scientists use machine
learning systems based on Occam's razor to extract information from vast
quantities of data.

Data miners are employed by retailers to identify new customers; by the
taxation office to identify tax fraud; by banks to help decide who should
receive loans; by stock brokers to select investments; and recently by
astronomers to identify 16 new quasars.

'Data mining seeks to extract information from data. By using Occam's
razor, data miners are potentially missing much of the information in the
data. That translates directly into missed business opportunities,' Dr Webb
warns.

'Occam's razor guides the user to look for simple explanations. But what
good are simple explanations of a complex world?' he said.

Dr Geoff Webb can be contacted on webb@deakin.edu.au

Issued by:
David Bruce, Media Manager, Deakin University, Australia
Phone: 61 3 9244 5268 Email: db@deakin.edu.au


[ It is an interesting article, but a single experiment done by Dr Webb
certainly does not 'debunk' Occam's razor. What it shows is that one can
find datasets and assumptions where complex solutions will do better.
Similar results were shown by Cullen Schaffer at previous ML conferences.
Occam's razor is not ALWAYS true -- but it holds in most
real-world situations. GPS]


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 30 Aug 1996 11:12:56 -0700
From: mazlack@UC.EDU (mazlack@orodruin.cs.berkeley.edu)

BISC SPECIAL INTEREST GROUP IN DATABASE MINING - ANNOUNCEMENT

BISC (Berkeley Initiative in Soft Computing) is forming a special interest
group in Database Mining (BISC-DBM). The group will focus on soft
computing approaches to database mining and is intended to be a
communication resource.

This group is not intended to be a general SIG for database mining;
rather, it focuses on tools that embrace the view that data and the
extracted results are implicitly imprecise.

Database mining seeks to extract previously unrecognized information from
data stored in conventional databases. Database mining has also been called
'database exploration' and 'knowledge discovery in databases'.

Lawrence Mazlack has agreed to be chairman of this special interest group.
If you are interested in joining this group, please contact him at
'mazlack@uc.edu', with copies to 'zadeh@cs.berkeley.edu' and
'leem@cs.berkeley.edu'.

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: Marney Smyth (marney@ai.mit.edu)
Subject: Intensive Tutorial: Learning Methods for Prediction, Classification
Date: Wed, 4 Sep 1996 20:52:30 -0400 (EDT)

The short course by Geoffrey Hinton and Michael Jordan described
below still has openings for September 20-21 in Cambridge, MA.
We are also announcing a course for December 14-15 in Los Angeles,
CA. If you wish to attend either course, please complete the
Registration Form attached below, and return it as soon as possible.


**************************************************************
*** ***
*** Learning Methods for Prediction, Classification, ***
*** Novelty Detection and Time Series Analysis ***
*** ***
*** Cambridge, MA, September 20-21, 1996 ***
*** Los Angeles, CA, December 14-15, 1996 ***
*** ***
*** Geoffrey Hinton, University of Toronto ***
*** Michael Jordan, Massachusetts Inst. of Tech. ***
*** ***
**************************************************************


A two-day intensive Tutorial on Advanced Learning Methods will be held
on September 20 and 21, 1996, at the Royal Sonesta Hotel, Cambridge, MA,
and on December 14 and 15, 1996, at Loews Hotel, Santa Monica, CA.
Space is available for up to 50 participants for each course.

The course will provide an in-depth discussion of the large collection
of new tools that have become available in recent years for developing
autonomous learning systems and for aiding in the analysis of complex
multivariate data. These tools include neural networks, hidden Markov
models, belief networks, decision trees, memory-based methods, as well
as increasingly sophisticated combinations of these architectures.
Applications include prediction, classification, fault detection,
time series analysis, diagnosis, optimization, system identification
and control, exploratory data analysis and many other problems in
statistics, machine learning and data mining.

The course will be devoted equally to the conceptual foundations of
recent developments in machine learning and to the deployment of these
tools in applied settings. Case studies will be described to show how
learning systems can be developed in real-world settings. Architectures
and algorithms will be presented in some detail, but with a minimum of
mathematical formalism and with a focus on intuitive understanding.
Emphasis will be placed on using machine learning methods as tools that
can be combined to solve the problem at hand.

WHO SHOULD ATTEND THIS COURSE?

The course is intended for engineers, data analysts, scientists,
managers and others who would like to understand the basic principles
underlying learning systems. The focus will be on neural network models
and related graphical models such as mixture models, hidden Markov
models, Kalman filters and belief networks. No previous exposure to
machine learning algorithms is necessary although a degree in engineering
or science (or equivalent experience) is desirable. Those attending
can expect to gain an understanding of the current state-of-the-art
in machine learning and be in a position to make informed decisions
about whether this technology is relevant to specific problems in
their area of interest.

COURSE OUTLINE

Overview of learning systems; LMS, perceptrons and support vectors;
generalized linear models; multilayer networks; recurrent networks;
weight decay, regularization and committees; optimization methods;
active learning; applications to prediction, classification and control

Graphical models: Markov random fields and Bayesian belief networks;
junction trees and probabilistic message passing; calculating most
probable configurations; Boltzmann machines; influence diagrams;
structure learning algorithms; applications to diagnosis, density
estimation, novelty detection and sensitivity analysis

Clustering; mixture models; mixtures of experts models; the EM
algorithm; decision trees; hidden Markov models; variations on
hidden Markov models; applications to prediction, classification
and time series modeling

Subspace methods; mixtures of principal component modules; factor
analysis and its relation to PCA; Kalman filtering; switching
mixtures of Kalman filters; tree-structured Kalman filters;
applications to novelty detection and system identification

Approximate methods: sampling methods, variational methods;
graphical models with sigmoid units and noisy-OR units; factorial
HMMs; the Helmholtz machine; computationally efficient upper
and lower bounds for graphical models


#####

REGISTRATION - PLEASE PRINT OUT, FILL IN and RETURN BY MAIL.

Standard Registration: $700 / Student Registration: $400

Registration fee includes course materials, breakfast, coffee breaks,
and lunch on Saturday. Those interested in participating should return
the completed Registration Form and Fee as soon as possible.



Learning Methods for Prediction, Classification,
Novelty Detection and Time Series Analysis


September 20 - September 21, 1996
Cambridge, Massachusetts USA


Please complete this form (type or print)


Name ___________________________________________________
Last First Middle

Firm or Institution ______________________________________

Mailing Address (for receipt) _________________________

__________________________________________________________

__________________________________________________________

__________________________________________________________
Country Phone FAX

__________________________________________________________
email address


(Lunch Menu, Saturday September 21st - tick as appropriate):


___ Vegetarian ___ Non-Vegetarian




Fee payment must be made by MONEY ORDER or PERSONAL CHECK. All
amounts are given in US dollar figures. Make fee payable to
Prof. Michael Jordan. Mail it, together with this completed
Registration Form to:

Professor Michael Jordan
CBCL
Dept. of Brain and Cognitive Sciences
M.I.T.
E10-034D
79 Amherst St.
Cambridge, MA 02139 USA


DEADLINE: Registration before September 16, 1996. DO NOT SEND CASH.

#####

ADDITIONAL INFORMATION

A Registration Form, along with additional information, is also
available from the course's WWW page at

http://www.ai.mit.edu/projects/cbcl/web-pis/jordan/course/index.html


For further information contact: Marney Smyth
Phone: 617 258-8928
Fax: 617 258-6779
e-mail: marney@ai.mit.edu



>~~~Positions:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 23 Aug 1996 16:56:37 -0700
From: David Wolpert (dhw@almaden.ibm.com)
To: kdd@gte.com
Subject: Job openings


*** Job Announcements. Please distribute. ***


The Web is currently dumb.

Join our team at IBM net.Mining; we are making the web intelligent.

We currently have immediate need to fill positions at our Almaden
Research Center facility in the south of Silicon Valley. net.Mining is
a sub-organization of IBM Data Mining Solutions, a rapidly expanding
group that also has openings (see recent postings). IBM is an equal
opportunity employer.





Scientific Programmers

Responsibilities: Interact with the Machine Learning Researchers to
implement new web-based algorithms as code, verify the code, and test
the algorithms in real-world environments. Must be able to work
independently.

Qualifications: Bachelors or equivalent in computer science,
statistics, mathematics, physics, or an equivalent field. Higher
degree highly desirable. Extensive experience implementing numeric
code, especially in machine learning, statistics, neural nets, or a
similar field. Familiarity with college-level mathematics
(multi-variable calculus, differential equations, linear algebra,
etc.). 2 or more years of experience with C/C++ in a research or
commercial environment. Knowledge of Internet technologies highly
desirable.



Machine Learning Researchers

Responsibilities: Develop new algorithms applying machine learning and
associated technologies to the web. Develop new such
technologies. Work with the Scientific Programmers to implement and
investigate those algorithms and technologies in the real world.

Qualifications include: PhD or equivalent in computer science,
statistics, mathematics, physics, or an equivalent field, with an
emphasis on machine learning, statistics, neural nets, or a
similar field. Strong background in mathematics. Experience with
C/C++ highly desirable. Knowledge of information retrieval and/or
indexing systems, text mining, and/or knowledge of Internet
technologies, all highly desirable.



>~~~Meetings:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Sat, 31 Aug 1996 02:36:18 -0700
From: Brian Gaines (gaines@cpsc.ucalgary.ca)
Subject: CFP: AAAI SS on Artificial Intelligence in Knowledge Management

CALL FOR PAPERS

AAAI Spring Symposium, Stanford University, March 24-26, 1997

Artificial Intelligence in Knowledge Management

Knowledge Management (KM) is a topic of growing interest to large
organizations.

It comprises activities focused on the organization acquiring knowledge
from many sources, including its own experience and that of others, and
on the effective application of that knowledge to fulfill the mission of
the organization.

The knowledge management community has been eclectic in drawing from many
sources for its methodologies and tools. Typical approaches to the management
of knowledge are based on concept maps, hypermedia and object-oriented
databases. Techniques developed in artificial intelligence for knowledge
acquisition, representation and discovery are seen as relevant to KM.
However, there is as yet no unified underlying theory for KM, and the
scale of the problem in large organizations is such that most existing
AI tools cannot be applied in their current implementations.

The objective of this symposium is to bring together KM practitioners
and applied AI specialists from KA, KR and KDD, and attempt to formulate
the potential role of various AI sub-disciplines in knowledge management.

Submissions are requested from those with in-depth knowledge and experience
in AI topics relevant to knowledge management. Papers and presentations
should address the issues of the requirements and foundations for KM, the
applicability of existing AI theories, methodologies and tools to KM, and
the future development of KM in relation to AI. Of particular interest are
requirements analyses from those responsible for the development and
implementation of knowledge management systems.

Ongoing information on the symposium will be available through the web
at http://ksi.cpsc.ucalgary.ca/AIKM97/.

Submission Information

Potential attendees should submit either a paper (not exceeding 5000
words), a summary of an ongoing development or lessons learnt, or a
statement of interest (not exceeding 1000 words). All submissions should
be electronic, in PostScript, uploaded to
ftp://ksi.cpsc.ucalgary.ca/incoming in a file named after the submitter,
with an email to gaines@cpsc.ucalgary.ca giving the title, authors,
affiliation and abstract of the submission.

Organizing Committee

Rose Dieng (INRIA, France),
Brian R. Gaines (Co-Chair, University of Calgary, Canada),
Gertjan van Heijst (Kenniscentrum CIBIT, The Netherlands),
Dickson Lukose (University of New England, Australia),
Frank Maurer (University of Kaiserslautern, Germany),
Mark A. Musen (Co-Chair, Stanford University, USA),
Ramasamy Uthurusamy (Co-Chair, General Motors, USA).



>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~