
Data Mining and Knowledge Discovery Nuggets 96:18, e-mailed 96-06-04

Contents:
News:
* GPS, Computerworld: Red Brick digs in to data mining market, 6/3/96,
http://www.computerworld.com/search/AT-html/9606/960603SL23brick.html
* P. Domingos, Newsweek and Wired June 1996 on Data Mining
* GPS, What's New at KD Mine (May 1996)
http://info.gte.com/~kdd/what-is-new.html
Publications:
* R. Kohavi, Test drive different algorithms on your problem,
ftp://starry.stanford.edu/pub/ronnyk/mlc96.ps.Z
* L. Breiman, paper on BIAS, VARIANCE AND ARCING
ftp://ftp.stat.berkeley.edu/users/breiman
* L. Breiman, Distribution Based Trees Are More Accurate,
ftp://ftp.stat.berkeley.edu/users/breiman/DB-CART.ps
Siftware:
* A. Zighed, SIPINA_W version 1.3,
http://eric.univ-lyon2.fr/eric.html
Positions:
* M. Stonebraker, (DBWORLD) Informix opportunities
Meetings:
* S. Anand, Workshop on Data Mining at Basel, Oct 1996
http://iserve1.infj.ulst.ac.uk:8080/pakm96.html

--
Data Mining and Knowledge Discovery community,
focusing on the latest research and applications.

Contributions are most welcome and should be emailed,
with a DESCRIPTIVE subject line (and a URL, when available) to (kdd@gte.com).
E-mail add/delete requests to (kdd-request@gte.com).

Nuggets frequency is approximately weekly.
Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
and a wealth of other information on Data Mining and Knowledge Discovery
are available at the Knowledge Discovery Mine site, URL http://info.gte.com/~kdd.

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories) *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
'From the moment I picked your book up until I laid it down I was convulsed
with laughter. Some day I intend reading it.'
- Groucho Marx (1895-1977)


>~~~News~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 04 Jun 1996 11:57:45
From: Gregory Piatetsky-Shapiro (gps0@gte.com)
Subject: Red Brick digs in to data mining market 6/3/96
X-Url: http://www.computerworld.com/search/AT-html/9606/960603SL23brick.html


Red Brick digs in to data mining market

by Dan Richman, Computerworld, 06/03/96
Red Brick Systems, Inc. next Monday will become the first
relational database management system vendor to announce integrated
data mining.

Red Brick Data Mine, which will be built into Version 5.0 of Red
Brick Warehouse, will let users engage in categorization analysis, a
form of data mining that deals with the effect of unknown variables on
outcomes.

Warehouse 5.0 is set to ship by Dec. 1. It will cost at least
$15,000 to buy the license that activates Data Mine, which is built on
technology licensed from DataMind, Inc. in Redwood City, Calif.



A categorization-analysis query by a telecommunications company
might ask, 'What are the characteristics of customers who switch
long-distance carriers?' The response might list factors not
anticipated by the user, such as spending more than $75 per month on
long-distance service, subscribing to a calling plan and living in a
large city.

In contrast, a conventional, non-data-mining query asks questions to which
the answers are foreseeable, such as, 'For customers lost to a competitor,
what was the average monthly spending on long distance?'
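
To make the contrast concrete, here is a purely illustrative Python sketch.
The toy data, the feature names, and the decision tree are invented for this
note (Red Brick's Data Mine is built on DataMind's technology, not on the
tree shown); the point is only that the conventional query fixes the shape
of its answer in advance, while the categorization analysis lets the
algorithm surface the characteristics itself.

# Illustrative sketch only; toy data and model, not Red Brick's engine.
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

# toy customer records: [monthly_spend, on_calling_plan, lives_in_big_city]
X = np.array([
    [80, 1, 1], [90, 1, 1], [20, 0, 0], [30, 0, 1],
    [85, 1, 0], [15, 0, 0], [95, 1, 1], [25, 0, 0],
])
switched = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # 1 = switched carriers

# Conventional query: the question fixes the answer's shape in advance.
avg_spend_lost = X[switched == 1, 0].mean()
print("Average monthly spend of lost customers: $%.2f" % avg_spend_lost)

# Categorization analysis: the algorithm reports the characteristics.
tree = DecisionTreeClassifier(max_depth=2).fit(X, switched)
print(export_text(tree, feature_names=["spend", "plan", "big_city"]))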

Users will be able to run traditional and data-mining queries
interchangeably against data stored in Warehouse 5.0. No other
relational DBMS offers that capability, said Brian Murphy, a senior
analyst at The Yankee Group in Boston.

'Being able to do data mining alongside conventional queries
would add a valuable piece to our arsenal of analytical tools,' said
Bob Chin, chief information officer at Healthsource, Inc., a
managed-care company in Hooksett, N.H.

'DataMind's technology alone was attractive, but an integration
with Warehouse, where we have half a terabyte of medical claims and
practice-pattern data, would be compelling,' he said.


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Subject: Re: Data mining story in Wired 4.06
Date: Wed, 29 May 1996 22:41:15 -0700
From: 'Pedro M. Domingos' (pedrod@ruffles.ICS.UCI.EDU)

This week's Newsweek has a story on banking
that highlights the role of data mining in the major changes taking
place. It's one of the strongest pieces of evidence on the practical
impact of data mining I've yet seen. The next-to-last paragraph,
showing some of the less-obvious but crucial things that have been
discovered, is especially interesting. I read it at the airport,
though, and I haven't gotten hold of the issue yet.

Also, here is the excerpt from Wired.

THE BRAIN AND THE BADGE

by Taras Grescoe

Wired 4.06 (June 1996)

Want to finger a bad cop? Sometime this summer, the Chicago Police
Department will begin using a crude neural network program to do just
that. BrainMaker Professional will sort through the files of more than
12,000 officers and spew out a list of cops whose records suggest
they're drinking too much, shaking down store owners, or otherwise
screwing up.

In test runs, Internal Affairs says BrainMaker picked up subtle
patterns that eluded experienced supervisors. IA claims it would take
a staff of 30 investigators to get the same results. The software
costs US$795.

Hundreds of forces - including Detroit's and even Amsterdam's - have
expressed interest. Looks like the Mark Fuhrmans of this world can
run, but they'll have a harder time hiding.


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Wed, 29 May 1996 18:24:25 -0400
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: What is new at KD Mine

May 29, 1996: updates to the Corporate, Meetings, Siftware, and
Other servers pages.

May 20, 1996: updates to the Siftware and Other servers pages.

May 15, 1996: updates to the Corporate page.

May 13, 1996: updates to the Corporate, Meetings, and Siftware pages.

The full lists of new items are at http://info.gte.com/~kdd/what-is-new.html.


>~~~Publications:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 30 May 1996 21:37:31 -0700
From: Ronny Kohavi (ronnyk@starry.engr.sgi.com)
To: KDD list (kdd@gte.com)
Subject: Test drive different algorithms on your problem


Recently we have seen many claims (sometimes seemingly contradictory) such as:

1. Very simple classification rules perform well on most commonly
used datasets (Holte, 1993).
2. There is no free lunch: no algorithm can perform better than
any other on average if all targets are equiprobable (Wolpert, 1994).
3. 'It cannot be emphasized enough that no claim whatsoever is being
made in this paper that all algorithms are equivalent *in practice*,
in the real world. In particular, no claim is being made that one
should not use cross-validation in the real world' (Wolpert, 1994).
4. Generalization is a zero-sum enterprise; for every performance gain in
some subclass of learning situations there is an equal and
opposite effect in others (Schaffer, 1994).
5. 'Rules are More than Trees' (Parsaye, slide title at the VLDB summit, 1996).
6. In your paper, you should have compared with C4.5rules and not C4.5,
since it is known to be *much* better (naive reviewer).
7. '[Converting trees to rules] leads to a production rule classifier
that is usually about as accurate as a pruned tree...' (Quinlan, 1993).
8. Decision-tree induction can easily be made asymptotically Bayes optimal
(Gordon and Olshen, 1978, 1984).
(BTW, no such claim has ever been made for decision rule induction.)
9. Nearest neighbors are Bayes optimal. Asymptotically no algorithm
can do better (Fix and Hodges, 1951).
10. Neural network and statistical methods do better in some areas
and Machine Learning procedures in others (Brazdil and Henery in
the Statlog book, p. 175).

We have recently done an experiment comparing 17 algorithms on 8 large
datasets at the UC Irvine repository. The comparison is similar to
the StatLog comparison with two major differences: all datasets are at
UCI, and all algorithms can be trivially run from MLC++ by setting a
few environment variables.

The following paper includes a description of the claims made above
and what we believe is the right way to proceed if you are only
interested in accuracy: try a few algorithms on your specific problem.
(In practice, of course, comprehensibility in the KDD cycle may be
much more important than initial accuracy.)
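
In that spirit, a minimal modern sketch of the 'try a few algorithms'
advice follows. MLC++ itself is a C++ library driven by environment
variables, so scikit-learn and one of its bundled datasets stand in here
purely for illustration; the candidate algorithms are arbitrary choices.

# A hedged sketch of comparing several classifiers by cross-validation;
# scikit-learn stands in for MLC++, and the dataset is arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "majority (baseline)": DummyClassifier(strategy="most_frequent"),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "nearest neighbors": KNeighborsClassifier(),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print("%-20s accuracy = %.3f +/- %.3f" % (name, scores.mean(), scores.std()))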

The paper is geared towards end-users not necessarily familiar with
machine learning. Some theoretical references are left to the
appendix.

Our results show that while there are no clear winners throughout (as
expected), 1R and T2 (very simple classifiers) were very poor
performers in general. A description of MLC++ and of the actual
comparisons (and how to repeat them) is provided in the following paper.


Data Mining using MLC++
A Machine Learning Library in C++

Ron Kohavi and Dan Sommerfield, Data Mining and Visualization,
Silicon Graphics, Inc. ({ronnyk,sommda}@engr.sgi.com)
James Dougherty, Platform Group, Sun Microsystems (jamesd@eng.sun.com)


The paper is available at ftp://starry.stanford.edu/pub/ronnyk/mlc96.ps.Z
or under publications off http://robotics.stanford.edu/~ronnyk

--

Ronny Kohavi (ronnyk@sgi.com, http://robotics.stanford.edu/~ronnyk)


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: Leo Breiman (leo@stat.Berkeley.EDU)
Date: Wed, 29 May 1996 20:36:43 -0700
Subject: paper


Some of the kdd people may be interested in my paper titled

BIAS, VARIANCE AND ARCING

It shows how to construct the world's best off-the-shelf
classifier and tries to explain why it works. The explanation
is not yet really satisfactory. The paper can be obtained from
the ftp machine ftp.stat.berkeley.edu under users/breiman as

arcall.ps.Z

Here is the abstract:


ABSTRACT

Recent work has shown that combining multiple versions of unstable
classifiers such as trees or neural nets results in reduced test set error.
To study this, the concepts of bias and variance of a classifier are
defined. Unstable classifiers can have universally low bias; their problem
is high variance. Combining multiple versions is a variance-reducing
device. One of the most effective is bagging (Breiman [1996a]): modified
training sets are formed by resampling from the original training set,
classifiers are constructed using these training sets, and their outputs
are combined by voting. Freund and Schapire [1995, 1996] propose an
algorithm whose basis is to adaptively resample and combine (hence the
acronym, arcing), so that the weights in the resampling are increased for
those cases most often misclassified, and the combining is done by weighted
voting. Arcing is more successful than bagging in variance reduction. We
explore two arcing algorithms, compare them to each other and to bagging,
and try to understand how arcing works.
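
The abstract describes both procedures well enough to sketch them. Below is
a minimal Python illustration, assuming an arc-x4-style reweighting (one of
the variants Breiman studies) rather than the paper's exact experimental
setup; the dataset, tree settings, and number of iterations are arbitrary.

# Bagging: resample uniformly, fit, combine by unweighted voting.
# Arcing (arc-x4 style): a case's resampling weight grows with how often
# it has been misclassified so far. Illustrative sketch only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)
n, T = len(ytr), 25

def vote(models, X):
    preds = np.array([m.predict(X) for m in models])
    return (preds.mean(axis=0) > 0.5).astype(int)  # majority vote

bagged = []
for _ in range(T):
    idx = rng.integers(0, n, n)      # uniform bootstrap resample
    bagged.append(DecisionTreeClassifier().fit(Xtr[idx], ytr[idx]))

arced, miss = [], np.zeros(n)
for _ in range(T):
    w = 1 + miss ** 4                # arc-x4 weights
    idx = rng.choice(n, n, p=w / w.sum())
    t = DecisionTreeClassifier().fit(Xtr[idx], ytr[idx])
    miss += t.predict(Xtr) != ytr    # update misclassification counts
    arced.append(t)

print("bagging test accuracy:", (vote(bagged, Xte) == yte).mean())
print("arcing test accuracy: ", (vote(arced, Xte) == yte).mean())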

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Sun, 2 Jun 1996 18:24:47 -0700
From: leo@stat.Berkeley.edu (Leo Breiman)

Greetings,

Could you put this on the kdd bulletin board, please.

The following paper is available on the ftp machine
ftp.stat.berkeley.edu in the directory users/breiman as the file
DB-CART.ps

DISTRIBUTION BASED TREES ARE MORE ACCURATE

Nong Shang                         Leo Breiman
School of Public Health            Statistics Department
University of California           University of California
shang@stat.berkeley.edu            leo@stat.berkeley.edu

ABSTRACT

Classification trees are attractive in that they present a simple and
easily understandable structure. But on many data sets their accuracy is
far from optimal. Much of this lack of accuracy is due to their
instability: small changes in the data can lead to large changes in the
resulting tree. This instability is the reason that combining many trees
by voting can lead to dramatic decreases in test set error
(Breiman [1995]). But combining trees loses the simple structure. To keep
the simple structure and improve accuracy, a way must be found to reduce
the instability in the construction. If we knew the true probability
distribution of the inputs and outputs, then the splits in the tree could
be based on this distribution and give more accuracy than splits based on
a finite data set. So we turn the tree procedure around: instead of basing
the splits on the data, the data is used to estimate the input-output
probability distribution, and the splits are then based on this estimate.
We give the details of this construction. The experimental results on a
number of well-known data sets indicate that this procedure has potential
for producing much more accurate trees.
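
The paper gives the actual construction; as a crude illustration of the
idea only, the sketch below estimates a per-class density (a Gaussian KDE,
an assumption made here, not the paper's estimator), draws a large smoothed
sample from it, and grows one ordinary tree on that sample, so the splits
follow the distribution estimate rather than the raw finite data set.

# Crude approximation of distribution-based splitting; dataset, KDE,
# sample size, and tree depth are all arbitrary illustrative choices.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

Xs, ys = [], []
for c in np.unique(y):
    kde = gaussian_kde(X[y == c].T)           # density estimate for class c
    draw = kde.resample(2000, seed=int(c)).T  # large smoothed sample
    Xs.append(draw)
    ys.append(np.full(2000, c))

# The tree's splits now reflect the estimate, not the raw sample.
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(np.vstack(Xs), np.concatenate(ys))
print("accuracy on the original data:", tree.score(X, y))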


>~~~Siftware:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 4 Jun 1996 11:50:06 +0200 (MET DST)
From: zighed@univ-lyon2.fr (abdelkader.zighed)
Subject: Knowledge engineering tools: SIPINA_W version 1.3

Dear Gregory,

The new version (v1.3) of SIPINA_W is ready.
Could you please insert this description as a contribution in
the next issue of KDD Nuggets?
Thank you very much
Best regards
D.A. Zighed
-------------SIPINA v1.3----------------------------
DESCRIPTION
----------------------------------------------------
This version contains several induction-graph methods and some tools
for evaluating rule bases. We summarise the main modules here:

a - Import / Manipulation of data
-----------------------------
The SIPINA_W files (.DAT) have the ASCII format, but you may import data
directly from databases in dBase format (.DBF) or Paradox format (.DB), or
you may export data from a Lotus spreadsheet (.WKS).
Continuous data may be recoded by different contextual or non-contextual
discretisation methods (a rough sketch of the chi-square merging idea
appears after this list):
* Chi-Merge [Kerber 1992],
* MDLPC [Fayyad & Irani 1992],
* FUSINTER [Zighed 1995],
* FUSBIN [Zighed 1996] (new).
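
Here is a rough Python sketch of bottom-up chi-square merging in the spirit
of Chi-Merge [Kerber 1992]; SIPINA_W's own implementation (and the criteria
of MDLPC, FUSINTER, and FUSBIN) differ, and the max_intervals stopping rule
below is a simplification of Kerber's chi-square threshold.

# Start with one interval per distinct value; repeatedly merge the pair of
# adjacent intervals whose class distributions are most similar.
import numpy as np

def chi2(row_a, row_b):
    # Pearson chi-square of a 2 x k table; +0.5 keeps expected counts > 0
    obs = np.array([row_a, row_b], dtype=float) + 0.5
    exp = obs.sum(1, keepdims=True) * obs.sum(0) / obs.sum()
    return ((obs - exp) ** 2 / exp).sum()

def chi_merge(values, labels, max_intervals=4):
    values, labels = np.asarray(values), np.asarray(labels)
    classes = np.unique(labels)
    cuts = np.unique(values)  # sorted left edges, one interval per value
    counts = [[np.sum((values == v) & (labels == c)) for c in classes]
              for v in cuts]
    while len(counts) > max_intervals:
        chis = [chi2(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        i = int(np.argmin(chis))  # most similar adjacent pair
        counts[i] = [a + b for a, b in zip(counts[i], counts[i + 1])]
        del counts[i + 1]
        cuts = np.delete(cuts, i + 1)
    return cuts  # left edges of the merged intervals

print(chi_merge([1, 2, 3, 7, 8, 9, 20, 21, 22],
                [0, 0, 0, 1, 1, 1, 0, 0, 0], max_intervals=3))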

b - Methods
-------
Several methods are implemented:
* CART [Breiman & al. 1984], complete program proposing two criteria (Twoing Rule,
Gini index), as well as the pruning algorithm;
* Elisee [Bouroche & Tenenhaus 1970], binary segmentation method using the Chi-2
criterion;
* ID3 [Quinlan 1979/1986];
* C4.5 [Quinlan 1992], includes pruning and the simplification of rules;
* Chi2-link: a method using the Chi-2 critical probability as the selection
criterion, cf. Mingers 1987;
* SIPINA [ZIGHED 1985/1992]: generalisation of trees by induction graphs, including
dynamically the discretisation methods seen above.

c - Tests and Evaluation
---------------------
You may divide the data file into a learning sample and a test sample, then
execute the data processing on the first one and generate the rules, followed
by validation on the second sample. You may also activate a cross-validation,
where the draw of each sub-sample may be either random or stratified.
You can also use a bootstrap procedure (new); the three schemes are sketched
below.
The rules resulting from each analysis are saved in different bases. Rules
can be saved in production-rule format or in Prolog (new).
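
For concreteness, here is a brief sketch of the three evaluation schemes
just described (holdout, stratified cross-validation, bootstrap), with
scikit-learn standing in for SIPINA_W's own dialogs; the dataset and
classifier are arbitrary.

# Holdout, stratified cross-validation, and bootstrap evaluation.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# holdout: a learning sample and a (stratified) test sample
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
print("holdout:", clf.fit(Xtr, ytr).score(Xte, yte))

# stratified 10-fold cross-validation
print("10-fold CV:", cross_val_score(clf, X, y, cv=StratifiedKFold(10)).mean())

# bootstrap: learn on a resample, test on the out-of-bag cases
rng = np.random.default_rng(0)
idx = rng.integers(0, len(y), len(y))
oob = np.setdiff1d(np.arange(len(y)), idx)
print("bootstrap OOB:", clf.fit(X[idx], y[idx]).score(X[oob], y[oob]))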

d - Automatic / Interactive Learning
---------------------------------
When using the automatic learning procedure, you only have to choose the
method and execute the analysis. The interactive learning mode enables you
to force the operations to be executed (Split, Merge), as well as the
variables used (surrogate split), at each vertex. The vertex inspection
makes it possible to visualise the available information on the selected
vertex: distribution of the classes, list of observations, distribution
function of each variable, and the variables' power on competing splits.

e - Advanced manipulations of rules
-------------------------------

e.1. Extracting rules
The generation of rules from the graph has been improved. From each
non-initial vertex it is now possible to produce prediction rules, which can
be evaluated through:
* their error rate,
* their corresponding number of observations,
* an implication test based upon the Lerman statistic [Lerman, 1981],
* and the gap-test based upon the Chi-square statistic (new).

e.2. Rules bases manipulation
Rule bases may evolve by the fusion of two or more bases; the user may also
input rules manually and evaluate them by means of the data set.

e.3. Selection of the best rules by validation
When applying a rule base to a test or generalisation sample, the selection
criterion for competing rules may be altered (an individual may match two
rules with different conclusions; this is most likely after a merge of rule
bases; see the sketch after this list). The criteria are:
* minimisation of the error rate,
* maximisation of a rule's number of individuals,
* maximisation of the Goodman index [1988],
* maximisation of the intensity of implication,
* new strategies such as bagging [Breiman 1996] (new).
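
A hypothetical sketch of such conflict resolution follows; the rule
representation, the numbers, and the two criteria implemented are invented
here for illustration and are not SIPINA_W's internals.

# Two firing rules with different conclusions, resolved by a chosen
# criterion (minimum error rate vs. maximum coverage). Toy values.
rules = [
    {"when": lambda x: x["spend"] > 75, "conclude": "switch",
     "error_rate": 0.10, "coverage": 120},
    {"when": lambda x: x["plan"] == "basic", "conclude": "stay",
     "error_rate": 0.25, "coverage": 300},
]

def classify(case, criterion="error"):
    firing = [r for r in rules if r["when"](case)]
    if not firing:
        return None
    if criterion == "error":   # minimise the error rate
        best = min(firing, key=lambda r: r["error_rate"])
    else:                      # maximise the number of individuals covered
        best = max(firing, key=lambda r: r["coverage"])
    return best["conclude"]

case = {"spend": 80, "plan": "basic"}   # both rules fire
print(classify(case, "error"), classify(case, "coverage"))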

e.4. Optimisation and Simplification.
The consequent rules of an induction graph may be optimised and simplified. The
applicable methods are:
* detection and elimination of recurring premises,
* use of a symbolic algorithm exploring the whole description domain,
* Quinlan's [1987] algorithm: a hill-climbing search for the minimum
pessimistic error rate.

f - Technical Limitations
---------------------
The theoretical capacities of the software are:
* 16,384 attributes
* 2^32 - 1 cases
In practice, the limitations are those of your computer.

g- Status
------
Shareware.

h - How can you get SIPINA_W v1.3?
-------------------------------
You can obtain SIPINA by anonymous ftp from:
eric.univ-lyon2.fr
/pub/sipina

i - Installation
------------
To install this version, download LESIPINA.EXE, a self-extracting file.
Copy it into a temporary directory and execute it. The installation program
is SETUP.EXE. Click OK when the software asks for another disk.

j - Updated by
----------
Ricco Rakotomalala, 6 March 1996 (rakotoma@univ-lyon2.fr)

k - Contact
-------
Prof. D.A. Zighed,
Organisation: University of Lyon2,
e-mail : zighed@univ-lyon2.fr,
Tel.: (33) 78 77 23 76,
Fax.: (33) 78 77 23 75,
Address: E.R.I.C._Lyon bat. L
5 av. Pierre Mendes-France
69676 Bron Cedex
France.
WEB : http://eric.univ-lyon2.fr/eric.html

>~~~Positions:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Wed, 29 May 1996 22:35:56 -0500
From: mike@illustra.com (Mike Stonebraker)
To: dbworld@ricotta.cs.wisc.edu
Subject: (DBWORLD) Informix opportunities

Informix Corporation has appointed me the Chief Technology
Officer for the company and charged me with creating advanced
technology prototypes and products that will ensure the
competitiveness of the company through the next decade.

To achieve this goal, I have between 10 and 20 open positions
for computer scientists who want a low-to-moderate research,
moderate-to-big development position investigating leading edge
ideas and their application to the commercial DBMS marketplace.

I am looking for professionals with interests in many areas of
data base systems including data mining, visualization, advanced 4GLs,
data base design, storage management, rules systems, distributed
data bases, high transaction rate systems, and novel application areas.

If you are highly motivated, very bright and interested in turning ideas
into real code, then send me your resume, either electronically to
mike@informix.com or via paper to

Michael Stonebraker
Chief Technology Officer
Informix Corporation
4100 Bohannon Dr.
Menlo Park, Ca. 94025

Positions can be in Oakland, Ca., Menlo Park, Ca. or Portland, Or.


>~~~Meetings:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 30 May 96 09:40:04 PDT
From: 'Sarabjot S. Anand' (cbbr23@iserve2.infj.ulst.ac.uk)
Subject: RE: Workshop on Data Mining at Basel, Oct 1996

First International Conference on Practical Aspects of
Knowledge Management
October 30 - 31, 1996
Basel, Switzerland

Call for Papers

Workshop on Data Mining in Real World Databases
http://iserve1.infj.ulst.ac.uk:8080/pakm96.html

Data stored by organisations is the greatest untapped resource that an
organisation possesses today. Hidden in the data is knowledge vital for
businesses to gain a competitive edge and survive in the present
competitive business environment. Knowledge extracted from the data may
be used as the basis of a knowledge-based application for some business
process, thus reducing the role of the expert in the knowledge-based
system's development cycle, widely recognised as its biggest bottleneck.

Data Mining converts the data acquired by organisations into knowledge or
information that can be utilised by them to reduce costs and increase
profits. The types of businesses with a need for Data Mining solutions range
from financial organisations like banks and building societies to retail
super-stores to highly complex technological component manufacturers.
Data Mining has been used in directed sales campaigns, credit
assessment, stock market prediction, hazard forecasting, organisational
restructuring, fraud detection, evidence-based medicine, shopping basket
analysis, and fault diagnosis in manufacturing, among other applications.

The objective of the workshop is to address practical problems faced by
organisations when they are considering a Data Mining solution. The hope
is that the workshop will provide guidance to managers in organisations
considering a Data Mining solution in aspects such as:
- What does Data Mining deliver that interactive querying does not?
- How does Data Mining affect the rest of the computerised business
solutions?
- Are there hidden costs in buying a Data Mining system?
- Does the underlying technology really make a difference to the end user?
- Is there any particular methodology that can be followed to ensure a
successful Data Mining solution?
- What factors need to be considered when acquiring a Data Mining solution?
- What can Data Mining provide that cannot be had from traditional
statistical techniques?
- Can legacy and unstructured databases be mined?

Paper Submission:
The main theme of the workshop is application-oriented. Therefore, the
workshop organisers would like to see high-quality application papers
submitted to the workshop. All submitted papers will be reviewed with a
view to ensuring quality and relevance to the theme of the workshop.

The workshop papers will be compiled into informal proceedings
distributed at the conference. A published version may be produced later.
Based on the quality of papers presented at the workshop, the workshop
organisers may undertake the publication of the presented papers in a
separate book or journal.

Please submit 3 copies of a short paper (maximum of 10 pages in 12pt.
font) providing clear indications of the application area, the real-world
problem addressed, and the overall success of the Data Mining application
to:
Sarabjot S. Anand
Northern Ireland Knowledge Engineering Laboratory
Faculty of Informatics,
University of Ulster,
Newtownabbey,
Northern Ireland BT37 0QB (UK)
E-mail: ss.anand@ulst.ac.uk
Tel: 44 1232 366671
Fax: 44 1232 366068

Important Dates
20 June, 1996: Deadline for Paper Submission
21 July, 1996: Acceptance notification
31 August, 1996: Deadline for final versions of papers

Participation
Besides people presenting a paper, the workshop is open to
practitioners interested in concretely applying Data Mining. Workshop
participants presenting a paper will, however, qualify for a reduced
conference fee. Refer to the main conference's general information below
for participation details.

Demos
Software demos related to the workshop topics (but not necessarily to a
particular paper) are encouraged. Conference organizers will provide a
room where participants can give demos of their systems during lunch
breaks or at other times. Lunch and exhibition / demos will be in the same
or in adjacent rooms.

Workshop Organisers
Professor David A. Bell (Faculty of Informatics, University of Ulster)
Professor John G. Hughes (Faculty of Informatics, University of Ulster)
Sarabjot S. Anand (Northern Ireland Knowledge Engineering Laboratory,
University of Ulster)

-------------------------------------
Name: Sarabjot S. Anand
Affiliation: Northern Ireland Knowledge Engineering Laboratory
School of Information and Software Engineering,
University of Ulster, Shore Road, Co. Antrim
Northern Ireland BT37 0QB
Tel: 44 1232 366671
E-mail: ss.anand@ulst.ac.uk (Sarabjot S. Anand)
WWW Page: http://www.infc.ulst.ac.uk/staff/ss.anand
Date: 05/30/96
Time: 09:40:04
-------------------------------------


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~