News:
* GPS, new Siftware, Companies, and References in KD Mine
  info.gte.com/~kdd/what-is-new.html
* T. Fawcett, Learning with skewed class distributions -- summary
Publications:
* GPS, IEEE TKDE Special Issue on Data Mining, December 1996
--
Nuggets is a newsletter for the Knowledge Discovery in Databases (KDD)
community, focusing on the latest research and applications.
Submissions are most welcome and should be emailed,
with a DESCRIPTIVE subject line (and a URL, when available), to kdd@gte.com.
To subscribe, email kdd-request@gte.com a message with
subscribe kdd-nuggets
in the first line (the rest of the message and subject are ignored).
Nuggets appears approximately 3 times a month.
Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
and a wealth of other information on Data Mining and Knowledge Discovery
are available at the Knowledge Discovery Mine site.
********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories) *
*****************************************************************************
~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ring out the old, ring in the new,
Ring out the false, ring in the true
-- Tennyson (thanks to Brij Masand)
Date: Thu, 19 Dec 1996 15:33:05 -0500
From: gps@gte.com
(Gregory Piatetsky-Shapiro)
Subject: new Siftware, Companies, and References in KD Mine
Recent additions in KD Mine --
new Siftware, Companies, and References.
(see
info.gte.com/~kdd/what-is-new.html for all HTML links)
Siftware:
Classification:
Multiple approaches:
* updated entry for
SIPINA-W version 2.0, a shareware tool for Knowledge Discovery
in Databases. Version 2.0 contains several methods: CART,
ID3, C4.5, ELISEE, Chi2Aid, SIPINA, ...
Decision-tree Approach:
* link to
Business Miner, a desktop data mining tool that helps
non-technical business users automatically find previously
undetected relationships in their business data.
* link to
Preclass, a tool for building classification trees from
large data sources.
Neural network approach:
* entry for ModelQuest, statistical/neural network tool,
replacing an entry for AIM -- a discontinued system from Abtech.
* entry for NeuralWorks Predict, a complete application development
environment for creating and deploying real-time applications for
forecasting, modeling and classification.
* entry for NeuralWorks Professional II/PLUS, a
comprehensive multi-paradigm neural network package.
Dependency Derivation:
* link to
AT-SIGMA Data Chopper, a tool
which attempts to discover relationships between database fields.
New Companies:
Data Mining Consulting
Cirrus Recognition,
developers of neural-networks based CirrusNet pattern recognition
technology
Knowledge Technologies, a group within Andersen Consulting,
focusing on data mining, knowledge discovery, and expert systems
Software and Services,
TRIADA, developers of Ngram
technology for compacting and quickly querying databases
Abtech Corporation, developers of
ModelQuest statistical/neural network tool.
ALTA Analytics, developers of NETMAP, an advanced visualisation
tool for knowledge discovery.
New References:
Divided References page into sections:
Definitions | Books | Business | Research | Journals
In Definitions, added (thanks to Thierry Van de Merckt)
(I included the following posting from ML-list Vol. 8, No. 20 since it
is very relevant to many problems in Data Mining. In our experience,
most practical work for learning from data where one of the target
classes is very infrequent involves building a training set where
class distributions are approximately equal. GPS)
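The balanced-training-set practice described above can be sketched as a simple undersampling routine (an illustrative sketch only; the function and names are mine, not from any of the systems discussed below):

```python
import random

def balance_by_undersampling(data, label_of, seed=0):
    """Build a training set with approximately equal class distributions
    by randomly undersampling every class down to the size of the rarest
    class. `data` is a list of examples; `label_of` maps an example to
    its class label."""
    rng = random.Random(seed)
    by_class = {}
    for x in data:
        by_class.setdefault(label_of(x), []).append(x)
    n = min(len(xs) for xs in by_class.values())  # size of the rarest class
    balanced = []
    for xs in by_class.values():
        balanced.extend(rng.sample(xs, n))
    rng.shuffle(balanced)
    return balanced

# 100 negatives and 5 positives become 5 of each:
skewed = [(i, 0) for i in range(100)] + [(i, 1) for i in range(5)]
train = balance_by_undersampling(skewed, label_of=lambda x: x[1])
```

The price of this approach is discarding most of the majority-class data; several responses below describe alternatives (cost matrices, bagging over balanced samples) that avoid that loss.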
Date: Wed, 4 Dec 96 09:22:49 EST
From: Tom Fawcett (fawcett@nynexst.com)
Subject: Learning with skewed class distributions -- summary of responses
Previously on the ML List I asked about learning with skewed distributions,
where one class is more prevalent than another in an instance population.
I received many responses, mostly from people who are facing this problem
too and wanted copies of any responses I got. Below is a summary. I'm
surprised more work hasn't been published on this problem considering its
prevalence in real world domains.
-Tom
===== From Johann Petrak (johann@ai.univie.ac.at):
> The CART book has a very short chapter on class priors and draws a
> connection to class misclassification costs (if we are concerned about
> misclassifying a certain portion of a rare class, we can also try to deal
> with that by giving that class a higher misclassification cost).
>
> Another paper that comes to my mind is (Lewis & Catlett): the main topic
> of the paper is not dealing with skewed class distributions, but it
> presents a simple approach of 'tweaking' C4.5 when one class is more
> frequent than the other in a two class problem. The method again is based
> on misclassification cost and uses a single parameter called 'loss ratio'.
>
> CART book:
> Breiman L., Friedman J.H., Olshen, R.A., Stone C.J.:
> Classification and Regression Trees, Wadsworth, Belmont, 1984
> p. 112
>
> Paper:
> Lewis D.D., Catlett J.: Heterogeneous Uncertainty Sampling
> for Supervised Learning. Proc. 11th Intl.Conf.on Machine Learning,
> pp.148-156, 1994
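Both the CART chapter and the 'loss ratio' idea amount to moving the decision threshold away from 0.5 according to misclassification costs. A minimal sketch of that effect (illustrative only; the actual C4.5 tweak works inside the learner, not as a post-hoc threshold):

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Bayes-optimal probability threshold for predicting the rare
    (positive) class: predict positive when P(pos|x) exceeds
    cost_fp / (cost_fp + cost_fn). Raising the false-negative cost
    lowers the threshold, so the rare class is predicted more readily --
    the same effect a 'loss ratio' parameter has."""
    return cost_fp / (cost_fp + cost_fn)

def predict(p_pos, cost_fp=1.0, cost_fn=10.0):
    """Predict 1 (rare class) if the estimated probability clears the
    cost-derived threshold; with equal costs this reduces to 0.5."""
    return 1 if p_pos > cost_sensitive_threshold(cost_fp, cost_fn) else 0
```

With a 10:1 cost ratio the threshold drops to 1/11, so even a weakly indicated rare case gets flagged.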
===== From Jian Zhang (jian@cs.uregina.ca):
> In my Msc thesis (Automatic Learning of English Pronunciation Rules, 1995),
> I have done some similar work to analyze the importance of different
> attributes. I used three methods: reversed-left-skewed, left-skewed, and
> right-skewed with Iterated Version Space Algorithm. I have tested these
> methods on the NETalk Corpus (20,000 words). The results are shown on
> page 80-82 (Section 7.2.3) of my Msc. thesis. The thesis can be obtained
> from my home page (url:
===== From Joe McCarthy (jmccarth@stimpy.cs.umass.edu):
> Skewed distributions are pretty common in NLP problems. For coreference
> resolution, my data sets were typically about 20-25% positive instances.
> Stephen Soderland has seen greater skewing in his work (though his
> current focus is on covering algorithms). There is a special interest
> group on Natural Language Learning:
>
>
>
> I suspect you may find many other examples of such skewed
> distributions via that web page. However, I don't know how many NLP
> folks have developed ways of systematically dealing with such
> skewedness (although if you come up with solutions, I'm sure NLP folks
> would be very interested).
===== From Peter Turney (peter@ai.iit.nrc.ca):
> I've been working with very skewed class distributions. I'm working
> with a dataset that has about 300,000 samples of class 0 and 300
> samples of class 1. I haven't published any papers about it yet.
> I've found the following strategy works well:
>
> TRAIN/TEST DIVISION:
> - training data 150,000 class 0, 150 class 1
> - testing data 150,000 class 0, 150 class 1
>
> TRAINING:
> - use C4.5 to generate 50 decision trees
> - use bagging strategy
> - repeat 50 times:
> - randomly sample training data, sampling with
> replacement
> - gather 1000 samples of class 0 and 1000 samples
> of class 1
> - run C4.5 with options:
> -p = use soft-thresholds
> -c100 = minimal pruning
> -m1 = allow a branch to contain as
> few as 1 object
> - the justification for '-c100' and '-m1' is that
> bagging with 50 trees is already reducing variance,
> so it is not necessary to inject bias in order to
> reduce variance
>
> TESTING:
> - for each case, use 50 soft-threshold decision trees
> to calculate average probability of membership in class 1
> - rank order entire testing set by calculated probability
> - guess that top 150 ranked cases are in class 1, rest are
> in class 0
>
>
> Obviously accuracy is the wrong thing to measure when class distribution
> is highly skewed. With my data, guessing the most common class yields
> an accuracy of more than 99%. I considered using a cost matrix, but
> I have settled on using the geometric mean of precision and recall.
> See below.
[Turney included a forwarded e-mail msg which is too long to include here.]
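Turney's ranking step and his precision/recall metric can be sketched as follows (a hypothetical stand-in: the 50 C4.5 soft-threshold trees are replaced by arbitrary per-model probability lists):

```python
import math

def rank_and_label(scores_per_model, n_positives):
    """Average each test case's class-1 probability over an ensemble,
    rank the whole test set by that average, and label the top
    n_positives cases as class 1 (Turney's ranking step)."""
    n_cases = len(scores_per_model[0])
    avg = [sum(m[i] for m in scores_per_model) / len(scores_per_model)
           for i in range(n_cases)]
    order = sorted(range(n_cases), key=lambda i: avg[i], reverse=True)
    labels = [0] * n_cases
    for i in order[:n_positives]:
        labels[i] = 1
    return labels

def g_mean_precision_recall(predicted, actual):
    """Geometric mean of precision and recall -- a skew-robust
    alternative to accuracy."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    if tp == 0:
        return 0.0
    return math.sqrt((tp / (tp + fp)) * (tp / (tp + fn)))
```

Note that guessing all class 0 scores over 99% accuracy on Turney's data but a geometric mean of exactly zero, which is why the metric is attractive here.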
===== From David Aha (aha@AIC.NRL.Navy.Mil):
> As one example, we designed IB3 (Aha/Kibler IJCAI-89) to be sensitive
> to skewed distributions. It's an incremental learner that tries to
> distinguish instances based on whether they have 'good' or 'bad'
> predictive accuracy on subsequently presented instances. Effectively,
> it compares two probabilities to do this. For each instance, we
> computed:
>
> 1. p(in-same-class | is-similar) ; accuracy on similar instances
> 2. p(in-same-class) ; concept frequency
>
> If (1) was significantly higher than (2), then the instance was deemed
> as 'good', and it was used in classification attempts. If (2) was
> significantly higher than (1), then the instance would be deleted from
> among those stored. If neither, then it would remain stored, but not
> used for prediction attempts.
>
> I wouldn't go so far as to say that IB3, or its extensions by me or
> other folks, is 'effective'. It's limited in many ways - not what
> you'd want in a commercial tool. However, this method of comparing
> accuracy with concept frequency *does* make it sensitive to skewed
> (concept) distributions...and just using accuracy alone fails
> miserably on those distributions.
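The accept/drop test Aha describes compares two proportions via confidence intervals. A rough sketch (the interval formula and z values here are illustrative stand-ins, not IB3's exact test):

```python
import math

def wilson_interval(successes, trials, z):
    """Confidence interval for a proportion (Wilson score form, used
    here as a stand-in for IB3's interval formula)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials
                                   + z * z / (4 * trials * trials))
    return (center - half, center + half)

def ib3_status(hits, tries, class_count, total_count,
               z_accept=0.9, z_drop=0.7):
    """Label a stored instance in IB3's spirit: 'good' if its accuracy
    interval lies wholly above the concept-frequency interval, 'bad' if
    wholly below it (at a looser confidence), else 'undecided' -- kept
    in memory but not used for prediction."""
    acc_lo, _ = wilson_interval(hits, tries, z_accept)
    _, frq_hi = wilson_interval(class_count, total_count, z_accept)
    if acc_lo > frq_hi:
        return 'good'
    _, acc_hi = wilson_interval(hits, tries, z_drop)
    frq_lo, _ = wilson_interval(class_count, total_count, z_drop)
    if acc_hi < frq_lo:
        return 'bad'
    return 'undecided'
```

The key point survives the simplification: an instance is judged against its concept's base rate, not against a fixed accuracy bar, which is what makes the test sensitive to skewed distributions.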
===== From Piew Datta (pdatta@abp.ICS.UCI.EDU)
> Dennis Kibler and I have been looking at a similar problem. The skewed
> class distribution seems to occur in medical databases. There are always
> far fewer controls, and they don't seem like a 'good' sample of the
> population. We are looking at methods for learning from a single class,
> that is, learning from positive examples only. Our approach is to use
> methods that use a nearest neighbor approach for classification and to
> learn thresholds for instances being classified as a member of the
> positive class. We have looked at modifications of nearest neighbor,
> naive bayes, and prototype algorithms. I have written a very rough short
> paper (unpublished) on our research and if you are interested I can email
> it to you.
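One plausible reading of learning a threshold from positives only is a one-class nearest-neighbor rule with a radius fitted on the positive class (illustrative only; the actual Datta/Kibler method is in their unpublished paper):

```python
import math

def fit_distance_threshold(positives, quantile=0.95):
    """Fit a radius from positive examples only: take each example's
    distance to its nearest other positive, and use a high quantile of
    those distances as the classification threshold."""
    dists = []
    for i, a in enumerate(positives):
        nearest = min(math.dist(a, b)
                      for j, b in enumerate(positives) if j != i)
        dists.append(nearest)
    dists.sort()
    return dists[min(int(quantile * len(dists)), len(dists) - 1)]

def is_positive(x, positives, threshold):
    """Classify x as positive if it falls within the learned radius of
    some positive example (one-class nearest neighbor)."""
    return min(math.dist(x, p) for p in positives) <= threshold
```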
Date: Thu, 26 Dec 1996 11:00:36 -0500
From: Gregory Piatetsky-Shapiro (gps0@gte.com)
Subject: IEEE TKDE Special Issue on Data Mining - December 1996
IEEE Transactions on Knowledge and Data Engineering - December 1996
Vol. 8, No. 6, December 1996
SPECIAL SECTION ON MINING OF DATABASES
866 Data Mining: An Overview from a Database Perspective
Ming-Syan Chen, Jiawei Han, and Philip S. Yu
884 Finding Aggregate Proximity Relationships and Commonalities in
Spatial Data Mining, Edwin M. Knorr and Raymond T. Ng
898 A Metapattern-Based Automated Discovery Loop for Integrated Data
Mining -- Unsupervised Learning of Relational Patterns,
Wei-Min Shen and Bing Leng
911 Efficient Mining of Association Rules in Distributed Databases,
David W. Cheung, Vincent T. Ng, Ada W. Fu, and Yongjian Fu
923 Visualization Techniques for Mining Large Databases: A Comparison,
Daniel A. Keim and Hans-Peter Kriegel
Concise Papers in the Special Section
939 Extraction and Applications of Statistical Relationships in
Relational Databases, Wen-Chi Hou
946 An Efficient Inductive Learning Method for Object-Oriented Database
Using Attribute Entropy, Yueh-Min Huang and Shian-Hua Lin
952 Knowledge Discovery in Deductive Databases with Large Deduction
Results: The First Step, Chien-Le Goh, Masahiko Tsukamoto, and
Shojiro Nishio
957 Effective Data Mining Using Neural Networks,
Hongjun Lu, Rudy Setiono, and Huan Liu
962 Parallel Mining of Association Rules,
Rakesh Agrawal and John C. Shafer
970 What Makes Patterns Interesting in Knowledge Discovery Systems,
Avi Silberschatz and Alexander Tuzhilin
Date: Fri, 20 Dec 96 14:17:29 CST
To: kdd@gte.com
From: 'Gerry McKiernan' (JL.GJM@ISUMVS.IASTATE.EDU)
Subject: Four-T-Nine-R(sm): Data Mining in Web and non-Web Databases
Content-Length: 1099
Four-T-Nine-R(sm)
Data Mining of Web and non-Web Bibliographic
Databases
For a planned review and clearinghouse, I am interested in
learning of projects, research, products and services that have
applied Data Mining technologies to Web or non-Web _bibliographic_
databases or datasets. I am particularly interested in the
application to MARC data records.
At this time I am _not_ interested in other applications of
Data Mining.
There are a number of excellent Web resources available
to those who desire additional information about Data Mining.
One of the best is KD Mine at URL:
As always, any and all suggestions, reactions or critiques
are most welcome.
Regards,
Gerry McKiernan
Curator, CyberStacks(sm)
Iowa State University
Ames IA 50011
gerrymck@iastate.edu
'I Know It's In There Somewhere'
Sender: juffi@mail4.ai.univie.ac.at
Date: Fri, 27 Dec 1996 16:28:14 +0100
From: Johannes Fuernkranz (juffi@ai.univie.ac.at)
Organization: OFAI
To: ml@ics.uci.edu,
kdd@gte.com,
Ilpnet@ijs.si
Subject: CFP: AAI SI on ILP for KDD
Content-Length: 3789
1st Call For Papers
Applied Artificial Intelligence
Special issue on
First-Order Knowledge Discovery in Databases
Knowledge Discovery in Databases (KDD) is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately understandable
patterns in data (Fayyad, Piatetsky-Shapiro & Smyth, 1996). Machine Learning
algorithms form the core of many KDD systems and applications. However,
standard inductive learning techniques are constrained to processing a
single relational table, whereas many real-world databases are structured
into several tables containing interrelated information. Learning algorithms
that are able to use representations in first-order logic, in particular
Inductive Logic Programming (ILP) algorithms, explicitly aim at exploiting
structured information. Thus KDD is a fruitful research and application area
for ILP.
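The single-table constraint is easy to see in code: a propositional learner needs one-to-many relations flattened into fixed columns first, losing structure an ILP system could express directly. A toy sketch (both tables and the aggregates are invented for illustration):

```python
# Hypothetical two-table database: customers and a one-to-many
# purchases relation.
customers = [("ann", 34), ("bob", 51)]
purchases = [("ann", "book", 12.0), ("ann", "cd", 9.0),
             ("bob", "book", 15.0)]

def propositionalize(customers, purchases):
    """Flatten the purchases relation into fixed aggregate columns so a
    single-table learner can use it. The aggregates discard structure a
    first-order learner could keep, e.g. 'bought two items of different
    kinds'."""
    rows = []
    for name, age in customers:
        mine = [p for p in purchases if p[0] == name]
        rows.append({"name": name, "age": age,
                     "n_purchases": len(mine),
                     "total_spend": sum(p[2] for p in mine)})
    return rows

flat = propositionalize(customers, purchases)
```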
A recent MLnet Workshop, held at the ICML-96, focussed on a discussion of
the potential contribution of ILP for KDD. Information on the workshop
including a short summary and all accepted papers can be found at
The general conclusion was that ILP can
be a valuable tool for data mining, its main advantages being the
expressiveness of first-order logic as a representation language and the
ability of many ILP systems to use strong language biases for restricting
the huge search space. ILP has a high flexibility in incorporating various
forms of background knowledge, which can be invaluable for large KDD tasks.
The special issue on 'First-Order Knowledge Discovery in Databases' of the
Applied Artificial Intelligence Journal will thus welcome papers that focus
on one or more of the following topics:
* Embedding ILP into the KDD process
* Necessary pre- and post-processing steps for real-world applications
* Interfacing ILP systems with database managers
* Scalability of ILP for real-world databases
* Criteria for quantifying the complexity of ILP problems
* Evaluation of gain and price of ILP versus propositional learning
* Non-classification learning and discovery in a first-order framework
* Benefits of using background knowledge and/or strong explicit biases
* Innovative real-world applications of ILP
Papers on related subjects are also welcome, but a strong focus on
applications and database issues is required for all submissions.
Submissions
Papers should be prepared according to usual standards for journal
submissions. The approximate length of a manuscript should be between 8,000
and 10,000 words. Final manuscripts of accepted papers will have to be
formatted according to the Instructions to Authors, which can be found in
all issues of the journal.
Authors have to submit four copies of their manuscripts to
* Johannes Fuernkranz
* Austrian Research Institute for Artificial Intelligence
* Schottengasse 3
* A-1010 Vienna
* AUSTRIA
* juffi@ai.univie.ac.at
or
* Bernhard Pfahringer
* Department of Computer Science
* University of Waikato
* Hamilton
* NEW ZEALAND
* bernhard@cs.waikato.ac.nz
whichever is more convenient.
Submission Deadline: April 30, 1997
Each paper submitted for publication will be judged by its originality,
adequacy of method, significance of findings, and relevance to the special
issue's subject matter. It should be as concise as possible, yet
sufficiently detailed to permit critical review. Each manuscript must be
accompanied by a statement that it has not been submitted or published
elsewhere. The authors of accepted papers will be asked to transfer the
copyright to the publisher.
Date: Thu, 19 Dec 1996 17:47:59 -0800
From: Ellen Grace Henson (eghenson@heron.engr.sgi.com)
To: ml@ics.uci.edu, ai-stats@watstat.uwaterloo.ca, kdd@gte.com
Subject: Silicon Graphics MineSet 1.1
Content-Length: 1392
The December 1996 'Industry in Focus' issue of 'Database Programming & Design'
cites Silicon Graphics as one of a 'Dozen Companies on the Rise' in part
because 'SGI is showing the industry an exciting new way
to experience data analysis.'
MineSet version 1.1 is the second release of Silicon Graphics' product
for data mining and exploratory data analysis. MineSet provides
an integrated environment with the following features.
- Data access: interfaces to commercial databases or flat files.
- Transformations: binning/discretization, defining new columns,
aggregations.
- Data mining algorithms: decision trees, evidence (simple Bayes),
associations, attribute importance.
- Visualization: either direct visualization of data as multidimensional
hierarchies, scatterplots, geographical information/maps, or visualizations
of mining results including 3D fly-throughs over decision trees
and evidence visualization.
MineSet enables you to gain a deeper, intuitive understanding of your
data by helping you to discover hidden patterns, important trends and
new knowledge. MineSet runs on any SGI platform; to handle large datasets,
we recommend you use Silicon Graphics' servers: Origin and Challenge.
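As one concrete example of the transformations listed above, equal-width binning can be sketched in a few lines (a generic sketch, not MineSet's actual implementation):

```python
def equal_width_bins(values, n_bins):
    """Discretize a numeric column into n_bins equal-width bins,
    returning a bin index (0 .. n_bins-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width on constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```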
Date: Thu, 19 Dec 1996 19:37:27 +0001
From: NADA LAVRAC (Nada.Lavrac@ijs.si)
Subject: CFP for ILP-97, Prague
Content-Length: 2912
ILP-97
The Seventh International Workshop on
Inductive Logic Programming
17-19 September 1997, Prague, Czech Republic
General Information:
ILP-97 is the seventh in a series of international workshops on
Inductive Logic Programming. ILP-97 will be preceded by a two-day
tutorial on ILP for KDD (15-17 September 1997) and followed by a
one-day meeting (20 September 1997) of the area 'Computational Logic
and Machine Learning' of the European Network of Excellence on
Computational Logic (COMPULOG). The Proceedings of ILP-97 will be
published by Springer.
Program:
The scientific program will include invited talks by Usama Fayyad and
Georg Gottlob, and presentations of accepted papers. Submissions are
invited that describe theoretical, empirical and applied research in
all areas of ILP. This includes, for example, results concerning
logical settings for ILP, learning in the context of higher-order
logics and constraint logic programming, as well as ILP systems that
use probabilistic techniques and heuristics. Contributions that
describe the use of ILP approaches in areas such as natural language
processing, knowledge discovery in databases, intelligent agents,
information retrieval, etc. are encouraged. Submissions describing
applications of ILP methods to real-world problems are especially welcome.
Program Chairs:
Nada Lavrac and Saso Dzeroski
J. Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
Email: Nada.Lavrac@ijs.si,
Saso.Dzeroski@ijs.si
Phone: +386 61 177 3272 (Nada) or +386 61 177 3217 (Saso)
Fax: +386 61 125 1038 or +386 61 219 385
Program Committee:
F. Bergadano (Italy) H. Bostrom (Sweden) I. Bratko (Slovenia)
W. Cohen (USA) L. De Raedt (Belgium) P. Flach (Netherlands)
S. Matwin (Canada) S. Muggleton (UK) M. Numao (Japan)
D. Page (UK) C. Rouveirol (France) C. Sammut (Australia)
M. Sebag (France) A. Srinivasan (UK) S. Wrobel (Germany)
Submit full papers of max. 5000 words. Submissions should be sent to
the Program Chairs in 5 copies. A title page (including title,
authors, contact author's full address, email, phone, fax, as well as
abstract) must be sent by email to ilp97@ijs.si.
Submission deadline: 31 March 1997
Notification of acceptance: 31 May 1997
Camera ready copy: 16 June 1997
Workshop: 17-19 September 1997
From: Trevor Hastie (trevor@Playfair.Stanford.EDU)
Subject: Modern Regression and Classification course - Hawaii
To: kdd@gte.com
Date: Thu, 19 Dec 1996 16:42:22 -0800 (PST)
Content-Length: 5115
************* 1997 Course Announcement *********
MODERN REGRESSION AND CLASSIFICATION
Waikiki, Hawaii: February 17-18, 1997
*************************************************
A two-day course on widely applicable statistical methods for
modeling and prediction, featuring
Professor Trevor Hastie (Stanford University) and
Professor Robert Tibshirani (University of Toronto)
This course was offered and enthusiastically attended at five
different locations in the USA in 1996.
This two day course covers modern tools for statistical prediction and
classification. We start from square one, with a review of linear
techniques for regression and classification, and then take attendees
through a tour of:
o Flexible regression techniques
o Classification and regression trees
o Neural networks
o Projection pursuit regression
o Nearest Neighbor methods
o Learning vector quantization
o Wavelets
o Bootstrap and cross-validation
We will also illustrate software tools for implementing the methods.
Our objective is to provide attendees with the background and
knowledge necessary to apply these modern tools to solve their own
real-world problems. The course is geared for:
o Statisticians
o Financial analysts
o Industrial managers
o Medical and Quantitative researchers
o Scientists
o others interested in prediction and classification
Attendees should have an undergraduate degree in a quantitative
field, or have knowledge and experience working in such a field.
PRICE: $750 per attendee if received by January 15, 1997. Full time
registered students receive a 40% discount. Attendance is limited to
the first 60 applicants, so sign up soon! These courses fill up
quickly.
TO REGISTER: Fill in and return the form appended.
For more details on the course and the instructors:
__________________________________________ _______________
Credit card # (if payment by credit card) Expiration Date
(Lunch preference - tick as appropriate):
___ Vegetarian ___ Non-Vegetarian
Fee payment can be made by MONEY ORDER , PERSONAL CHECK, or CREDIT CARD
(Mastercard or Visa.) For checks and money orders: all amounts are given in
US dollar figures. Make fee payable to Prof. T. Hastie. Mail it, together
with this completed Registration Form to:
Prof. T. Hastie
538 Campus Drive
Stanford
CA 94305
USA
For payment by credit card, include credit card details above, and mail to
above address, or else FAX form to 415-326-0854
For further information, contact:
Trevor Hastie
Stanford University
Tel. or FAX: 415-326-0854
e-mail: trevor@stat.stanford.edu
Standard Registration: U.S. $750 ($950 after Jan 15, 1997)
Student Registration: U.S. $450 ($530 after Jan 15, 1997)
Student registrations - include copy of student ID.
- Cancellation policy: No fee if cancellation before Jan 15, 1997.
- Cancellation fee after January 15 but before Feb 12, 1997: $100.
- Refund at discretion of organizers if cancellation after Feb 12, 1997.
- Registration fee includes course materials, coffee breaks, and lunches
- On-site Registration is possible if course is not fully booked, at late
fee.
Applications for entrance into the Computational Finance MS programs
for Fall Quarter 1997 are currently being considered. The deadlines
for receipt of applications are:
January 15 (Early Decision Deadline, decisions by February 15)
March 15 (Final Deadline, decisions by April 15)