Knowledge Discovery Nuggets 97:14, e-mailed 97-04-23

KDD Nuggets Index

To KD Mine: main site for Data Mining and Knowledge Discovery.

To subscribe to KDD Nuggets, email to kdd-request

Past Issues: 1997, 1996, 1995, 1994, 1993

Knowledge Discovery Nuggets 97:14, e-mailed 97-04-23

News:
* E. Bertino, Query: data mining from wafers manufacturing process ?
Publications:
* M. Ramoni, Technical Reports on Bayesian Knowledge Discovery,

http://kmi.open.ac.uk/~marco/projects/kdd

* Tom Mitchell, Text book for Data Mining: Machine Learning

http://www.cs.cmu.edu/~tom/mlbook.html

Siftware:
* R. Quinlan, Windows Version of C5.0 ('See5') Available Now

http://www.rulequest.com

* Stanley Rice, Postcoordinate Software

http://www.cruzio.com/~autospec/darwin.htm

* Pamela Lerwick, IDIS Special Release

http://www.datamining.com

Positions:
* R. King, Ph.D. Studentships in Data Mining at University of Wales, UK
* Fred J. Damerau, Research Associate in Text Mining/Information
Extraction
--
Data Mining and Knowledge Discovery community, focusing on the
latest research and applications.

Submissions are most welcome and should be emailed, with a
DESCRIPTIVE subject line (and a URL) to gps.
Please keep CFP and meetings announcements short and provide
a URL for details.

To subscribe, see

http://www.kdnuggets.com/subscribe.html

KD Nuggets frequency is 3-4 times a month.
Back issues of KD Nuggets, a catalog of data mining tools
('Siftware'), pointers to Data Mining Companies, Relevant Websites,
Meetings, and more is available at Knowledge Discovery Mine site
at

http://www.kdnuggets.com/

-- Gregory Piatetsky-Shapiro (editor)
gps

********************* Official disclaimer ***************************
All opinions expressed herein are those of the contributors and not
necessarily of their respective employers (or of KD Nuggets)
*********************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Restlessness and discontent are the necessities of progress.
--Thomas A. Edison

Previous 1 Next Top

From: bertino@dsi.unimi.it
Date: Thu, 17 Apr 1997 09:44:45 +0200 (METDST)
Subject: data mining from wafers manufacturing process

At our University, we are starting an application project
dealing with data from a wafers manifacturing process.
We are thinking to use data mining techniques
for try to address the following problem.
Some of those wafers are faulty. There is a database keeping track
of the entire manifacturing process for each wafer and collecting
large amount of data concerning each step of the manifacturing
process (there are about 300 steps; each step is characterized
about 100 parameters). Our problem is use data mining techniques
in helping the diagnosis, that is, to see which step
may have caused the problem.

I was wondering whether you are aware of any use of data mining
techniques for similar problems. We have also to acquire
some suitable data mining tools.

I would appreciate any suggestion you may give me on this
issue.

Best regards Elisa
-------------------------------------------------------------------------------
Prof. Elisa Bertino
Dipartimento di Scienze dell'Informazione
Universita' di Milano
Via Comelico 39/41
20135 Milano (Italy)

tel: (+39)2-55006227
fax: (+39)2-55006253

e-mail: bertino@dsi.unimi.it
bertino@disi.unige.it
www

http://mercurio.sm.dsi.unimi.it/~bertino/

Previous 2 Next Top

Date: Wed, 9 Apr 1997 19:23:44 +0100
From: Marco Ramoni (M.Ramoni@open.ac.uk)
Subject: Technical Reports Available

The following reports are available on the World Wide Web. Further
information about the Bayesian Knowledge Discovery Project can be
reached at

http://kmi.open.ac.uk/~marco/projects/kdd

Marco
______________________________________________________________________________

Title: Efficient Parameter Learning in Bayesian Networks from
Incomplete Databases
Authors: Marco Ramoni [1] and Paola Sebastiani [2]
1.Knowledge Media Institute, The Open University.
2.Department of Actuarial Science and Statistics, City University.

TR number: KMI-TR-41
Date: January 1997
Keywords: Bayesian Belief Networks; Machine Learning,
Probabilistic Reasoning, Missing Data.

Abstract:
Current methods to learn conditional probabilities from incomplete
databases use a common strategy: they complete the database by
inferring somehow the missing data from the available information and
then learn from the completed database. This paper introduces a new
method - called bound and collapse (BC) - which does not follow this
strategy. BC starts by bounding the set of estimates consistent with the
available information and then collapses the resulting set to a point
estimate via a convex combination of the extreme points, with weights
depending on the assumed pattern of missing data. Experiments
comparing BC to the Gibbs Samplings are also provided.

WWW:

http://kmi.open.ac.uk/kmi-abstracts/kmi-tr-41-abstract.html

______________________________________________________________________________

Title: Learning Bayesian Networks from Incomplete Databases
Authors: Marco Ramoni [1] and Paola Sebastiani [2]
1.Knowledge Media Institute, The Open University.
2.Department of Actuarial Science and Statistics, City University.

Reference: Technical Report KMI-TR-43
Date: February 1997
Keywords: Bayesian Belief Networks, Bayesian Learning, Missing Data, Model
Selection

Abstract:
Bayesian approaches to learn the graphical structure of Bayesian Belief
Networks (BBNs) from databases share the assumption that the
database is complete, that is, no entry is reported as unknown. Attempts
to relax this assumption often involve the use of expensive iterative
methods to discriminate among different structures. This paper
introduces a deterministic method to learn the graphical structure of a
BBN from a possibly incomplete database. Experimental evaluations
show a significant robustness of this method and a remarkable
independence of its execution time from the number of missing data.

WWW:

http://kmi.open.ac.uk/kmi-abstracts/kmi-tr-43-abstract.html

_____________________________________________________________________________

Title: The Use of Exogenous Knowledge to Learn Bayesian Networks
from Incomplete Databases
Authors: Marco Ramoni [1] and Paola Sebastiani [2]
1.Knowledge Media Institute, The Open University.
2.Department of Actuarial Science and Statistics, City University.

TR number: KMI-TR-44
Date: February 1997
Keywords: Information extraction, Uncertainty and noise in data,
Bayesian inference.

Abstract:
Current methods to learn Bayesian Networks from incomplete
databases share the common assumption that the unreported data are
missing at random. This paper describes a method - called Bound and
Collapse (BC) - to learn Bayesian Networks from incomplete databases
which allows the analyst to efficiently integrate the information
provided by the database and the exogenous knowledge about the pattern
of missing data. BC starts by bounding he set of estimates consistent
with the available information and then collapses the resulting set to
a point estimate via a convex combination of the extreme points, with
weights depending on the assumed pattern of missing data. Experiments
comparing BC to the Gibbs Samplings are also provided.

WWW:

http://kmi.open.ac.uk/kmi-abstracts/kmi-tr-44-abstract.html

____________________________________________________________________________

Title: Discovering Bayesian Networks in Incomplete Databases
Authors: Marco Ramoni [1] and Paola Sebastiani [2]
1.Knowledge Media Institute, The Open University.
2.Department of Actuarial Science and Statistics, City University.

TR number: KMI-TR-46
Date: March 1997
Keywords: Information extraction, Uncertainty and noise in data,
Bayesian inference.

Abstract:
Bayesian Belief Networks (BBNs) are becoming increasingly
popular in the Knowledge Discovery and Data Mining community. A
BBN is defined by a graphical structure of conditional dependencies
among the domain variables and a set of probability distributions
defining these dependencies. In this way, BBNs provide a compact
formalism - grounded in the well-developed mathematics of
probability theory - able to predict variable values, explain
observations, and visualize dependencies among variables. During
the past few years, several efforts have been addressed to develop
methods able to extract both the graphical structure and the
conditional probabilities of a BBN from a database. All these
methods share the assumption that the database at hand is complete,
that is, it does not report any entry as unknown. When this
assumption fails, these methods have to resort to expensive iterative
procedures which are infeasible for large databases. This paper
describes a new Knowledge Discovery system based on an efficient
method able to extract the graphical structure and the probability
distributions of a BBN from possibly incomplete databases. An
application using a large real-world database will illustrate methods
and concepts underlying the system and will assess its advantages as
a Knowledge Discovery system.

WWW:

http://kmi.open.ac.uk/kmi-abstracts/kmi-tr-46-abstract.html

______________________________________________________________________________
Marco Ramoni
Knowledge Media Institute Phone: +44-1908-65-5721
The Open University Fax: +44-1908-65-3169
Walton Hall Email: M.Ramoni@open.ac.uk
Milton Keynes MK7 6AA URL:

http://kmi.open.ac.uk/~marco

UNITED KINGDOM CUSeeMe: 137.108.81.18

Previous 3 Next Top

Date: Wed, 16 Apr 1997 10:24:19 -0400
From: Tom Mitchell (Tom_Mitchell@daylily.learning.cs.cmu.edu)
Sibject: Text book for Data Mining: Machine Learning by Tom Mitchell

DATAMINING TEXTBOOK: Machine Learning, Tom Mitchell, McGraw Hill

McGraw Hill announces immediate availability of a new textbook that
covers the primary algorithms used in datamining. MACHINE LEARNING
provides a thorough, interdisciplinary introduction to the key
algorithms used in datamining.

Free inspection copies are available for instructors, by contacting
Betsy Jones (McGraw Hill) at (630) 789-5057.

The chapter outline is:

1. Introduction
2. Concept Learning and the General-to-Specific Ordering
3. Decision Tree Learning
4. Artificial Neural Networks
5. Evaluating Hypotheses
6. Bayesian Learning
7. Computational Learning Theory
8. Instance-Based Learning
9. Genetic Algorithms
10. Learning Sets of Rules
11. Analytical Learning
12. Combining Inductive and Analytical Learning
13. Reinforcement Learning

(414 pages)

This book is intended for upper-level undergraduates, graduate
students, and professionals working in the area of datamining, machine
learning, and statistics. The text includes over a hundred homework
exercises, along with web-accessible code and datasets (e.g., neural
networks applied to face recognition, Bayesian learning applied to
text classification).

For further information and ordering instructions, see

http://www.cs.cmu.edu/~tom/mlbook.html

Previous 4 Next Top

From: quinlan@rulequest.com (Ross Quinlan)
Date: Wed, 16 Apr 1997 07:47:28 -0400 (EDT)
Subject: Windows Version of C5.0 ('See5') Available Now

Please see

http://www.rulequest.com

for details. As with the
Unix version, a scaled-down demonstration version is free, and
there is also a free 10-day trial of the real thing.

Ross

Previous 5 Next Top

[The following is a commercial announcement. GPS]
Date: Sat, 19 Apr 97 11:51:52 PDT
From: Stanley Rice (autospec@mail.cruzio.com)

Now that spring is sprung, what about tasting some
PRECOORDINATE WINES FROM POSTCOORDINATE BOTTLES? ;-)

Like the taste of wine, relevance is not objective to us. It
is subjective, without crisp definition, dependent on our
context, describable only by fuzzy postcoordinations. SIGs
as well as individuals recognize relevance only in context.

With a little help from our friends we can optimize
relevance. But most folks have never even heard the word
postcoordination. Precoordinate systems still predominate--
Yahoo categories, single topic and alphabetical filings--at
work, at school, and at home.

The Internet, AltaVista-style search engines, and Thematic
concept filtering will change a lot of that before long. The
change may come more smoothly because old precoordinations
can be included under postcoordinations, and actually be
much enhanced thereby. Just putting the old wine in the new
bottles can multiply its bouquet and value. (No, there is
nothing for sale here.)

Examples of postcoordination possibilities with included
fuzzy precoordinations, suited to electronic libraries,
corporate intranets (and many other 'incoherent' but
currently precoordinated collections) are given at:

http://www.cruzio.com/~autospec/darwin.htm

(Darwin's 'The Voyage of the Beagle' is used to illustrate
Dewey precoordinations included under postcoordinations.)
Want a different kind of example? Consider 'Correlating
Symptoms and Remedies,' which includes uses for various
kinds of traditional diagnostic precoordinations:

http://www.cruzio.com/~autospec/accessf.htm

On the Autospec home page (address below) we look at
postcoordination of contextual and conceptual filtering from
many points of view. Your reactions are always appreciated.
In any case, relax and have another glass. It's spring! ;-)

Regards, Stan Rice

--
THEMATICS: Conceptual & Marketing Access to Text and Media
AUTOSPEC, Inc. Santa Cruz, CA. Stan Rice Voice: (408) 457-1430
Home page for Autospec:

http://www.cruzio.com/~autospec/

Previous 6 Next Top

[The following is a commercial announcement. GPS]

Date: Tue, 22 Apr 1997 11:09:49 -0700
From: Pamela Lerwick (minedata@pipeline.com)
Subject: IDIS Special Release

Contact: IDI Marketing Communications
(310) 936-3600

New Machine-Man Paradigm
Refocuses Data Mining

Novel Approach Based on Explainable Intranet Documents Introduces New
Languages and Techniques for Data Mining

_____________________________________________________________________________

Los Angeles -- April 21, 1997

The 1997 Database World Conference in Boston will witness the birth of a new
computing paradigm for decision support -- certain to affect the way
corporations use and benefit from computers. While most computing to date
has focused on man-machine interaction, this new and novel approach
introduces machine-man interaction.

In man-machine systems, humans view machines as 'order-takers' -- we tell
machines what to do, not help them tell us what they know. This one-way bias
is manifest even in the term man-machine itself.

While the direction of man-machine systems has been from man to machine, the
focus of machine-man interaction is from machine to man, assisting machines
to say their piece -- delivering the benefits of the immense knowledge they
possess. This does not mean natural language output, but is based on a
specific and novel approach to model building, data structuring, language
design and information delivery.

With a database query language or a programming language, the user types or
otherwise inputs a query or program -- the machine then tries to understand
it and generate a response. In machine-man interaction, the machine types up
a set of statements as an 'explainable document' and the user understands
them to improve decision making.

This dramatic new idea will be first presented at the Database World
Conference in Boston, on May 20, 1997 by Dr. Kamran Parsaye, CEO of
Information Discovery, Inc.
He will discuss the far reaching consequences of this paradigm for corporate
computing.

The NASA Scientific and Technical Information Program defines a man-machine
system as: 'A System in which the functions of the man and the machine are
interrelated and necessary for the operation of the system.' Similarly, Dr.
Parsaye defines a machine-man system as: 'A System in which the functions of
the machine and the man are interrelated and necessary for the thinking of
the man.'

For a machine to tell us anything, it needs a suitable language of
expression. It needs to be able to phrase its knowledge in terms of a
language understandable by us. When dealing with computer systems, the term
'language' has often been used in the context of programming languages and
query languages. In machine-man interaction, we need languages that help
machines express their knowledge for our benefit -- i.e. knowledge
expression languages.

Programming and query languages have to be understandable by computers,
knowledge expression languages have to be comprehensible to human users --
they are the tools machines use to help us. Dr. Parsaye will illustrate how
traditional languages and systems such as SQL or OLAP are inadequate due to
their focus on one-way interaction models.

Machine-man interaction requires three distinct language facilities: First a
language to organize the environment and develop scripts, etc. as one does
in any system, second a language to let a developer or analyst define
models, set up scenarios and specify terms for the lexicon to be used by the
machine (i.e. an interactive document composition language), and third a
language to allow the machine to express knowledge (i.e. a knowledge
expression language.)

Using agent technology on the inter/intranet, machine-man system have a life
of their own. They look for patterns with agents, perform discovery and
when there is something interesting to say, they generate an 'explainable
document' on the intranet in plain English (or Italian, French, etc.)
accompanied by graphs. Machines need no longer be just order-takers, but can
be the finders and communicators of knowledge.

The impact of the new paradigm on corporate planning for decision support
and data warehousing will be significant. Business users and IS departments
need no longer just consider 'tools' as a method of data mining, but can
rely on automatically generated Java-based explainable documents with rich
text and graphic content. This will simultaneously accelerate the use of
Java, intranets, data warehousing and data mining.

For more information on the Database World Conference please visit DCI at

http://www.DCIexpo.com

on the internet, or call (508) 470-3870. For more
information on Information Discovery, Inc. please visit

http://www.datamining.com

on the internet or call (310) 937-3600.

Pamela Lerwick

Previous 7 Next Top

Date: Mon, 14 Apr 1997 17:14:00 +0100
From: ROSS DONALD KING (rdk@aber.ac.uk)
Subject: Ph.D. Studentships

Field: data mining, machine learning, ILP, scientific discovery

Place: University of Wales, Aberystwyth
Wales, UK

Applications are invited for Ph.D. Studentships in the area of data mining
in the Centre for Intelligent Systems at the Department of Computer
Science, University of Wales, Aberystwyth.

The Centre for Intelligent Systems has a particular interest in
knowledge rich data mining systems, Inductive Logic programming,
and applications in biology and chemistry.

Applicants should have at least a 2(i) in Computer Science or related
subject, with a good background in Artificial Intelligence or
Statistics.

More information can be obtained from
Professor Mark Lee or Dr. Ross D. King

Department of Computer Science,
University of Wales,
Penglais,
Aberystwyth,
Ceredigion, SY23 3DB,
Wales, UK

Tel: +44 1970 622420
Fax: +44 1970 622455
Email: mhl@aber.ac.uk rdk@aber.ac.uk

or from the URLs:

http://www.aber.ac.uk/~dcswww/Public/Recruitment/Proposals/

http://www.aber.ac.uk/~dcswww/Public/Research/

Previous 8 Next Top

Date: Thu, 17 Apr 97 09:32:42 EDT
From: 'Fred J. Damerau (862-2214)' (DAMERAU@watson.ibm.com)
Subject: Research Associate Position in Text Mining/Information Extraction

The Natural Language Understanding Group at the IBM T. J. Watson
Research Laboratory (Yorktown Heights, NY 10566) is looking for
a Research Associate with the qualifications listed below. The
position will most likely be initially for one year, but it is
renewable. The successful candidate will work on our text mining/
information extraction project, with a particular emphasis on
applying machine learning techniques to various issues in document
management. The project combines state-of-the-art research on machine
learning in text mining with practical production-level systems building.

________________________________________________________________
Qualifications:

The ideal candidate would have the following knowledge and experience.

Education: MA/MS in computer science or other field with extensive
background in computer science.

Programming languages:
Extensive knowledge and experience in C/C++ required; Java a plus.

Specialized Background:
Experience in implementing machine learning algorithms and/or
natural language processing algorithms.

Operating systems:
Required: Familiarity with Windows95/NT and Unix/AIX,
Helpful: Familiarity with OS/2
System programming/API experience on these operating systems not required.

General Software Development:
Familiarity with issues of large scale software development, e.g.,
API design and use, creation and integration of DLLs/Libraries,
source code control systems etc.

Candidates should send resumes and supporting letters to:

Thomas Hampp
eMail: hampp@watson.ibm.com
phone: 914-945-1714

End of message

Previous 9 Next Top