Knowledge Discovery Nuggets 97:14, e-mailed 97-04-23


News:
* E. Bertino, Query: data mining from wafers manufacturing process?
Publications:
* M. Ramoni, Technical Reports on Bayesian Knowledge Discovery,
  http://kmi.open.ac.uk/~marco/projects/kdd
* Tom Mitchell, Text book for Data Mining: Machine Learning,
  http://www.cs.cmu.edu/~tom/mlbook.html
Siftware:
* R. Quinlan, Windows Version of C5.0 ('See5') Available Now,
  http://www.rulequest.com
* Stanley Rice, Postcoordinate Software,
  http://www.cruzio.com/~autospec/darwin.htm
* Pamela Lerwick, IDIS Special Release,
  http://www.datamining.com
Positions:
* R. King, Ph.D. Studentships in Data Mining at University of Wales, UK
* Fred J. Damerau, Research Associate in Text Mining/Information
  Extraction
    --
    KD Nuggets is a newsletter for the
    Data Mining and Knowledge Discovery community, focusing on the
    latest research and applications.

    Submissions are most welcome and should be emailed, with a
    DESCRIPTIVE subject line (and a URL) to gps.
    Please keep CFP and meetings announcements short and provide
    a URL for details.

    To subscribe, see http://www.kdnuggets.com/subscribe.html

    KD Nuggets frequency is 3-4 times a month.
    Back issues of KD Nuggets, a catalog of data mining tools
    ('Siftware'), pointers to Data Mining Companies, Relevant Websites,
    Meetings, and more are available at the Knowledge Discovery Mine site
    at http://www.kdnuggets.com/

    -- Gregory Piatetsky-Shapiro (editor)
    gps

    ********************* Official disclaimer ***************************
    All opinions expressed herein are those of the contributors and not
    necessarily of their respective employers (or of KD Nuggets)
    *********************************************************************

    ~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Restlessness and discontent are the necessities of progress.
    --Thomas A. Edison


    From: bertino@dsi.unimi.it
    Date: Thu, 17 Apr 1997 09:44:45 +0200 (METDST)
    Subject: data mining from wafers manufacturing process

    At our University, we are starting an application project
    dealing with data from a wafer manufacturing process.
    We are thinking of using data mining techniques
    to try to address the following problem.
    Some of those wafers are faulty. There is a database keeping track
    of the entire manufacturing process for each wafer and collecting
    a large amount of data concerning each step of the manufacturing
    process (there are about 300 steps; each step is characterized
    by about 100 parameters). Our problem is to use data mining techniques
    to help the diagnosis, that is, to see which step
    may have caused the problem.
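
    A minimal sketch of one way to frame the diagnosis question above,
    not a recommendation of a specific tool: treat each (step, parameter)
    reading as a feature and rank steps by how strongly their readings
    separate faulty wafers from good ones. The step names, parameter
    names, readings, and the rank_steps helper are all hypothetical, and
    the score (difference of class means divided by the overall standard
    deviation) is a simple stand-in for a proper data mining method.

```python
from statistics import mean, pstdev


def rank_steps(wafers):
    """Rank process steps by how well any one parameter separates faulty wafers.

    wafers: list of (faulty, readings) pairs, where readings maps
    (step, parameter) -> float. Returns [(step, score, parameter), ...],
    highest score first. Steps whose parameters show no difference are omitted.
    """
    best = {}  # step -> (best score so far, parameter that produced it)
    for step, param in wafers[0][1].keys():
        values = [r[(step, param)] for _, r in wafers]
        spread = pstdev(values)
        bad = [r[(step, param)] for faulty, r in wafers if faulty]
        good = [r[(step, param)] for faulty, r in wafers if not faulty]
        if spread == 0 or not bad or not good:
            continue  # constant parameter or only one class: no evidence
        score = abs(mean(bad) - mean(good)) / spread
        if score > best.get(step, (0.0, None))[0]:
            best[step] = (score, param)
    return sorted(((s, sc, p) for s, (sc, p) in best.items()),
                  key=lambda item: -item[1])


def reading(dep_temp, dep_pressure, etch_temp, etch_time):
    # Hypothetical record layout: two steps with two parameters each.
    return {("deposit", "temp"): dep_temp, ("deposit", "pressure"): dep_pressure,
            ("etch", "temp"): etch_temp, ("etch", "time"): etch_time}


# Invented data: the three faulty wafers ran hot in the "etch" step.
wafers = [
    (False, reading(400.0, 1.0, 60.0, 30.0)),
    (False, reading(402.0, 1.1, 61.0, 31.0)),
    (False, reading(399.0, 0.9, 59.0, 29.0)),
    (True, reading(401.0, 1.0, 75.0, 30.0)),
    (True, reading(400.0, 1.1, 77.0, 31.0)),
    (True, reading(401.0, 0.9, 76.0, 29.0)),
]

ranking = rank_steps(wafers)
for step, score, param in ranking:
    print(step, param, round(score, 2))
```

    With about 300 steps and 100 parameters per step, a ranking like this
    would only be a first screen; correlated parameters and interactions
    between steps would call for a richer model.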

    I was wondering whether you are aware of any use of data mining
    techniques for similar problems. We also have to acquire
    some suitable data mining tools.

    I would appreciate any suggestion you may give me on this
    issue.

    Best regards Elisa
    -------------------------------------------------------------------------------
    Prof. Elisa Bertino
    Dipartimento di Scienze dell'Informazione
    Universita' di Milano
    Via Comelico 39/41
    20135 Milano (Italy)

    tel: (+39)2-55006227
    fax: (+39)2-55006253

    e-mail: bertino@dsi.unimi.it
    bertino@disi.unige.it
    www: http://mercurio.sm.dsi.unimi.it/~bertino/



    Date: Wed, 9 Apr 1997 19:23:44 +0100
    From: Marco Ramoni (M.Ramoni@open.ac.uk)
    Subject: Technical Reports Available

    The following reports are available on the World Wide Web. Further
    information about the Bayesian Knowledge Discovery Project can be
    found at

    http://kmi.open.ac.uk/~marco/projects/kdd

    Marco
    ______________________________________________________________________________

    Title: Efficient Parameter Learning in Bayesian Networks from
    Incomplete Databases
    Authors: Marco Ramoni [1] and Paola Sebastiani [2]
    1.Knowledge Media Institute, The Open University.
    2.Department of Actuarial Science and Statistics, City University.

    TR number: KMI-TR-41
    Date: January 1997
    Keywords: Bayesian Belief Networks, Machine Learning,
    Probabilistic Reasoning, Missing Data.

    Abstract:
    Current methods to learn conditional probabilities from incomplete
    databases use a common strategy: they complete the database by
    somehow inferring the missing data from the available information and
    then learn from the completed database. This paper introduces a new
    method - called bound and collapse (BC) - which does not follow this
    strategy. BC starts by bounding the set of estimates consistent with the
    available information and then collapses the resulting set to a point
    estimate via a convex combination of the extreme points, with weights
    depending on the assumed pattern of missing data. Experiments
    comparing BC to Gibbs sampling are also provided.
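
    The bound-and-collapse idea in the abstract can be sketched for the
    simplest case, a single discrete variable with some entries missing.
    This is an illustration under stated assumptions, not the authors'
    implementation: it uses plain relative frequencies with no prior, and
    the function names and example counts are made up. Each extreme point
    assigns every missing entry to one category; the weights express the
    assumed pattern of missing data.

```python
from fractions import Fraction


def bound(counts, n_missing):
    """Lower and upper bound on each category's relative frequency.

    Lower bound: no missing entry belongs to the category.
    Upper bound: every missing entry belongs to the category.
    """
    total = sum(counts.values()) + n_missing
    return {k: (Fraction(c, total), Fraction(c + n_missing, total))
            for k, c in counts.items()}


def extreme_points(counts, n_missing):
    """One distribution per category j: all missing entries assigned to j."""
    total = sum(counts.values()) + n_missing
    return {j: {k: Fraction(c + (n_missing if k == j else 0), total)
                for k, c in counts.items()}
            for j in counts}


def collapse(counts, n_missing, weights):
    """Convex combination of the extreme points.

    weights[j] is the assumed probability that a missing entry is in
    category j; the weights must be non-negative and sum to 1.
    """
    points = extreme_points(counts, n_missing)
    return {k: sum(weights[j] * points[j][k] for j in counts) for k in counts}


# Hypothetical data: 8 observed entries and 2 missing ones.
observed = {"a": 6, "b": 2}
missing = 2

# One way to encode "missing at random": weights equal to observed frequencies.
n_observed = sum(observed.values())
mar_weights = {k: Fraction(c, n_observed) for k, c in observed.items()}

bounds = bound(observed, missing)
estimate = collapse(observed, missing, mar_weights)
print(bounds)
print(estimate)
```

    With weights equal to the observed frequencies, the collapsed estimate
    coincides with the estimate from the complete cases alone; other
    weights move the estimate within the bounds.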

    WWW:
    http://kmi.open.ac.uk/kmi-abstracts/kmi-tr-41-abstract.html

    ______________________________________________________________________________

    Title: Learning Bayesian Networks from Incomplete Databases
    Authors: Marco Ramoni [1] and Paola Sebastiani [2]
    1.Knowledge Media Institute, The Open University.
    2.Department of Actuarial Science and Statistics, City University.

    Reference: Technical Report KMI-TR-43
    Date: February 1997
    Keywords: Bayesian Belief Networks, Bayesian Learning, Missing Data, Model
    Selection

    Abstract:
    Bayesian approaches to learn the graphical structure of Bayesian Belief
    Networks (BBNs) from databases share the assumption that the
    database is complete, that is, no entry is reported as unknown. Attempts
    to relax this assumption often involve the use of expensive iterative
    methods to discriminate among different structures. This paper
    introduces a deterministic method to learn the graphical structure of a
    BBN from a possibly incomplete database. Experimental evaluations
    show a significant robustness of this method and a remarkable
    independence of its execution time from the number of missing data.

    WWW:
    http://kmi.open.ac.uk/kmi-abstracts/kmi-tr-43-abstract.html

    ______________________________________________________________________________

    Title: The Use of Exogenous Knowledge to Learn Bayesian Networks
    from Incomplete Databases
    Authors: Marco Ramoni [1] and Paola Sebastiani [2]
    1.Knowledge Media Institute, The Open University.
    2.Department of Actuarial Science and Statistics, City University.

    TR number: KMI-TR-44
    Date: February 1997
    Keywords: Information extraction, Uncertainty and noise in data,
    Bayesian inference.

    Abstract:
    Current methods to learn Bayesian Networks from incomplete
    databases share the common assumption that the unreported data are
    missing at random. This paper describes a method - called Bound and
    Collapse (BC) - to learn Bayesian Networks from incomplete databases
    which allows the analyst to efficiently integrate the information
    provided by the database and the exogenous knowledge about the pattern
    of missing data. BC starts by bounding the set of estimates consistent
    with the available information and then collapses the resulting set to
    a point estimate via a convex combination of the extreme points, with
    weights depending on the assumed pattern of missing data. Experiments
    comparing BC to Gibbs sampling are also provided.

    WWW:
    http://kmi.open.ac.uk/kmi-abstracts/kmi-tr-44-abstract.html

    ______________________________________________________________________________

    Title: Discovering Bayesian Networks in Incomplete Databases
    Authors: Marco Ramoni [1] and Paola Sebastiani [2]
    1.Knowledge Media Institute, The Open University.
    2.Department of Actuarial Science and Statistics, City University.

    TR number: KMI-TR-46
    Date: March 1997
    Keywords: Information extraction, Uncertainty and noise in data,
    Bayesian inference.


    Abstract:
    Bayesian Belief Networks (BBNs) are becoming increasingly
    popular in the Knowledge Discovery and Data Mining community. A
    BBN is defined by a graphical structure of conditional dependencies
    among the domain variables and a set of probability distributions
    defining these dependencies. In this way, BBNs provide a compact
    formalism - grounded in the well-developed mathematics of
    probability theory - able to predict variable values, explain
    observations, and visualize dependencies among variables. During
    the past few years, several efforts have been made to develop
    methods able to extract both the graphical structure and the
    conditional probabilities of a BBN from a database. All these
    methods share the assumption that the database at hand is complete,
    that is, it does not report any entry as unknown. When this
    assumption fails, these methods have to resort to expensive iterative
    procedures which are infeasible for large databases. This paper
    describes a new Knowledge Discovery system based on an efficient
    method able to extract the graphical structure and the probability
    distributions of a BBN from possibly incomplete databases. An
    application using a large real-world database will illustrate methods
    and concepts underlying the system and will assess its advantages as
    a Knowledge Discovery system.
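
    As a minimal illustration of the definition above (a graph of
    conditional dependencies plus the probability distributions that
    quantify them), the sketch below builds a two-variable network by
    hand. The variables, the probabilities, and the function names are
    invented for the example and do not come from the report.

```python
from fractions import Fraction as F

# Graphical structure (hypothetical): rain -> wet_grass.
# Probability distributions attached to the structure:
p_rain = {True: F(1, 5), False: F(4, 5)}
p_wet_given_rain = {
    True: {True: F(9, 10), False: F(1, 10)},   # P(wet | rain)
    False: {True: F(1, 10), False: F(9, 10)},  # P(wet | no rain)
}


def joint(rain, wet):
    """P(rain, wet) factorizes along the edges of the graph."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]


def predict_wet():
    """Predict a variable value: P(wet) by summing out the parent."""
    return sum(joint(rain, True) for rain in (True, False))


def posterior_rain(wet):
    """Explain an observation: P(rain | wet) by Bayes' rule."""
    evidence = sum(joint(rain, wet) for rain in (True, False))
    return joint(True, wet) / evidence


print(predict_wet())
print(posterior_rain(True))
```

    Learning such a network from data means estimating both the edges and
    the tables; the sketch fixes both by hand to show only how they are
    used once available.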

    WWW:
    http://kmi.open.ac.uk/kmi-abstracts/kmi-tr-46-abstract.html

    ______________________________________________________________________________
    Marco Ramoni
    Knowledge Media Institute    Phone: +44-1908-65-5721
    The Open University          Fax: +44-1908-65-3169
    Walton Hall                  Email: M.Ramoni@open.ac.uk
    Milton Keynes MK7 6AA        URL: http://kmi.open.ac.uk/~marco
    UNITED KINGDOM               CUSeeMe: 137.108.81.18


    Date: Wed, 16 Apr 1997 10:24:19 -0400
    From: Tom Mitchell (Tom_Mitchell@daylily.learning.cs.cmu.edu)
    Subject: Text book for Data Mining: Machine Learning by Tom Mitchell

    DATAMINING TEXTBOOK: Machine Learning, Tom Mitchell, McGraw Hill

    McGraw Hill announces immediate availability of a new textbook that
    covers the primary algorithms used in datamining. MACHINE LEARNING
    provides a thorough, interdisciplinary introduction to the key
    algorithms used in datamining.

    Free inspection copies are available for instructors, by contacting
    Betsy Jones (McGraw Hill) at (630) 789-5057.

    The chapter outline is:

    1. Introduction
    2. Concept Learning and the General-to-Specific Ordering
    3. Decision Tree Learning
    4. Artificial Neural Networks
    5. Evaluating Hypotheses
    6. Bayesian Learning
    7. Computational Learning Theory
    8. Instance-Based Learning
    9. Genetic Algorithms
    10. Learning Sets of Rules
    11. Analytical Learning
    12. Combining Inductive and Analytical Learning
    13. Reinforcement Learning

    (414 pages)

    This book is intended for upper-level undergraduates, graduate
    students, and professionals working in the area of datamining, machine
    learning, and statistics. The text includes over a hundred homework
    exercises, along with web-accessible code and datasets (e.g., neural
    networks applied to face recognition, Bayesian learning applied to
    text classification).

    For further information and ordering instructions, see
    http://www.cs.cmu.edu/~tom/mlbook.html



    From: quinlan@rulequest.com (Ross Quinlan)
    Date: Wed, 16 Apr 1997 07:47:28 -0400 (EDT)
    Subject: Windows Version of C5.0 ('See5') Available Now

    Please see http://www.rulequest.com for details. As with the
    Unix version, a scaled-down demonstration version is free, and
    there is also a free 10-day trial of the real thing.

    Ross


    [The following is a commercial announcement. GPS]
    Date: Sat, 19 Apr 97 11:51:52 PDT
    From: Stanley Rice (autospec@mail.cruzio.com)

    Now that spring is sprung, what about tasting some
    PRECOORDINATE WINES FROM POSTCOORDINATE BOTTLES? ;-)

    Like the taste of wine, relevance is not objective to us. It
    is subjective, without crisp definition, dependent on our
    context, describable only by fuzzy postcoordinations. SIGs
    as well as individuals recognize relevance only in context.

    With a little help from our friends we can optimize
    relevance. But most folks have never even heard the word
    postcoordination. Precoordinate systems still predominate--
    Yahoo categories, single topic and alphabetical filings--at
    work, at school, and at home.

    The Internet, AltaVista-style search engines, and Thematic
    concept filtering will change a lot of that before long. The
    change may come more smoothly because old precoordinations
    can be included under postcoordinations, and actually be
    much enhanced thereby. Just putting the old wine in the new
    bottles can multiply its bouquet and value. (No, there is
    nothing for sale here.)

    Examples of postcoordination possibilities with included
    fuzzy precoordinations, suited to electronic libraries,
    corporate intranets (and many other 'incoherent' but
    currently precoordinated collections) are given at:

    http://www.cruzio.com/~autospec/darwin.htm

    (Darwin's 'The Voyage of the Beagle' is used to illustrate
    Dewey precoordinations included under postcoordinations.)
    Want a different kind of example? Consider 'Correlating
    Symptoms and Remedies,' which includes uses for various
    kinds of traditional diagnostic precoordinations:

    http://www.cruzio.com/~autospec/accessf.htm

    On the Autospec home page (address below) we look at
    postcoordination of contextual and conceptual filtering from
    many points of view. Your reactions are always appreciated.
    In any case, relax and have another glass. It's spring! ;-)

    Regards, Stan Rice

    --
    THEMATICS: Conceptual & Marketing Access to Text and Media
    AUTOSPEC, Inc. Santa Cruz, CA. Stan Rice Voice: (408) 457-1430
    Home page for Autospec:
    http://www.cruzio.com/~autospec/



    [The following is a commercial announcement. GPS]

    Date: Tue, 22 Apr 1997 11:09:49 -0700
    From: Pamela Lerwick (minedata@pipeline.com)
    Subject: IDIS Special Release

    Contact: IDI Marketing Communications
    (310) 936-3600

    New Machine-Man Paradigm
    Refocuses Data Mining

    Novel Approach Based on Explainable Intranet Documents Introduces New
    Languages and Techniques for Data Mining

    _____________________________________________________________________________

    Los Angeles -- April 21, 1997

    The 1997 Database World Conference in Boston will witness the birth of a new
    computing paradigm for decision support -- certain to affect the way
    corporations use and benefit from computers. While most computing to date
    has focused on man-machine interaction, this new and novel approach
    introduces machine-man interaction.

    In man-machine systems, humans view machines as 'order-takers' -- we tell
    machines what to do, not help them tell us what they know. This one-way bias
    is manifest even in the term man-machine itself.

    While the direction of man-machine systems has been from man to machine, the
    focus of machine-man interaction is from machine to man, assisting machines
    to say their piece -- delivering the benefits of the immense knowledge they
    possess. This does not mean natural language output, but is based on a
    specific and novel approach to model building, data structuring, language
    design and information delivery.

    With a database query language or a programming language, the user types or
    otherwise inputs a query or program -- the machine then tries to understand
    it and generate a response. In machine-man interaction, the machine types up
    a set of statements as an 'explainable document' and the user understands
    them to improve decision making.

    This dramatic new idea will be first presented at the Database World
    Conference in Boston, on May 20, 1997 by Dr. Kamran Parsaye, CEO of
    Information Discovery, Inc.
    He will discuss the far reaching consequences of this paradigm for corporate
    computing.

    The NASA Scientific and Technical Information Program defines a man-machine
    system as: 'A System in which the functions of the man and the machine are
    interrelated and necessary for the operation of the system.' Similarly, Dr.
    Parsaye defines a machine-man system as: 'A System in which the functions of
    the machine and the man are interrelated and necessary for the thinking of
    the man.'

    For a machine to tell us anything, it needs a suitable language of
    expression. It needs to be able to phrase its knowledge in terms of a
    language understandable by us. When dealing with computer systems, the term
    'language' has often been used in the context of programming languages and
    query languages. In machine-man interaction, we need languages that help
    machines express their knowledge for our benefit -- i.e. knowledge
    expression languages.

    Programming and query languages have to be understandable by computers,
    whereas knowledge expression languages have to be comprehensible to human
    users -- they are the tools machines use to help us. Dr. Parsaye will
    illustrate how
    traditional languages and systems such as SQL or OLAP are inadequate due to
    their focus on one-way interaction models.

    Machine-man interaction requires three distinct language facilities: first, a
    language to organize the environment and develop scripts, etc., as one does
    in any system; second, a language to let a developer or analyst define
    models, set up scenarios and specify terms for the lexicon to be used by the
    machine (i.e. an interactive document composition language); and third, a
    language to allow the machine to express knowledge (i.e. a knowledge
    expression language).

    Using agent technology on the inter/intranet, machine-man systems have a life
    of their own. They look for patterns with agents, perform discovery and
    when there is something interesting to say, they generate an 'explainable
    document' on the intranet in plain English (or Italian, French, etc.)
    accompanied by graphs. Machines need no longer be just order-takers, but can
    be the finders and communicators of knowledge.

    The impact of the new paradigm on corporate planning for decision support
    and data warehousing will be significant. Business users and IS departments
    need no longer just consider 'tools' as a method of data mining, but can
    rely on automatically generated Java-based explainable documents with rich
    text and graphic content. This will simultaneously accelerate the use of
    Java, intranets, data warehousing and data mining.

    For more information on the Database World Conference please visit DCI at
    http://www.DCIexpo.com on the internet, or call (508) 470-3870. For more
    information on Information Discovery, Inc. please visit
    http://www.datamining.com on the internet or call (310) 937-3600.

    Pamela Lerwick


    Date: Mon, 14 Apr 1997 17:14:00 +0100
    From: ROSS DONALD KING (rdk@aber.ac.uk)
    Subject: Ph.D. Studentships

    Field: data mining, machine learning, ILP, scientific discovery

    Place: University of Wales, Aberystwyth
    Wales, UK

    Applications are invited for Ph.D. Studentships in the area of data mining
    in the Centre for Intelligent Systems at the Department of Computer
    Science, University of Wales, Aberystwyth.

    The Centre for Intelligent Systems has a particular interest in
    knowledge rich data mining systems, Inductive Logic programming,
    and applications in biology and chemistry.

    Applicants should have at least a 2(i) in Computer Science or related
    subject, with a good background in Artificial Intelligence or
    Statistics.

    More information can be obtained from
    Professor Mark Lee or Dr. Ross D. King

    Department of Computer Science,
    University of Wales,
    Penglais,
    Aberystwyth,
    Ceredigion, SY23 3DB,
    Wales, UK

    Tel: +44 1970 622420
    Fax: +44 1970 622455
    Email: mhl@aber.ac.uk rdk@aber.ac.uk

    or from the URLs:
    http://www.aber.ac.uk/~dcswww/Public/Recruitment/Proposals/
    http://www.aber.ac.uk/~dcswww/Public/Research/




    Date: Thu, 17 Apr 97 09:32:42 EDT
    From: 'Fred J. Damerau (862-2214)' (DAMERAU@watson.ibm.com)
    Subject: Research Associate Position in Text Mining/Information Extraction

    The Natural Language Understanding Group at the IBM T. J. Watson
    Research Laboratory (Yorktown Heights, NY 10566) is looking for
    a Research Associate with the qualifications listed below. The
    position will most likely be initially for one year, but it is
    renewable. The successful candidate will work on our text mining/
    information extraction project, with a particular emphasis on
    applying machine learning techniques to various issues in document
    management. The project combines state-of-the-art research on machine
    learning in text mining with practical production-level systems building.

    ________________________________________________________________
    Qualifications:

    The ideal candidate would have the following knowledge and experience.

    Education: MA/MS in computer science or other field with extensive
    background in computer science.

    Programming languages:
    Extensive knowledge and experience in C/C++ required; Java a plus.

    Specialized Background:
    Experience in implementing machine learning algorithms and/or
    natural language processing algorithms.

    Operating systems:
    Required: Familiarity with Windows95/NT and Unix/AIX,
    Helpful: Familiarity with OS/2
    System programming/API experience on these operating systems not required.

    General Software Development:
    Familiarity with issues of large scale software development, e.g.,
    API design and use, creation and integration of DLLs/Libraries,
    source code control systems etc.

    Candidates should send resumes and supporting letters to:

    Thomas Hampp
    eMail: hampp@watson.ibm.com
    phone: 914-945-1714

    End of message
