KDD Nugget 94:20, e-mailed 94-11-18
Contents: 
	* M. Moulet, CFP: Statistics, ML, and KDD workshop in Europe, April 95
	* P. Godfrey, Tech. report on Failing Database Queries
	* X. Wu, KDD Tutorial Overheads Available by FTP
	* Blix, AI/ML oriented WWW page
	* Bob Duin, WWW page for pattern recognition
	* G. Piatetsky-Shapiro, WSJ: Dark Side of Database Marketing

The KDD Nuggets is a moderated list for the exchange of information 
relevant to Knowledge Discovery in Databases (KDD, also known as Data Mining),
e.g. application descriptions, conference announcements, tool reviews, 
information requests, interesting ideas, clever opinions, etc.
It has been coming out about every two-three weeks, depending on the
quantity and urgency of submissions. 

 Back issues of nuggets, a catalog of data mining tools, useful references,
 FAQ, and other KDD-related information are now available 
 at Knowledge Discovery Mine, URL http://info.gte.com/~kdd/  or 
 by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README

E-mail contributions to kdd@gte.com
Add/delete requests to kdd-request@gte.com 
	-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) * 
* and not necessarily of their respective employers (or GTE Laboratories)   *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"This class was a religious experience for me...
 I had to take it all On faith."

"What's the quality of the text?  'Text is printed on high quality paper.'"
	from MIT Course Evaluation Guide, forwarded by Sam Uthurusamy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 4 Nov 1994 15:59:27 +0100
From: Marjorie Moulet <Marjorie.Moulet@lri.fr>
Subject: cfp

 Call For Papers for MLnet Sponsored Familiarization Workshop: 

 Statistics, Machine Learning, and Knowledge Discovery In Databases
 Heraklion, Crete, Greece, 
 April 28-29, 1995
 ----------------------------------------------
  
Submission Deadline:  Feb 3, 1995.


Introduction

The intersection of, and interaction between Machine Learning (ML), 
statistics and knowledge discovery is a rapidly growing area of 
interest. There are several areas of common research.
On one hand, the considerable amount of data available in databases 
is an extraordinary source for extracting knowledge and tracing  
causal dependencies and for this reason, Knowledge Discovery in 
Databases (KDD) has become an important and new challenge, 
economically as well as scientifically. On the other hand, the 
statistical and machine learning algorithms can be considered as 
robust instruments for KDD. These facts make obvious that the KDD is 
an interdisciplinary concept benefiting from statistics, machine 
learning and database technology.
The success of the previous workshops on KDD at IJCAI89 and the AAAI 
(91, 93, 94) Conferences as well as the MLnet Workshop on Machine 
Learning and Statistics confirm a growing awareness and interest in 
this area.


The main topics of interest:

*   Statistical and ML approaches for discovery of causal structures 
in databases
*   Incremental learning methods to handle the temporal aspects of 
databases
*   Statistical and ML approaches for comprehensibility and 
Validation of discovered knowledge
*   Software architecture of KDD-Tools
*   Combining of ML and statistical approaches to build hybrid 
algorithms for KDD
*   Focusing aspects to improve the performance of KDD-Tools
*   Classification and prediction algorithms to handle large-scale 
data
*   Successful applications in Medicine, Business and Industry


Form of the workshop

The main part of the workshop will be devoted to presentations and 
invited talks. Talks will be grouped according to common interest 
with plenty of time for discussions.


Format and kind of contribution

Contributions are invited which present theoretical or practical 
results as well as review papers in one of the above topics of interest. 
Submitted papers must not exceed 12 pages, including abstract and 
bibliography. The title page should include the title of the talk, 
name(s) and affiliation(s) of the author(s) and a list of keywords 
as well as the full address (including E-Mail) of the first author.
Proceedings will be available at the workshop.
The submission of an electronic version (postscript or LaTeX) of a 
paper is highly recommended. Participation without submitting a full paper 
is possible but we encourage the all particpants to submit an abstract 
(up to two pages) to help the organisers to clarify the topics of interests.
A sample paper in LaTeX is available via ftp from amsta.leeds.ac.uk in 
directory  pub/ecml95. 

  
Contributions to the workshops should be sent to the
coordinator (see below the address) by 3rd February'95. 
 Decisions on acceptance will be 
made by 3rd March (early registration for ECML'95 closes on 17th 
February). Final camera-ready versions of the revised papers should 
be sent to the coordinator of the individual workshops by 24th March.


Submissions and queries should be sent to the coordinator: 

Gholamreza Nakhaeizadeh
Daimler Benz AG
Research and Technology / F3W
Postfach 2360
D-89013 Ulm
Germany

Email:   nakhaeizadeh@dbag.ulm.DaimlerBenz.COM
Phone:   +49 731 505 2860
Fax:     +49 731 505 4210


The workshop will be hosted at University of Heraklion, Crete, Greece.
It will take place immediately after the European Conference on 
Machine Learning (ECML95) at April 28-29, 1995. Funding for travelling
and living expenses are available according to the rules of the ML-net
work of excellence.

General enquiries about the workshop should be addressed to Derek 
Sleeman (Aberdeen) and about local arrangements to Vassilis Moustakis 
(Crete).


Chairs

Yves Kodratoff,  University of Paris-Sud, France
Gholamreza Nakhaeizadeh, Daimler-Benz, Ulm, Germany
Charles Taylor, University of Leeds, GB


Program Committee

Peter Edwards, University of Aberdeen, GB
Attilio Giordana, University of Torino, Italy
Usama Fayyad, Jet Propulsion Lab, USA
Bob Henery, University of Strathclyde, GB
Willi Klosgen, GMD, Germany
Heikki Mannila, University of Helsinki, Finland
Marjorie Moulet, Orsay, France
Gregory Piatetsky-Shapiro, GTE, USA
Arno Siebes, CWI, Amsterdam, The Netherlands
Rudiger Wirth, Daimler-Benz, Ulm, Germany
Jan Zytkow, Wichita State University, USA


--------------------------------------------
Return-Path: <godfrey@cs.UMD.EDU>
To: KDD Nuggets Moderator <kdd%eureka@gte.com>
Subject: Finding the failure points of failing queries to databases
Date: Mon, 07 Nov 1994 23:11:48 -0500
From: Parke Godfrey <godfrey@cs.UMD.EDU>

I am submitting this to the list as I believe it may be relevant to
the KDD community.  I have written a paper on the finding minimal
failing subqueries (MFSs) to failing queries in databases.

    Parke Godfrey,
    Minimization in Cooperative Response to Failing Database Queries,
    Technical Report (CS-TR-3348 and UMIACS-TR-94-108),
    CSD and UMIACS, University of Maryland at College Park,
    September 1994.  Submitted to the International Journal of
    Intelligent and Cooperative Information Systems.

The abstract is below.

At Maryland, we have been engaged in the areas of cooperative
answering and cooperative database systems.  The premise is that
current database systems are often difficult to use because the
answers they provide, while correct, are often misleading.
Cooperative answering seeks to alleviate this problem by providing
more informative responses to queries above and beyond the literal
answers to avoid misleading the user.

One question that arises is what to do when a database query fails.
In the least, it seems reasonable to identify the failure points of
the query to report back to the user.  So we should not just say that
the query failed, but also report which parts of the query failed.
Given a conjunctive query (a set of conditions to be satisfied), this
means to find minimal subsets of the query (subqueries) that fail.
There has been a fair amount of interest in this technique in the
literature.  These MFSs are also often called "false presuppositions"
of the query.

The complexity profile of the problem of finding the MFSs of a query
is surprising.  This paper maps out what the complexity is, and it
develops a good algorithm for finding the MSFs; no good (tractable)
algorithm has existed before.

I believe that the work people are doing in cooperative database
systems is closely related to work in KDD.  We must analyze queries
and the data-store to devise cooperative information (semantic
information).  In KDD, the data-store is analyzed for the purpose of
finding interesting trends and properties (semantic information).  One
potent tool to probe the data-store is to ask queries.

An approach to finding candidate integrity constraints (ICs) over a
database is to ask queries one suspects will fail.  If the query does
fail, it may be that it characterizes the database in general, and
should be promoted as an IC.  (Of course, this method looks for
absolute ICs and is not good for finding probabilistic rules.  It can
only find strong rules.)  However, it is not the failing query itself
that we should consider as a possible IC; it is its MFSs that we
should consider.  I believe that the algorithms I present in the paper
might be amenable to such a KDD technique.

Our papers on cooperative database systems can be found at our WWW
site, including the MFS paper I am advertising here.

    URL = http://karna.cs.umd.edu:3264/

They are also available by anonymous FTP in

    ftp.cs.umd.edu:/pub/prism/

If anyone is interested in this work or has comments on it, please let
me know.  I would appreciate to hear any questions, ideas, comments,
and criticisms, and of related works.  Thank you.

 -- Parke Godfrey
    Department of Computer Science
    University of Maryland

==============================================================================
    Minimization in Cooperative Response to Failing Database Queries
    Parke Godfrey

    When a query fails, it is more cooperative to identify the
    cause of failure, rather than just to report the empty answer
    set. If there is not a cause for the query's failure, it is
    worthwhile to report the part of the query which failed. To
    identify a minimal failing subquery (MFS) of the query is the
    best way to do this. (This MFS is not unique; there may be many
    of them.) Likewise, to identify a maximal succeeding subquery
    (MSS) can help a user to recast a new query that leads to a
    non-empty answer set.

    Database systems do not provide the functionality of these
    types of cooperative responses. This may be, in part, because
    algorithmic approaches to finding the MFSs and the MSSs to a
    failing query are not obvious.  The search space of subqueries is
    large.  Despite work on MFSs in the past, the algorithmic
    complexity of these identification problems had remained
    uncharted.

    This paper shows the complexity profile of MFS and MSS
    identification. It is shown that there exists a simple
    algorithm for finding a MFS or a MSS by asking N subsequent
    queries, in which N is the length of the query. To find more MFSs
    (or MSSs) can be hard. It is shown that to find O(N) MFSs (or
    MSSs) is NP-hard. To find k MFSs (or MSSs), for a fixed k,
    remains polynomial.

    An optimal algorithm for enumerating MFSs and MSSs, ISHMAEL,
    is developed and presented. The algorithm has ideal
    performance in enumeration, finding the first answers quickly,
    and decaying toward intractability in a predictable manner as
    more answers are found.

    The complexity results and the algorithmic approaches given
    in this paper should allow for the construction of MFS and
    MSS facilities for database systems. These results are relevant
    to a number of problems outside of databases too, and may find
    further application.


--------------------------------------------
Date: Thu, 17 Nov 1994 17:01:55 +1100 (EST)
From: Xindong Wu <xindong@INSECT.SD.MONASH.EDU.AU>
Subject: KDD Tutorial Overheads Available by FTP

          KDD Tutorial Overheads Available by FTP

The overheads for the AI'94 Tutorial on Intelligent Learning Database
Systems are now available by anonymous ftp from coral.cs.jcu.edu.au in
pub/HCV/KDD.ps.

The following is an outline of the tutorial.

  Knowledge acquisition from databases is a research frontier for both
  database technology and machine learning (ML) techniques, and has
  seen sustained research over recent years.  It also acts as a link
  between the two fields, thus offering a dual benefit.  Firstly,
  since database technology has already found wide application in many
  fields, ML research obviously stands to gain from this greater
  exposure and established technological foundation. Secondly, ML
  techniques can augment the ability of existing database systems to
  represent, acquire, and process a collection of expertise such as
  those which form part of the semantics of many advanced applications
  (e.g. CAD/CAM). This full-day tutorial presents and discusses
  techniques for the following 3 interconnected phases in constructing
  intelligent learning database systems: (1) Translation of standard
  database information into a form suitable for use by a rule-based
  system; (2) Using machine learning techniques to produce rule bases
  from databases; and (3) Interpreting the rules produced to solve
  users' problems and/or reduce data spaces.  It suits a wide audience
  including postgraduate students and industrial people from
  databases, expert systems, and machine learning.

Comments and suggestions for improvements are solicited!

Xindong Wu
--------------------------------------------

Date: Fri, 4 Nov 1994 15:44:51 -0600
Subject: AI/ML oriented WWW page
From: blix@cs.uiuc.edu (via ML List)

The AI Group and the Inductive Learning Group at the University of
Illnois maintains a World Wide Web page of useful resources for AI
researchers in general and machine learning researchers in particular
at:

	http://www-ilg.cs.uiuc.edu/info/

that is currently stable enough that we think it can go public.  Of
particular interest is a relatively up-to-date set of call for
conferences and workshops.  The rest contains mainly links to other
useful sources of information.

Feel free to use it, and let me know if you have suggestions or
additions;  they will be taken care of as time permits.


--------------------------------------------

Date: Tue, 1 Nov 94 08:36:44 +0100
From: Bob Duin <bob@ph.tn.tudelft.nl> (via ML List)
Subject: WWW page for pattern recognition

To researchers and students in paterrn recognition,

Those who are interested to approach the WWW from the PR point of view
and have direct access to some services as well, might try:

http://galaxy.ph.tn.tudelft.nl:2000/PRInfo.html

Suggestions are welcomed,

Bob Duin

R.P.W. Duin                                 Phone:  (31) 15 786143
Faculty of Applied Physics                  Fax:    (31) 15 626740
Delft University of Technology              E-mail: duin@ph.tn.tudelft.nl
P.O. Box 5046, 2600 GA Delft
The Netherlands


--------------------------------------------

Date: Fri, 18 Nov 94 
From: Gregory Piatetsky-Shapiro (gps@gte.com)
Subject: Wall Street Journal on the Dark Side of database marketing

Wall Street Journal, Oct 3, 1994, P. A20, article "Avoid Dark Side of
Database Marketing" by James Rosenfield, discusses this new buzzword
of top management.  The goal of database marketing is to allow management
to use computer-generated information about customer's buying decisions
to pinpoint those people most likely to buy a product.  The potential 
is great (see Business Week, Sept 1994 and KDD Nugget 94:17). The 
article warns to avoid these pitfalls:

1. Control the technology: technology can run away and assume a life of
its own. 

2. Don't try to assemble everything about the customer -- focus on 
relevant information.  As the representative from a dog-food manufacturing
company said, "Make sure they have dogs". (GPS: however, what if there is 
no data on dogs, but it can be inferred with some probability from other
variables ?.  I don't quite agree with this point)

3. Worry about privacy. Otherwise the customers and the courts 
won't like it a bit!

4. Beware of the word "mid-tech".  When the trappings of high tech
are present, but fail to deliver their promise, you have "mid-tech". 

5. Build your database, but don't forget your brand. 
Good marketing technology is no substitute for good product !