KDD Nugget 94:20, e-mailed 94-11-18 Contents: * M. Moulet, CFP: Statistics, ML, and KDD workshop in Europe, April 95 * P. Godfrey, Tech. report on Failing Database Queries * X. Wu, KDD Tutorial Overheads Available by FTP * Blix, AI/ML oriented WWW page * Bob Duin, WWW page for pattern recognition * G. Piatetsky-Shapiro, WSJ: Dark Side of Database Marketing The KDD Nuggets is a moderated list for the exchange of information relevant to Knowledge Discovery in Databases (KDD, also known as Data Mining), e.g. application descriptions, conference announcements, tool reviews, information requests, interesting ideas, clever opinions, etc. It has been coming out about every two-three weeks, depending on the quantity and urgency of submissions. Back issues of nuggets, a catalog of data mining tools, useful references, FAQ, and other KDD-related information are now available at Knowledge Discovery Mine, URL http://info.gte.com/~kdd/ or by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README E-mail contributions to kdd@gte.com Add/delete requests to kdd-request@gte.com -- Gregory Piatetsky-Shapiro (moderator) ********************* Official disclaimer *********************************** * All opinions expressed herein are those of the writers (or the moderator) * * and not necessarily of their respective employers (or GTE Laboratories) * ***************************************************************************** ~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ "This class was a religious experience for me... I had to take it all On faith." "What's the quality of the text? 'Text is printed on high quality paper.'" from MIT Course Evaluation Guide, forwarded by Sam Uthurusamy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Date: Fri, 4 Nov 1994 15:59:27 +0100 From: Marjorie Moulet Subject: cfp Call For Papers for MLnet Sponsored Familiarization Workshop: Statistics, Machine Learning, and Knowledge Discovery In Databases Heraklion, Crete, Greece, April 28-29, 1995 ---------------------------------------------- Submission Deadline: Feb 3, 1995. Introduction The intersection of, and interaction between Machine Learning (ML), statistics and knowledge discovery is a rapidly growing area of interest. There are several areas of common research. On one hand, the considerable amount of data available in databases is an extraordinary source for extracting knowledge and tracing causal dependencies and for this reason, Knowledge Discovery in Databases (KDD) has become an important and new challenge, economically as well as scientifically. On the other hand, the statistical and machine learning algorithms can be considered as robust instruments for KDD. These facts make obvious that the KDD is an interdisciplinary concept benefiting from statistics, machine learning and database technology. The success of the previous workshops on KDD at IJCAI89 and the AAAI (91, 93, 94) Conferences as well as the MLnet Workshop on Machine Learning and Statistics confirm a growing awareness and interest in this area. The main topics of interest: * Statistical and ML approaches for discovery of causal structures in databases * Incremental learning methods to handle the temporal aspects of databases * Statistical and ML approaches for comprehensibility and Validation of discovered knowledge * Software architecture of KDD-Tools * Combining of ML and statistical approaches to build hybrid algorithms for KDD * Focusing aspects to improve the performance of KDD-Tools * Classification and prediction algorithms to handle large-scale data * Successful applications in Medicine, Business and Industry Form of the workshop The main part of the workshop will be devoted to presentations and invited talks. Talks will be grouped according to common interest with plenty of time for discussions. Format and kind of contribution Contributions are invited which present theoretical or practical results as well as review papers in one of the above topics of interest. Submitted papers must not exceed 12 pages, including abstract and bibliography. The title page should include the title of the talk, name(s) and affiliation(s) of the author(s) and a list of keywords as well as the full address (including E-Mail) of the first author. Proceedings will be available at the workshop. The submission of an electronic version (postscript or LaTeX) of a paper is highly recommended. Participation without submitting a full paper is possible but we encourage the all particpants to submit an abstract (up to two pages) to help the organisers to clarify the topics of interests. A sample paper in LaTeX is available via ftp from amsta.leeds.ac.uk in directory pub/ecml95. Contributions to the workshops should be sent to the coordinator (see below the address) by 3rd February'95. Decisions on acceptance will be made by 3rd March (early registration for ECML'95 closes on 17th February). Final camera-ready versions of the revised papers should be sent to the coordinator of the individual workshops by 24th March. Submissions and queries should be sent to the coordinator: Gholamreza Nakhaeizadeh Daimler Benz AG Research and Technology / F3W Postfach 2360 D-89013 Ulm Germany Email: nakhaeizadeh@dbag.ulm.DaimlerBenz.COM Phone: +49 731 505 2860 Fax: +49 731 505 4210 The workshop will be hosted at University of Heraklion, Crete, Greece. It will take place immediately after the European Conference on Machine Learning (ECML95) at April 28-29, 1995. Funding for travelling and living expenses are available according to the rules of the ML-net work of excellence. General enquiries about the workshop should be addressed to Derek Sleeman (Aberdeen) and about local arrangements to Vassilis Moustakis (Crete). Chairs Yves Kodratoff, University of Paris-Sud, France Gholamreza Nakhaeizadeh, Daimler-Benz, Ulm, Germany Charles Taylor, University of Leeds, GB Program Committee Peter Edwards, University of Aberdeen, GB Attilio Giordana, University of Torino, Italy Usama Fayyad, Jet Propulsion Lab, USA Bob Henery, University of Strathclyde, GB Willi Klosgen, GMD, Germany Heikki Mannila, University of Helsinki, Finland Marjorie Moulet, Orsay, France Gregory Piatetsky-Shapiro, GTE, USA Arno Siebes, CWI, Amsterdam, The Netherlands Rudiger Wirth, Daimler-Benz, Ulm, Germany Jan Zytkow, Wichita State University, USA -------------------------------------------- Return-Path: To: KDD Nuggets Moderator Subject: Finding the failure points of failing queries to databases Date: Mon, 07 Nov 1994 23:11:48 -0500 From: Parke Godfrey I am submitting this to the list as I believe it may be relevant to the KDD community. I have written a paper on the finding minimal failing subqueries (MFSs) to failing queries in databases. Parke Godfrey, Minimization in Cooperative Response to Failing Database Queries, Technical Report (CS-TR-3348 and UMIACS-TR-94-108), CSD and UMIACS, University of Maryland at College Park, September 1994. Submitted to the International Journal of Intelligent and Cooperative Information Systems. The abstract is below. At Maryland, we have been engaged in the areas of cooperative answering and cooperative database systems. The premise is that current database systems are often difficult to use because the answers they provide, while correct, are often misleading. Cooperative answering seeks to alleviate this problem by providing more informative responses to queries above and beyond the literal answers to avoid misleading the user. One question that arises is what to do when a database query fails. In the least, it seems reasonable to identify the failure points of the query to report back to the user. So we should not just say that the query failed, but also report which parts of the query failed. Given a conjunctive query (a set of conditions to be satisfied), this means to find minimal subsets of the query (subqueries) that fail. There has been a fair amount of interest in this technique in the literature. These MFSs are also often called "false presuppositions" of the query. The complexity profile of the problem of finding the MFSs of a query is surprising. This paper maps out what the complexity is, and it develops a good algorithm for finding the MSFs; no good (tractable) algorithm has existed before. I believe that the work people are doing in cooperative database systems is closely related to work in KDD. We must analyze queries and the data-store to devise cooperative information (semantic information). In KDD, the data-store is analyzed for the purpose of finding interesting trends and properties (semantic information). One potent tool to probe the data-store is to ask queries. An approach to finding candidate integrity constraints (ICs) over a database is to ask queries one suspects will fail. If the query does fail, it may be that it characterizes the database in general, and should be promoted as an IC. (Of course, this method looks for absolute ICs and is not good for finding probabilistic rules. It can only find strong rules.) However, it is not the failing query itself that we should consider as a possible IC; it is its MFSs that we should consider. I believe that the algorithms I present in the paper might be amenable to such a KDD technique. Our papers on cooperative database systems can be found at our WWW site, including the MFS paper I am advertising here. URL = http://karna.cs.umd.edu:3264/ They are also available by anonymous FTP in ftp.cs.umd.edu:/pub/prism/ If anyone is interested in this work or has comments on it, please let me know. I would appreciate to hear any questions, ideas, comments, and criticisms, and of related works. Thank you. -- Parke Godfrey Department of Computer Science University of Maryland ============================================================================== Minimization in Cooperative Response to Failing Database Queries Parke Godfrey When a query fails, it is more cooperative to identify the cause of failure, rather than just to report the empty answer set. If there is not a cause for the query's failure, it is worthwhile to report the part of the query which failed. To identify a minimal failing subquery (MFS) of the query is the best way to do this. (This MFS is not unique; there may be many of them.) Likewise, to identify a maximal succeeding subquery (MSS) can help a user to recast a new query that leads to a non-empty answer set. Database systems do not provide the functionality of these types of cooperative responses. This may be, in part, because algorithmic approaches to finding the MFSs and the MSSs to a failing query are not obvious. The search space of subqueries is large. Despite work on MFSs in the past, the algorithmic complexity of these identification problems had remained uncharted. This paper shows the complexity profile of MFS and MSS identification. It is shown that there exists a simple algorithm for finding a MFS or a MSS by asking N subsequent queries, in which N is the length of the query. To find more MFSs (or MSSs) can be hard. It is shown that to find O(N) MFSs (or MSSs) is NP-hard. To find k MFSs (or MSSs), for a fixed k, remains polynomial. An optimal algorithm for enumerating MFSs and MSSs, ISHMAEL, is developed and presented. The algorithm has ideal performance in enumeration, finding the first answers quickly, and decaying toward intractability in a predictable manner as more answers are found. The complexity results and the algorithmic approaches given in this paper should allow for the construction of MFS and MSS facilities for database systems. These results are relevant to a number of problems outside of databases too, and may find further application. -------------------------------------------- Date: Thu, 17 Nov 1994 17:01:55 +1100 (EST) From: Xindong Wu Subject: KDD Tutorial Overheads Available by FTP KDD Tutorial Overheads Available by FTP The overheads for the AI'94 Tutorial on Intelligent Learning Database Systems are now available by anonymous ftp from coral.cs.jcu.edu.au in pub/HCV/KDD.ps. The following is an outline of the tutorial. Knowledge acquisition from databases is a research frontier for both database technology and machine learning (ML) techniques, and has seen sustained research over recent years. It also acts as a link between the two fields, thus offering a dual benefit. Firstly, since database technology has already found wide application in many fields, ML research obviously stands to gain from this greater exposure and established technological foundation. Secondly, ML techniques can augment the ability of existing database systems to represent, acquire, and process a collection of expertise such as those which form part of the semantics of many advanced applications (e.g. CAD/CAM). This full-day tutorial presents and discusses techniques for the following 3 interconnected phases in constructing intelligent learning database systems: (1) Translation of standard database information into a form suitable for use by a rule-based system; (2) Using machine learning techniques to produce rule bases from databases; and (3) Interpreting the rules produced to solve users' problems and/or reduce data spaces. It suits a wide audience including postgraduate students and industrial people from databases, expert systems, and machine learning. Comments and suggestions for improvements are solicited! Xindong Wu -------------------------------------------- Date: Fri, 4 Nov 1994 15:44:51 -0600 Subject: AI/ML oriented WWW page From: blix@cs.uiuc.edu (via ML List) The AI Group and the Inductive Learning Group at the University of Illnois maintains a World Wide Web page of useful resources for AI researchers in general and machine learning researchers in particular at: http://www-ilg.cs.uiuc.edu/info/ that is currently stable enough that we think it can go public. Of particular interest is a relatively up-to-date set of call for conferences and workshops. The rest contains mainly links to other useful sources of information. Feel free to use it, and let me know if you have suggestions or additions; they will be taken care of as time permits. -------------------------------------------- Date: Tue, 1 Nov 94 08:36:44 +0100 From: Bob Duin (via ML List) Subject: WWW page for pattern recognition To researchers and students in paterrn recognition, Those who are interested to approach the WWW from the PR point of view and have direct access to some services as well, might try: http://galaxy.ph.tn.tudelft.nl:2000/PRInfo.html Suggestions are welcomed, Bob Duin R.P.W. Duin Phone: (31) 15 786143 Faculty of Applied Physics Fax: (31) 15 626740 Delft University of Technology E-mail: duin@ph.tn.tudelft.nl P.O. Box 5046, 2600 GA Delft The Netherlands -------------------------------------------- Date: Fri, 18 Nov 94 From: Gregory Piatetsky-Shapiro (gps@gte.com) Subject: Wall Street Journal on the Dark Side of database marketing Wall Street Journal, Oct 3, 1994, P. A20, article "Avoid Dark Side of Database Marketing" by James Rosenfield, discusses this new buzzword of top management. The goal of database marketing is to allow management to use computer-generated information about customer's buying decisions to pinpoint those people most likely to buy a product. The potential is great (see Business Week, Sept 1994 and KDD Nugget 94:17). The article warns to avoid these pitfalls: 1. Control the technology: technology can run away and assume a life of its own. 2. Don't try to assemble everything about the customer -- focus on relevant information. As the representative from a dog-food manufacturing company said, "Make sure they have dogs". (GPS: however, what if there is no data on dogs, but it can be inferred with some probability from other variables ?. I don't quite agree with this point) 3. Worry about privacy. Otherwise the customers and the courts won't like it a bit! 4. Beware of the word "mid-tech". When the trappings of high tech are present, but fail to deliver their promise, you have "mid-tech". 5. Build your database, but don't forget your brand. Good marketing technology is no substitute for good product !