KDD Nuggets 95:3, e-mailed 95-02-10

Contents:
* GPS, KDD-95 conference -- 3 weeks till paper submission deadline!
* GPS, Database Mining (tm) is term trademarked by HNC
* E. Irani, Classifier evaluation -- Ph.D. abstract
* T. Finin, faculty position related to KDD
* D. O'Leary, IEEE CAIA 1995 -- Conference Preview
* B. Wuthrich, new report on Knowledge Discovery in Databases

The KDD Nuggets is a moderated mailing list for news and information
relevant to Knowledge Discovery in Databases (KDD), also known as
Data Mining, Knowledge Extraction, etc. Relevant items include tool
announcements and reviews, summaries of publications, information
requests, interesting ideas, clever opinions, etc.

******** Note for Submissions ********************************************
* Please have a descriptive Subject line in your contribution,           *
*  e.g. A nearest monster algorithm application to the Loch Ness problem *
*  or a ABCD-95 workshop on non-monotonic discovery of data in knowledge *
*  instead of "Subject: a submission" or "Subject: a workshop"           *
*                                                                        *
* Workshop, Conference, and other Meetings announcements should be       *
* relevant to Knowledge Discovery in Databases.                          *
**************************************************************************

Nuggets frequency is approximately bi-weekly.
Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
references, FAQ, and other KDD-related information are now available
at Knowledge Discovery Mine, URL http://info.gte.com/~kdd/
or by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README

E-mail add/delete requests to kdd-request@gte.com
E-mail contributions to kdd@gte.com

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories)   *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quotes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some people are like foxes -- they know many little things,
and others are like a hedgehog -- they know one big thing.
    Isaiah Berlin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--------------------------------------------
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: KDD-95 Update
Date: Jan 30, 1995

Reminder -- The Knowledge Discovery and Data Mining (KDD-95) Conference
is getting close. The paper submission deadline is March 3, 1995.
The first submission has already been received!

Here is the submission info (see KDD Nuggets 94:22 or
http://info.gte.com/~kdd/kdd95.html for full call for papers and details).

Please submit 5 *hardcopies* of a short paper (a maximum of 9
single-spaced pages not including cover page but including
bibliography, 1 inch margins, and 12pt font) by March 3, 1995.
A cover page must include the author(s)' full address, E-MAIL,
a 200-word abstract, and up to 5 keywords. This cover page must
accompany the paper.

IN ADDITION, an electronic version of the cover page MUST BE SENT
BY E-MAIL to kdd95@aig.jpl.nasa.gov by March 3, 1995.

Please mail the papers to:

    KDD-95
    AAAI
    445 Burgess Drive
    Menlo Park, CA 94025-3496
    U.S.A.
Send e-mail queries regarding submission logistics to: kdd@aaai.org

--------------------------------------------
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: Database Mining (tm) is term trademarked by HNC
Date: Tue, 7 Feb 1995

I have received an official letter from HNC Software Inc, informing me
that HNC has trademarked the term "DATABASE MINING (tm)" and that this
term should not be used in KDD Nuggets except in reference to their
product.

From now on, I will use the terms "Knowledge Discovery in Databases"
and "Data Mining" instead of the offending term.

Historically, the term "Data Mining" has long been used in statistics
(since the 1960s ??) to describe an approach of looking for patterns in
data without an a priori hypothesis. HNC, according to their letter,
began using "Database Mining (tm)" on June 30, 1989 and received a
trademark on June 30, 1992.

The term "Database Mining (tm)" is very broad, and using it as a
trademark makes as much sense as trademarking the terms "Machine
Learning" or "Artificial Intelligence". Nevertheless, since they do
have a trademark, I will not use the offending term.

However, I am very interested in published uses of the term
"Database Mining (tm)" before June 30, 1989. If you know of any,
please let me know.

-- Gregory Piatetsky-Shapiro

--------------------------------------------
Date: Mon, 6 Feb 1995
From: "Erach A. Irani"
Subject: Classifier evaluation

Classifier evaluation and the use of algorithmic classifiers
with expert system classifiers

Erach A. Irani, Doctoral Thesis, accepted June 1994.
University of Minnesota, Minneapolis.

Abstract

Classification is an important task in the real world. It is important
to develop techniques for the comparison of classifiers. This has to be
done statistically and in terms of well-defined criteria. This is a
major goal of this thesis.
This thesis discusses such techniques specifically in the context of
comparing algorithmic classifiers (algorithms) and expert system
classifiers (expert systems).

The basic criterion used in this thesis is $L$, the weighted loss of
mis-classification. $L$ is a well-established criterion in the
statistical and pattern-recognition literature. A relatively unknown
criterion by Brennan, kappa base ($\kappa_b$), which adjusts for
agreement attainable due to chance, is introduced from the biomedical
and psychological literature. $\kappa_b$ is extended to $\kappa_{bw}$
for the case when the mis-classifications are of varying degrees of
severity.

The importance of using statistical techniques such as hypothesis
testing and confidence intervals is stressed. The use of
cross-validation, the grouped jackknife, and the bootstrap for
estimating the value, standard error, and other properties of a
criterion is shown. A deficiency in the bootstrap method for estimating
Expected Excess Error of a trainable classifier is proved. The
split-recombination (SR) conjecture is introduced to estimate
properties of a trainable classifier and three new techniques, the
split multinomial (SM), split grouped jackknife (SGJK), and the split
bootstrap (SB), are introduced. A simple experiment on artificial data
consisting of two classes from two univariate normal populations is
done to demonstrate the plausibility of the SR conjecture.

The Classifier-based Overall Loss Minimization Strategy (COLMS) and the
Overall Loss Minimization Strategy (OLMS) are introduced to combine
classifier evaluations to improve $L$ and $\kappa_{bw}$. The COLMS is
an independent discovery and development of the coupling procedure for
classifiers introduced by Wernecke. The OLMS extends the ideas in the
COLMS to accommodate arbitrary partitioning algorithms.
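
[Editorial illustration, not part of the thesis abstract.] The criteria
named above can be sketched in a few lines of Python. This is a minimal
sketch under stated assumptions: the chance-adjusted agreement below
uses a uniform chance baseline of 1/k for k classes, a form commonly
attributed to Brennan and Prediger, and the abstract does not give the
thesis's exact definition of $\kappa_b$ or $\kappa_{bw}$. The function
names and the cost matrix layout are hypothetical. The bootstrap shown
is the plain one applied to the fixed predictions of an already trained
classifier, not the split bootstrap (SB) introduced in the thesis.

```python
import random


def weighted_loss(y_true, y_pred, cost):
    """Mean misclassification loss L.

    cost[t][p] is the (assumed) penalty for predicting class p
    when the true class is t; cost[t][t] is normally 0.
    """
    return sum(cost[t][p] for t, p in zip(y_true, y_pred)) / len(y_true)


def chance_adjusted_agreement(y_true, y_pred, n_classes):
    """Observed agreement corrected for a uniform chance baseline.

    Assumed form: (p_o - 1/k) / (1 - 1/k), where p_o is the fraction
    of cases on which classifier and reference agree and k is the
    number of classes.
    """
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    p_c = 1.0 / n_classes
    return (p_o - p_c) / (1.0 - p_c)


def bootstrap_se(y_true, y_pred, statistic, n_resamples=1000, seed=0):
    """Bootstrap standard error of statistic(y_true, y_pred).

    Cases are resampled with replacement; the predictions are held
    fixed, so this measures sampling variability of the criterion,
    not variability due to retraining the classifier.
    """
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(statistic([y_true[i] for i in idx],
                                [y_pred[i] for i in idx]))
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return variance ** 0.5
```

For example, with ten cases over three classes of which eight are
classified correctly, chance_adjusted_agreement gives
(0.8 - 1/3) / (1 - 1/3) = 0.7, and passing
lambda t, p: chance_adjusted_agreement(t, p, 3) as the statistic to
bootstrap_se gives a rough standard error for that value.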
These ideas are demonstrated on real-world atherosclerosis evaluation
data from the Program On the Surgical Control of the Hyperlipidemias
(POSCH) and four real-world classifiers, viz. an expert system ESCA,
multiple linear regression, back-propagation, and C4.5 (an
implementation of ID3). An Expert Defined Domain-Specific Strategy
(EDSS) for combining classifiers is compared with the COLMS.

The use of an expert-defined test (EDT) is introduced to further
increase confidence in a classifier, especially an algorithmic
classifier. The EDT examined was based on perturbation testing. The
domain experts agreed, after viewing the results of the rigid
statistical evaluation and the performance on the EDT of ESCA and
multiple linear regression, that multiple linear regression could be
used instead of ESCA for the subset of the POSCH atherosclerosis
evaluation task that ESCA performed.

--------------------------------------------
From: Timothy Finin
Date: Mon, 6 Feb 1995 20:22:16 -0500
Subject: KDD faculty position

University of Maryland Baltimore County
Computer Science Department
Electrical Engineering Department

The University of Maryland Baltimore County (UMBC) invites applications
for faculty positions at all levels for joint tenure-track and term
appointments in the Departments of Computer Science and Electrical
Engineering. Of particular interest are candidates with backgrounds in
database systems, information retrieval, parallel and distributed
information systems, mediated software architectures, KNOWLEDGE
DISCOVERY IN DATABASES, scientific visualization, virtual environments,
multi-media information and communication systems, communication
protocols, digital library applications, compression, enhancement, and
recognition of data and images, performance evaluation of communication
networks and links, processing, transmission, and storage of integrated
voice, data, graphics and video.
The Departments of Computer Science and Electrical Engineering consist
of 25 full-time and 32 adjunct faculty, offer BS, MS, and Ph.D.
degrees, and have approximately 210 graduate and 650 undergraduate
students. Strong research programs exist in both departments with grant
support from industry and government agencies including ARPA, AFOSR,
ARL, NASA, NIST, NOAA, NSA, NSF, ONR, IBM, Unisys, Sprint, Bell Labs,
and Martin Marietta Labs.

Substantial opportunities exist for collaborative efforts with NASA
scientists through the Center of Excellence in Space Data and
Information Sciences (CESDIS) operated by Universities Space Research
Association (USRA) at NASA's Goddard Space Flight Center in Greenbelt,
Maryland, a 30-minute drive from the UMBC campus. The CESDIS mission is
to build a national group of researchers interested in collaborating on
research in key problem areas which affect NASA's efforts to collect,
manage, store, and process massive Earth and space science data sets.

The UMBC campus has 10,500 students and is attractively located in the
Baltimore-Washington corridor, providing easy access to both
metropolitan areas and to numerous federal agencies and industrial
research centers. UMBC is joined at the graduate level with the
University of Maryland at Baltimore (UMAB), resulting in the University
of Maryland Graduate School Baltimore with combined research funding of
over $140M.

Applications will be accepted through May 1, 1995. Address application
letter and CV to: CS/EE Faculty Search, Department of Computer Science,
University of Maryland Baltimore County, Baltimore, MD 21228-5398.
Arrange for three letters of reference to be submitted to the same
address. For additional information, send e-mail to
search-csee@cs.umbc.edu or access "http://www.cs.umbc.edu/".

UMBC is an affirmative action/equal opportunity employer.
--------------------------------------------
Date: Thu, 26 Jan 1995 10:37:27 -0800
From: Dan Oleary
Subject: IEEE CAIA 1995 -- Conference Summary

IEEE Conference for AI Applications (CAIA) 1995 will be held in
Los Angeles, February 20-22. Plenary talks will be given each day of
the conference.

The first day of the conference includes discussions by Judea Pearl
(UCLA) (graphical models), Mario Schkolnick (head of IBM's efforts on
knowledge discovery) and an editor panel, including Steve Cross (CMU &
IEEE Expert) and Ramesh Patil (USC-ISI & AI Magazine).

The second day of the conference focuses on the use of AI at JPL (Usama
Fayyad) and ISI (Herb Schorr), with particular emphasis on
applications. The second day continues with a discussion on subjective
probability, with a plenary talk by Amos Tversky (Stanford).

The final day of the conference focuses on knowledge discovery and soft
computing. Se June Hong (IBM) presents a talk on using contextual
information. Most of the morning will be spent on a panel discussing AI
and Soft Computing. The panel, chaired by Bernadette Bouchon-Meunier,
includes Lotfi Zadeh (UC - Berkeley), I. R. Goodman (Naval Ocean
Systems Center), Abraham Kandel (USF), Hung Nguyen (New Mexico State),
Anca Ralescu (University of Cincinnati, and LIFE, Japan), Enrique
Ruspini (SRI), Ronald Yager (Iona College), and John Yen (Texas A&M).

Each afternoon of the conference will also include submitted papers.
The conference is preceded on February 19 by tutorials and workshops.

If you have questions, contact either Dan O'Leary (oleary@rcf.usc.edu)
or John Mee (j.mee@computer.org).

Information is available at url
gopher://cwis.usc.edu:70/11/University_Information/Academic_Departments/Business_Administration/Research/IEEE_CAIA

--------------------------------------------
Date: Wed, 8 Feb 95 18:00:20 HKT
From: beat@cs.ust.hk (DR. BEAT WUTHRICH)
To: kdd@gte.com
Subject: new version of TR
Cc: beat@cs.ust.hk

Dear Gregory

I prepared a new version of my tech rep "Knowledge Discovery in
Databases". It is now extended, and a lot of former typos are out.
It contains 101 pages.

The tech rep HKUST-TRCS-95-4 is accessible via ftp.

1) `ftp ftp.cs.ust.hk`
2) login as: anonymous
3) cd pub/techreport/postscript
4) get tr95-4.ps.gz

Then use gunzip to decode the file.

If you like, you could announce it under kdd%eureka@gte.com

Best regards, Beat

Dr. Beat Wuthrich
Assistant Professor, CS Dept
The Hong Kong University of Science and Technology
Clear Water Bay
Kowloon, Hong Kong
Tel. (852) 2358 7013
Fax: (852) 2358 1477
email: beat@cs.ust.hk
------------------------------------------------------------
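
[Editorial illustration.] The four ftp steps and the gunzip step in the
message above can be scripted with Python's standard ftplib and gzip
modules. This is a minimal sketch, assuming the server still accepts
anonymous logins and the path is unchanged, which may no longer be
true. The function name fetch_gzipped_report and the ftp_factory
parameter are hypothetical and exist only so the sketch can be
exercised without a network connection.

```python
import gzip
import io
from ftplib import FTP


def fetch_gzipped_report(host, directory, filename, ftp_factory=FTP):
    """Fetch directory/filename by anonymous FTP and return gunzipped bytes.

    ftp_factory is an illustrative hook: any callable that takes a host
    name and returns an object with login, cwd, retrbinary and close.
    """
    buffer = io.BytesIO()
    ftp = ftp_factory(host)       # step 1: connect to the host
    try:
        ftp.login()               # step 2: ftplib logs in as "anonymous" by default
        ftp.cwd(directory)        # step 3: change to the report directory
        # step 4: binary transfer, collecting the blocks in memory
        ftp.retrbinary("RETR " + filename, buffer.write)
    finally:
        ftp.close()
    # Equivalent of running gunzip on the downloaded file.
    return gzip.decompress(buffer.getvalue())


# Usage with the values from the message (not executed here):
# data = fetch_gzipped_report("ftp.cs.ust.hk",
#                             "pub/techreport/postscript",
#                             "tr95-4.ps.gz")
# with open("tr95-4.ps", "wb") as out:
#     out.write(data)
```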