KDD Nugget 94:17, e-mailed 94-09-20

Contents:
 * R. Valdes-Perez, CFP: Systematic Methods of Scientific Discovery
 * M. J. Wolfe, Large Supermarket Dataset available
 * M. Holsheimer, Two KDD-94 papers available by ftp
 * AI-Stats list, Making Sense of Data: Short Course, Oct 13-14, 1994
 * G. Piatetsky-Shapiro, Business Week cover story on Database Marketing

The KDD Nuggets is a moderated list for the exchange of information
relevant to Knowledge Discovery in Databases (KDD, also known as Data
Mining), e.g. application descriptions, conference announcements, tool
reviews, information requests, interesting ideas, clever opinions, etc.
It has been coming out about every two to three weeks, depending on the
quantity and urgency of submissions.

Back issues of Nuggets, a catalog of data mining tools, useful references,
an FAQ, and other KDD-related information are now available at the
Knowledge Discovery Mine, URL http://info.gte.com/~kdd/
or by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README

E-mail contributions to kdd@gte.com
Add/delete requests to kdd-request@gte.com

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories)   *
******************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"We trained hard, but it seemed that every time we were beginning to form
up into teams we would be reorganized. I was to learn later in life that
we tend to meet any new situation by reorganizing, and a wonderful method
it can be for creating the illusion of progress while producing confusion,
inefficiency and demoralization."
                                        Gaius Petronius, 210 B.C.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Date: Mon, 15 Aug 94 10:28:32 EDT
From: Raul.Valdes-Perez@CARMEN.KBS.CS.CMU.EDU
Subject: Symposium call

                     AAAI 1995 Spring Symposium Series
                           March 27 - 29, 1995
                     Stanford University, California

                         Call for Participation

     Sponsored by the American Association for Artificial Intelligence
     445 Burgess Drive, Menlo Park, CA 94025
     (415) 328-3123, sss@aaai.org

The American Association for Artificial Intelligence presents the 1995
Spring Symposium Series, to be held Monday through Wednesday, March 27-29,
1995, at Stanford University. The topics of the nine symposia in the 1995
Spring Symposium Series are:

 o Empirical Methods in Discourse Interpretation and Generation
 o Extending Theories of Action: Formal Theory and Practical Applications
 o Information Gathering from Heterogeneous, Distributed Environments
 o Integrated Planning Applications
 o Interactive Story Systems: Plot and Character
 o Lessons Learned from Implemented Software Architectures for Physical Agents
 o Representation and Acquisition of Lexical Knowledge: Polysemy,
   Ambiguity, and Generativity
 o Representing Mental States and Mechanisms
 o Systematic Methods of Scientific Discovery

Symposia will be limited to between forty and sixty participants. Each
participant will be expected to attend a single symposium. Working notes
will be prepared and distributed to participants in each symposium. A
general plenary session, in which the highlights of each symposium will be
presented, will be held on Tuesday, March 28, and an informal reception
will be held on Monday, March 27.
In addition to invited participants, a limited number of other interested
parties will be able to register in each symposium on a first-come,
first-served basis. Registration will be available by January 1, 1995. To
obtain registration information, write to AAAI at 445 Burgess Drive, Menlo
Park, CA 94025 (sss@aaai.org).

Submission Dates:
 o Submissions for the symposia are due on October 28, 1994.
 o Notification of acceptance will be given by November 30, 1994.
 o Material to be included in the working notes of the symposium must be
   received by January 20, 1995.

See the appropriate section below for the specific submission requirements
for each symposium. This document is available as
http://www.ai.mit.edu/people/las/aaai/sss-95/sss-95-cfp.html

**********************************************************************

SYSTEMATIC METHODS OF SCIENTIFIC DISCOVERY

TITLE: Systematic Methods of Scientific Discovery

DESCRIPTION:

Scientific discovery is surely among the most celebrated creative
processes. Discovery receives scholarly attention from several
disciplines, not least of which is artificial intelligence. AI has
explored the view that much of scientific reasoning is problem solving,
and hence is akin to more ordinary types of reasoning. Experience has
shown that some scientific reasoning can be automated: research on
discovery has already yielded competent programs that, e.g., plan organic
syntheses, elucidate molecular structure, determine reaction mechanisms,
make interesting graph-theoretic conjectures, and detect patterned
behavior. Where all this may lead was foreseen by Allen Newell
[Artif. Intell. 25(3), 1985]:

   [The field] should, by the way, be prepared for some radical, and
   perhaps surprising, transformations of the disciplinary structure of
   science (technology included) as information processing pervades it.
   In particular, as we become more aware of the detailed information
   processes that go on in doing science, the sciences will find
   themselves increasingly taking a metaposition, in which doing science
   (observing, experimenting, theorizing, testing, archiving, ...) will
   involve understanding these information processes, and building
   systems that do the object-level science. Then the boundaries between
   the enterprise of science as a whole (the acquisition and organization
   of the knowledge of the world) and AI (the understanding of how
   knowledge is acquired and organized) will become increasingly fuzzy.

The goals of this symposium are to examine how far we have come toward
realizing Newell's vision, to identify fruitful current opportunities, and
to discuss the obstacles to further progress in understanding systematic
methods of scientific inference. We solicit contributions that advance
these goals. Some examples of appropriate contributions include:

 o A program that automates a complex and creative scientific task.
 o A new systematic method of scientific inference, even if its automation
   is not yet feasible.
 o A new representation or classification of science that enhances efforts
   to systematize it.
 o New opportunities for known systematic methods.
 o A recent scientific achievement in which the computer played an
   essential creative role.
 o New heuristics for scientific research, e.g., ones that promise to make
   practicable some aspect of automated scientific reasoning.
 o Computational models of historical discoveries in science.
 o Cognitive studies of the scientific process that promise to contribute
   to computational approaches.
Contributions that potentially bear on more than one scientific area and
that are demonstrably effective are of special interest.

SUBMISSION INFORMATION:

Prospective participants are invited to submit (in paper form) one of the
following to the symposium chair: three copies of an extended abstract (at
most 5 pages) of work to be presented, a description of research in
progress, or a statement describing what you hope to contribute to and
gain from the symposium. Please send submissions and information requests
to Raul Valdes-Perez, Computer Science Department, Carnegie Mellon
University, Pittsburgh, PA 15213, USA (phone: 412 268-7127, fax: 412
621-5117, email: valdes@cs.cmu.edu).

ORGANIZING COMMITTEE:
 Lindley Darden              Maryland
 Joshua Lederberg            Rockefeller
 Herbert Simon               Carnegie Mellon
 Derek Sleeman               Aberdeen
 Raul Valdes-Perez (chair)   Carnegie Mellon

-----------------------------

Date: Tue, 06 Sep 1994 21:33:22 -0400 (EDT)
From: MJWOLFE@delphi.com
Subject: Re: info. about this data

The dataset comes from UPC or point-of-sale scanners in 3000+ supermarkets
throughout the U.S. It is aggregated into 54 "city-markets", 160+ weekly
observations x 50-250 brands (depending on the category), and several
"dependent-type" measures (e.g. unit or dollar sales, or market share),
along with 100+ "causal" or independent-variable measures (e.g. price
(promoted and non-promoted), advertising (print magazine, television, or
retailer feature advertising), merchandising (end-cap displays), and
manufacturer coupons). The data can also be cross-tabbed by demographic
factor or by retail chain (the latter might not be available due to
reporting limitations).

This dataset is rich with econometric and consumer-behavior possibilities;
nothing of this depth or dimensionality exists in any other industry. It
is available at a reduced cost to academics/researchers with an interest
in applying such data to test hypotheses and pursue various
knowledge-expanding or academic interests; it is not intended for
commercial or for-profit ventures.

Sorry about the spelling; I'm not a poor speller, it's just this *%#! echo
and lack of an editor. This should be good enough for your particular
purposes. Thanks,

Michael Wolfe

-----------------------------

Subject: Data Mine publications
Date: Wed, 07 Sep 1994 10:56:52 +0200
From: Marcel Holsheimer

Hi Gregory, could you add the following message to the upcoming
KDD-Nuggets? Best regards, Marcel.

Two papers, published at the KDD-94 workshop, are available on the CWI
ftp site:

Report CS-R9429:

@inproceedings{holsheimer94architectural,
  author    = {Marcel Holsheimer and Martin L. Kersten},
  title     = {Architectural Support for Data Mining},
  booktitle = {Proc. of the {AAAI-94} Workshop on Knowledge Discovery
               in Databases},
  address   = {Seattle, Washington},
  pages     = {217--228},
  year      = {1994}}

Abstract

One of the main obstacles in applying data mining techniques to large,
real-world databases is the lack of efficient data management. In this
paper, we present the design and implementation of an effective two-level
architecture for a data mining environment. It consists of a mining tool
and a parallel DBMS server. The mining tool organizes and controls the
search process, while the DBMS provides optimal response times for the few
query types being used by the tool. Key elements of our architecture are
its use of fast and simple database operations, its re-use of results
obtained by previous queries, its maximal use of main memory to keep the
database hot set resident, and its parallel computation of queries.
Apart from a clear separation of responsibilities, we show that this
architecture leads to competitive performance on large data sets.
Moreover, this architecture provides a flexible experimentation platform
for further studies in the optimization of repetitive database queries and
in quality-driven rule discovery schemes.

CR subject classification (1991): Data storage representations (E.2),
Database systems (H.2.4) parallel systems, query processing, Information
search and retrieval (H.3.3), Learning (I.2.6) induction, knowledge
acquisition

Keywords & Phrases: data mining, parallel databases, inductive learning,
knowledge discovery in databases

====================================================================

Report CS-R9430:

@inproceedings{siebes94homogeneous,
  author    = {Arno Siebes},
  title     = {Homogeneous Discoveries Contain no Surprises: Inferring
               Risk-profiles from Large Databases},
  booktitle = {Proc. of the {AAAI-94} Workshop on Knowledge Discovery
               in Databases},
  address   = {Seattle, Washington},
  pages     = {97--108},
  year      = {1994}}

Abstract

Many models of reality are probabilistic. For example, not everyone orders
crisps with their beer, but a certain percentage does. Inferring such
probabilistic knowledge from databases is one of the major challenges for
data mining. Recently Agrawal et al. investigated a class of such
problems. In this paper a new class of such problems is investigated,
viz., inferring risk-profiles. The prototypical example of this class is:
"What is the probability that a given policy-holder will file a claim with
the insurance company in the next year?" A risk-profile is then a
description of a group of insurants that have the same probability of
filing a claim. It is shown in this paper that homogeneous descriptions
are the most plausible risk-profiles. Moreover, under modest assumptions
it is shown that covers of such homogeneous descriptions are essentially
unique. A direct consequence of this result is that it suffices to search
for the homogeneous description with the highest associated probability.
The main result of this paper is thus that the inference problem for
risk-profiles reduces to the well-studied problem of maximising a quality
function.

CR subject classification (1991): Computer-based methods in probability
and statistics (G.3), Database applications (H.2.8), Information search
and retrieval (H.3.3) clustering, search process, Learning (I.2.6) concept
learning, induction, knowledge acquisition

Keywords & Phrases: Data Mining, Probabilistic Knowledge, Probabilistic
Search, Probability Theory

====================================================================
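[Moderator's note: to make the search problem in the second abstract
concrete, here is a minimal sketch of the underlying idea. This is not the
paper's algorithm; the policy records, attribute names, and support
threshold below are invented. The sketch enumerates simple conjunctive
descriptions of policy-holders, estimates the claim probability of the
group each description covers, and keeps the description with the highest
estimate.]

# Minimal sketch (hypothetical data): score candidate descriptions of
# policy-holders by the claim frequency of the group they cover, and
# keep the description with the highest estimated claim probability.

from itertools import combinations

# Hypothetical policy-holder records: attributes plus whether a claim was filed.
policies = [
    {"age_band": "18-25", "car": "sports", "region": "city",  "claim": True},
    {"age_band": "18-25", "car": "sedan",  "region": "rural", "claim": False},
    {"age_band": "26-40", "car": "sports", "region": "city",  "claim": True},
    {"age_band": "26-40", "car": "sedan",  "region": "city",  "claim": False},
    {"age_band": "41-65", "car": "sedan",  "region": "rural", "claim": False},
    {"age_band": "18-25", "car": "sports", "region": "rural", "claim": True},
]

attributes = ["age_band", "car", "region"]
MIN_SUPPORT = 2  # require at least this many matching records

def matches(record, description):
    """A description is a dict of attribute -> required value."""
    return all(record[a] == v for a, v in description.items())

def quality(description):
    """Estimated claim probability among the records covered by the description."""
    covered = [r for r in policies if matches(r, description)]
    if len(covered) < MIN_SUPPORT:
        return None
    return sum(r["claim"] for r in covered) / len(covered)

# Enumerate simple conjunctive descriptions over one or two attributes.
best, best_q = None, -1.0
for k in (1, 2):
    for attrs in combinations(attributes, k):
        for values in {tuple(r[a] for a in attrs) for r in policies}:
            desc = dict(zip(attrs, values))
            q = quality(desc)
            if q is not None and q > best_q:
                best, best_q = desc, q

print("highest-probability description:", best, "estimated p(claim) =", best_q)

[Here the estimated claim probability simply plays the role of the quality
function being maximised.]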
Both reports can be retrieved by anonymous ftp:

   & ftp ftp.cwi.nl
   Name (ftp.cwi.nl:marcel): ftp
   331 Guest login ok, send ident (your e-mail address) as password.
   Password:
   ftp> binary
   ftp> cd pub/CWIreports/AA
   ftp> get CS-R9429.ps.Z
   ftp> get CS-R9430.ps.Z
   ftp> bye

________________________________________________________________________
Marcel Holsheimer      | Centre for Mathematics and Computer Science (CWI)
phone +31 20 592 4134  | Kruislaan 413, Amsterdam, The Netherlands

--------------------------------------------

From: elder@masc4.rice.edu (John Elder)
Date: Wed Sep 7 16:38:19 1994
To: ai-stats@watstat.uwaterloo.ca
Cc: rbic@ruf.rice.edu, hess@access.digex.net (Paul Hess)
Subject: Short Course Announcement

          Making Sense of Data: Computer-Aided Pattern Discovery
           A Rice University Short Course, October 13-14, 1994

Is there useful information hidden in your collection of data? How can you
extract patterns and trends to classify a new case and give it contextual
meaning? This two-day intensive short course will provide an introduction
to new technologies in computer-aided data analysis and "machine
learning." These new methods can be applied to many everyday scientific,
engineering, and financial problems, such as medical diagnosis,
manufacturing quality control, oil exploration, and cost estimation.

Research scientist Dr. John Elder will begin with a review of basic
statistical concepts, then explain, step by step, how new and accessible
computer technology can enable researchers to make sense of data and
forecast outcomes based on sample cases. Methods such as neural networks,
decision trees, regression networks, and kernels can often discover useful
structure in large or small databases, especially when combined with
scientific visualization. A number of powerful inductive modeling
techniques are now available, both commercially and through research
laboratories. The instructor will present a selection of these methods,
explain their strengths and weaknesses, and demonstrate how to use them
effectively.

Who Should Attend:

Engineers, scientists, medical researchers, and business people who wish
to understand the most recent technological developments in pattern
discovery and inductive modeling. At the conclusion of this course,
participants should be able to discern the strengths of competing methods
and to select and use the appropriate tools for their applications.
Participants should have some working experience with computers and/or
some knowledge of statistics.

Course Outline:
   Pattern Discovery: An Overview
   Inducing Models from Data: Major Issues
   Review of Key Statistical Concepts
   Scientific Visualization
   Multidimensional Optimization
   Local and Global Regression and Subset Selection
      Fixed, Stepwise, Branch and Bound
   Polynomial Networks (GMDH, AIM)
   Decision Trees (CART, TX2step)
   Neural Networks
   Case-based Methods
      Nearest Neighbors, Kernels, Radial Basis Functions
   Related Hot Topics
      Expert Systems, Fuzzy Logic, Genetic Algorithms
   Examples of Applications
      Diagnosing breast cancer
      Classifying bat species
      Estimating rocket engine temperature
      Investing in the bond market

Instructor:

Dr. John Elder is a research scientist in the Department of Computational
and Applied Mathematics and at the Center for Research on Parallel
Computing at Rice University. He has been researching and applying
inductive methods for more than a decade in industry and academia, and is
an active consultant to industry. He has been a research scientist for an
engineering consulting business as well as director of research for an
investment management firm.
Dr. Elder is the author of two chapters in a forthcoming book on neural
networks and has published papers on forecasting methods in scientific and
business journals. He has also written dozens of papers on innovations in
multidimensional optimization, machine learning, feature selection,
adaptive regression, scientific visualization, and statistical
applications to financial markets. A member of the Institute of Electrical
and Electronics Engineers, the American Institute of Aeronautics and
Astronautics, and the American Statistical Association, Dr. Elder holds a
Ph.D. in systems engineering from the University of Virginia.

When and Where:

Thursday-Friday, October 13-14, 9:00 a.m.-12:00 noon and 1:00-4:00 p.m.,
on the Rice University campus, 6100 South Main, Houston, Texas.

Fee:

$185. The fee includes lecture notes and background materials. Discount
price information on a selection of commercial software will be available
to participants.

Refund Policy:

The course fee will be refunded in full if enrollment is canceled in
writing by October 6. If you drop the course after October 6 but before
October 13, a refund will be issued only if a replacement for you can be
found, and a 20 percent processing fee will be deducted from your refund.
Refunds will not be issued after the course begins.

Accommodations:

Out-of-town registrants will receive a list of hotels located near the
Rice campus with their enrollment acknowledgments. A campus map will also
be included.

To Register:

Call the Rice University School of Continuing Studies, (713) 527-4803, or
fax your request for a registration form to (713) 285-5213, or e-mail
rbic@ruf.rice.edu (Rhonda Bice). Dr. Elder can be reached at
elder@rice.edu

------------------------------

Date: 20 Sep 94
From: Gregory Piatetsky-Shapiro (gps@gte.com)
Subject: Business Week on Database Marketing

A recent issue of Business Week (Sep 5, 1994) featured a cover story on
Database Marketing. Here is my summary of that article.

Here is how Database Marketing works (a rough sketch of steps 4 and 5
appears below):

Step 1. A consumer buys a product (anything from an airline ticket to a
        xylophone).
Step 2. The transaction is recorded directly or indirectly (when the
        consumer sends in a coupon, fills out a warranty card, etc.) and
        is entered into the customer database.
Step 3. Demographic, census, credit, and other available and permissible
        data are added and merged with the customer record.
Step 4. The company uses ML/statistical techniques (typically neural
        networks) to identify one or more profiles of a desired customer.
Step 5. The profile(s) are applied back to the customer database to select
        customers similar to the desired ones.
Step 6. The generated lists of desired customers are then used for direct
        mail, telemarketing, etc.

While a typical direct-mail response rate is between 2 and 4%, database
marketing response rates of 10-20%, and sometimes even higher, have been
reported! No wonder that, according to Donnelley Marketing, 56% of
manufacturers and retailers are currently using or building such a
database. American Express has used a form of database marketing in a test
in Europe and reports a 15% increase in charges; a test in the USA is
planned for 1995. Other companies mentioned among the pioneers of database
marketing include Philip Morris, House of Seagram, General Motors, and
Blockbuster Entertainment.
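Here is the promised rough sketch of steps 4 and 5. It is not how any of
the companies in the article actually do it: the customer records and
attribute names are invented, and a simple "mean profile plus distance"
scoring stands in for the neural-network models the article mentions.

# Illustrative sketch of steps 4 and 5 (all names and numbers are made up).
# Step 4: build a "desired customer" profile from customers who responded
#         to a past campaign.
# Step 5: score every not-yet-contacted customer by similarity to that
#         profile and keep the closest matches for the mailing list.

import math

# Hypothetical customer records: (customer_id, attribute vector, responded?)
# Attributes might be, e.g., [age, income in $1000s, purchases last year].
customers = [
    ("c001", [34, 55, 12], True),
    ("c002", [61, 40,  2], False),
    ("c003", [29, 72, 15], True),
    ("c004", [45, 38,  3], False),
    ("c005", [31, 60, 10], None),   # response unknown: not yet contacted
    ("c006", [58, 42,  1], None),
]

# Step 4: the profile is simply the mean attribute vector of past responders.
responders = [attrs for _, attrs, responded in customers if responded]
profile = [sum(col) / len(responders) for col in zip(*responders)]

def distance(attrs, profile):
    """Euclidean distance between a customer and the desired-customer profile."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(attrs, profile)))

# Step 5: rank the not-yet-contacted customers by closeness to the profile.
candidates = [(cid, attrs) for cid, attrs, responded in customers
              if responded is None]
ranked = sorted(candidates, key=lambda c: distance(c[1], profile))

mailing_list = [cid for cid, _ in ranked[:1]]   # e.g. mail only the top match
print("profile:", profile)
print("mailing list:", mailing_list)

A real system would standardize the attributes and fit an actual model
(the article mentions neural networks) rather than use raw Euclidean
distance, but the flow is the same: learn a profile from known responders,
then score the whole customer database against it.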
Business Week also discusses the flip side of Database Marketing -- the
potential for invasion of privacy -- and warns that if marketers do not
police themselves, a public backlash will force Congress to enact tough
restrictions on data use, such as requiring ***explicit*** pre-authorization
for any use of personal data.