KDD Nuggets 95:22, e-mailed 95-09-11

Contents:
* GPS, IEEE Expert, Aug 1995, article on Data-Mining the Cosmos at JPL
* D. Grube, Opinion: re Nuggets 95:21, "Expert Systems and KDD"
* GPS, Test Datasets for Data Mining
  http://info.gte.com/~kdd/datasets.html
* Radford Neal, Software for Bayesian Learning For Neural Networks
  http://www.cs.toronto.edu/~radford
* GPS, European Privacy Laws Strengthened
* E. Rigdon, CFP: Advanced Research Techniques Forum
* B. Prior, HPCwire: Data Mining and Forecasting at J.P. MORGAN
* M. Kamath, Web Site to Search Database Bibliographies
  http://www-ccs.cs.umass.edu/db/bib-search.html

The KDD Nuggets is a moderated mailing list for information relevant to Data Mining and Knowledge Discovery in Databases (KDD). Please include a DESCRIPTIVE subject line and a URL, when available, in your submission. Nuggets frequency is approximately weekly.

Back issues of Nuggets, a catalog of S*i*ftware (data mining tools), references, FAQ, and other KDD-related information are available at Knowledge Discovery Mine, URL http://info.gte.com/~kdd
by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README

E-mail add/delete requests to kdd-request@gte.com
E-mail contributions to kdd@gte.com

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories)   *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A term for waiting for your Web Browser: World Wide Wait
    Nahum Gershon (MITRE), at SIGGRAPH-95

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 8 Sep 1995 16:32:42 -0400
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: IEEE Expert, Aug 1995, article on Data-Mining the Cosmos at JPL

A very nice article by Dick Price,
entitled "Starlight, Star Bright: Data Mining the Cosmos", profiling the work of Usama Fayyad, Padhraic Smyth (both of JPL) and their colleagues at Caltech, was published in IEEE Expert, Aug 1995.

SKICAT, the Sky Image Cataloging and Analysis Tool, developed by Fayyad and others, uses a decision-tree approach to classify the millions of sky objects, and does it much faster and more accurately than any human could. SKICAT was first presented at ML-93 and KDD-93 and has since appeared in several publications, including the forthcoming Advances in Knowledge Discovery and Data Mining (AAAI/MIT Press). To date, SKICAT has cataloged hundreds of millions of sky objects and has recently helped discover 10 new high red-shift quasars.

Another JPL system, JARtool (JPL Adaptive Recognition Tool), developed by Usama and Padhraic in collaboration with Burl and Perona at Caltech-EE, has been applied to the analysis of Venus SAR images, and enables the cataloging of small volcanoes on the surface. JARtool has also been described at KDD-93 and KDD-94, and in the forthcoming Advances in KDDM book.

The IEEE Expert article is interesting (and complements the technical articles on SKICAT and JARtool) because it contains numerous quotes from the scientist's (user's) perspective illustrating the challenges and potential of KDD in science applications. There's room for many more applications! Congrats again to the entire JPL/Caltech team!

For more information on JPL work see http://www-aig.jpl.nasa.gov/

GPS

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 7 Sep 1995 13:28:32 -0400
From: dmg@cartoon.lc.att.com (Grube David)
To: kdd@gte.com
Subject: Opinion: In response to issue 95:21, "Expert Systems and KDD"

Mr. Gregory Piatetsky-Shapiro, KDD Aficionados,

Having worked with database and knowledge-base systems for several years, I recently have become interested in KDD, and am an avid reader of KDD Nuggets.
I'm doing a bit of research into a KDD system for use at AT&T; in particular, I'm pursuing an approach to KDD that provides for a reasonably autonomous discovery process, yet restricts the problem space via automated application of domain knowledge. That, however, is not my reason for writing to Nuggets.

In the 95:21 issue of Nuggets, a Nuggets member sent in an editorial expressing a concern over the database component versus the expert system component of KDD. What follows is not a response aimed at this particular individual, but rather at the computer science "industry" (academics and otherwise) at large.

At the risk of starting an all-out offensive, which is NOT what KDD Nuggets should be concerned with or waste space on, I posit that KDD is, if nothing else, an inter/intra-disciplinary endeavor. It involves DBMSs, AI, knowledge-base systems, expert systems, statistics, etc. Each component serves an important function in the overall system, and rather than pit the disciplines against one another by over- or under-emphasizing the relative merits of one versus the other (a boring effort that produces no tangible result), we should strive to educate: not only others, but also ourselves, in the perspectives of disciplines other than our own. If, for example, I am a database expert, I daresay that I have quite a bit to learn from other branches of computer science that will only help me in applying my database principles to KDD.

Over the years, the lines of demarcation between the various disciplines within computer science have become blurred. While definition is good (and at some point a line typically MUST be drawn between where/when component A of a KDD system takes over from component B), I prefer to look at the components as working together towards a common goal: producing knowledge that was not previously known.
I suspect that this is the intended result of KDD: not to find out which particular component of a KDD system is more important than the others. Let us put these irrelevant issues to rest and get on with the real work: researching the various disciplines and the branches within those disciplines, applying what we have learned, and ultimately producing useful KDD (and other) systems. In this manner we benefit ourselves, our employers, and the computer industry at large.

I will close with some humor. I recently attended a 2-day Knowledge Discovery in Databases conference held by my employer, AT&T. An interesting comment came out of this conference in reference to a particular software vendor. This vendor markets a reporting tool, and when KDD/data mining was not as popular a term, it was marketed as simply "an ad-hoc database reporting tool". Now, however, in an attempt to capitalize on the burgeoning KDD/data mining market, the same ad-hoc reporting tool is being billed as a "knowledge discovery and data mining tool". My friends, let us not fight amongst ourselves; let us unite to fight the common enemy: the software vendors. :-)

David M. Grube
AT&T Bell Laboratories
Warren, NJ

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 7 Sep 1995 11:20:15 -0400
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: Datasets for Data Mining

I have added a new top-level entry at KD Mine with datasets useful for testing data mining. Here it is below. Any additions or updates are most welcome (in particular, the current pointer to STATLOG datasets does not work -- does anybody know a more recent one?)

-- GPS

----------------
Datasets for Data Mining


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Subject: Software for Bayesian learning available
From: Radford Neal
Date: Thu, 10 Aug 1995 13:04:42 -0400

Announcing Software for

    BAYESIAN LEARNING FOR NEURAL NETWORKS

    Radford Neal, University of Toronto

Software for Bayesian learning of models based on multilayer perceptron networks, using Markov chain Monte Carlo methods, is now available by ftp. This software implements the methods described in my Ph.D. thesis, "Bayesian Learning for Neural Networks". Use of the software is free for research and educational purposes.

The software supports models for regression and classification problems based on networks with any number of hidden layers, using a wide variety of prior distributions for network parameters and hyperparameters. The advantages of Bayesian learning include the automatic determination of "regularization" parameters, without the need for a validation set, avoidance of overfitting when using large networks, and quantification of the uncertainty in predictions. The software implements the Automatic Relevance Determination (ARD) approach to handling inputs that may turn out to be irrelevant (developed with David MacKay).

For problems and networks of moderate size (eg, 200 training cases, 10 inputs, 20 hidden units), full training (to the point where one can be reasonably sure that the correct Bayesian answer has been found) typically takes several hours to a day on our SGI machine. However, quite good results, competitive with other methods, are often obtained after training for under an hour. (Of course, your machine may not be as fast as ours!)

The software is written in ANSI C, and has been tested on SGI and Sun machines. Full source code is included.

Both the software and my thesis can be obtained by anonymous ftp, or via the World Wide Web, starting at my home page. It is essential for you to have read the thesis before trying to use the software.
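[Moderator's note: the package itself should be obtained from the addresses below. Purely as an illustration of the core idea, Markov chain Monte Carlo sampling from a posterior, here is a minimal Metropolis sampler for a one-parameter linear model rather than a neural network. The toy data, noise level and prior width are invented for the example; this is a sketch of the general technique, not of Radford's software.]

```python
import math
import random

random.seed(0)

# Toy data: y is roughly 2*x plus noise.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.1, 2.1, 2.9, 4.2, 4.9, 6.1]

SIGMA = 0.3      # assumed known noise standard deviation
PRIOR_SD = 10.0  # wide Gaussian prior on the single weight w

def log_posterior(w):
    # log prior (Gaussian, mean 0) plus Gaussian log likelihood
    lp = -0.5 * (w / PRIOR_SD) ** 2
    for x, y in zip(xs, ys):
        lp += -0.5 * ((y - w * x) / SIGMA) ** 2
    return lp

# Metropolis: random-walk proposal, accept with probability min(1, ratio).
w = 0.0
samples = []
for step in range(20000):
    w_new = w + random.gauss(0.0, 0.2)
    if random.random() < math.exp(min(0.0, log_posterior(w_new) - log_posterior(w))):
        w = w_new
    if step >= 5000:          # discard burn-in
        samples.append(w)

mean_w = sum(samples) / len(samples)
sd_w = (sum((s - mean_w) ** 2 for s in samples) / len(samples)) ** 0.5
print(f"posterior for w: mean {mean_w:.2f}, sd {sd_w:.2f}")
```

The posterior standard deviation is the point of the exercise: unlike a single fitted weight, the sample spread quantifies the uncertainty in predictions, which is one of the advantages of Bayesian learning listed above.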
The URL of my home page is http://www.cs.toronto.edu/~radford. If for some reason this doesn't work, you can get to the same place using the URL ftp://ftp.cs.toronto.edu/pub/radford/www/homepage.html. From the home page, you will be able to get both the thesis and the software.

To get the thesis and the software by anonymous ftp, use the host name ftp.cs.toronto.edu, or one of the addresses 128.100.3.6 or 128.100.1.105. After logging in as "anonymous", with your e-mail address as the password, change to directory pub/radford, make sure you are in "binary" mode, and get the files thesis.ps.Z and bnn.tar.Z. The file bnn.doc contains just the documentation for the software; since this is included in bnn.tar.Z, you will need it separately only if you need to read how to unpack a tar archive, or don't want to transfer the whole thing. The files ending in .Z should be uncompressed with the "uncompress" command. The thesis may be printed on a Postscript printer, or viewed with ghostview.

If you have any problems obtaining the thesis or the software, please contact me at one of the addresses below.

---------------------------------------------------------------------------
Radford M. Neal                                      radford@cs.toronto.edu
Dept. of Statistics and Dept. of Computer Science  radford@stat.toronto.edu
University of Toronto                    http://www.cs.toronto.edu/~radford

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 31 Aug 1995 15:22:38 -0400
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: European Privacy Laws Strengthened

Communications of the ACM Newstrack (Sep 1995, 38:9, p. 11) reports that the European Union approved a privacy directive that requires far stronger protection of personal data. This will require changes in the way U.S. and multi-national companies handle data about European employees and customers.
Provisions of the directive include prohibiting the export of personal data to countries with "weak" privacy laws, notifying individuals of the intended use of personal data, and prohibiting the release of data without the individual's consent.

[See also the IEEE Expert mini-symposium on KDD vs. Privacy, April 1995, and KDD Nuggets 95:15 for more detailed discussion of privacy issues.]

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 31 Aug 1995 17:25:04 -0500
From: ED RIGDON
Subject: Advanced Research Techniques Forum

Every year the American Marketing Association sponsors an "Advanced Research Techniques Forum" (aka the ART Forum) where academics and marketing practitioners come together to discuss the latest research methods. The focus is on practical applications, and the audience is usually 2 practitioners (from marketing and commercial research firms) to 1 academic, with attendance limited to about 200. It is also designed to encourage networking.

The 1996 ART Forum will be held in early June, in Beaver Creek, Colorado. The call for papers will go out in early November 1995, and initial papers are due by early January. The short lead time involved is why I wanted to pass this note along to KDD subscribers who might be interested in presenting.

The emphasis at the ART Forum is on practical applications solving marketing problems. The goal is to have presentations with solid "take-aways" for the blue-chip audience.

For more information, contact Mark Garratt at (414) 931-2425, fax (414) 931-3187. He would be happy to put people on the mailing list for the official call for papers.

--Ed Rigdon

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 01 Sep 95 12:21:36
From: prior@MIT.EDU (Bob Prior - MIT Press)
Subject: [more@newsmaster.tgc.com: 5858 CONVEX DATA MINING AND FORECASTING EMPLOYED BY J.P.
MORGAN HPCwire]

Gregory, I don't know if you see material from HPCwire, an electronic newsletter for the High-Performance Computing community, but here is an article from this week's issue. --Bob

------- Forwarded Message
Date: Fri, 1 Sep 95 08:02:02 -0700
From: more@newsmaster.tgc.com
To: prior@MIT.EDU
Subject: 5858 CONVEX DATA MINING AND FORECASTING EMPLOYED BY J.P. MORGAN HPCwire

CONVEX DATA MINING AND FORECASTING EMPLOYED BY J.P. MORGAN
HPCwire COMMERCIAL NEWS                                     Sept. 1, 1995
==========================================================================

Richardson, Texas -- Financial giant J.P. Morgan is one of the first companies to opt for Convex Computer Corp.'s complex data mining and forecasting applications -- Information Harvester software on the Convex Exemplar and C Series. Released for general availability earlier this week, this high-performance solution can help market researchers in a range of industries from retail and insurance to financial and telecommunications, a Convex spokesperson noted.

"One of the premier justifications for implementing a data warehousing solution is having a data mining tool in place that can access the data within it," said Charles Bonomo, vice president of advanced technology for J.P. Morgan. "The promise of data mining tools like Information Harvester is that they are able to quickly wade through massive amounts of data to identify relationships or trending information that would not have been available without the tool."

The flexibility of the Information Harvesting induction algorithm enables it to adapt to any system. The data can be in the form of numbers, dates, codes, categories, text or any combination thereof. Information Harvester is designed to handle faulty, missing and noisy data. Large variations in the values of an individual field do not hamper the analysis. Information Harvester has unique abilities to recognize and ignore irrelevant data fields when searching for patterns.
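[Moderator's note: Information Harvester's algorithm is proprietary, but the general idea of recognizing an irrelevant field during induction can be illustrated with a standard information-gain calculation. The toy customer table, field names and labels below are invented for the example.]

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, field):
    # Reduction in label entropy obtained by splitting on one field.
    total = entropy(labels)
    n = len(rows)
    by_value = {}
    for row, lab in zip(rows, labels):
        by_value.setdefault(row[field], []).append(lab)
    return total - sum(len(g) / n * entropy(g) for g in by_value.values())

# Toy table: 'usage' perfectly predicts the label; 'id_parity' is noise.
rows = [
    {"usage": "high", "id_parity": "even"},
    {"usage": "high", "id_parity": "odd"},
    {"usage": "low",  "id_parity": "even"},
    {"usage": "low",  "id_parity": "odd"},
]
labels = ["churn", "churn", "stay", "stay"]

for field in ("usage", "id_parity"):
    # gain is 1.0 for 'usage' and 0.0 for 'id_parity', so an induction
    # algorithm splits on 'usage' and ignores the irrelevant field
    print(field, info_gain(rows, labels, field))
```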
In full-scale parallel-processing versions, Information Harvester can handle millions of rows and thousands of variables.

The Exemplar series, based on HP's high-performance PA-RISC processors, is the first supercomputer-class family of systems to track the price/performance development cycle of the desktop. Since the product's introduction just over a year ago, more than 100 Exemplar systems have been installed at customer and Independent Software Vendor (ISV) sites worldwide. They are being used for a range of applications including automotive, tire and aircraft design, petroleum research and exploration, seismic processing, and university, scientific and biomedical research.

"Convex systems are widely recognized for excellent performance in applications that require advanced data and file management, computation, and analytics," said Jay H. Atlas, Convex Vice President and General Manager, Worldwide Sales and Marketing. "These hardware and software system capabilities, combined with Information Harvester's unique modeling and forecasting features, provide users the most powerful data mining solution in the market."

*****************************************************************************
                     H P C w i r e   S P O N S O R S
       Product specifications and company information in this section
          are available to both subscribers and non-subscribers.

 912) Avalon Computer        915) Genias Software       905) MAXIMUM STRATEGY
 934) Convex Computer Corp.  930) HNSX Supercomputers   906) nCUBE
 921) Cray Research Inc.     902) IBM Corp.             932) Portland Group
 909) Fujitsu America        904) Intel Corp.           935) Silicon Graphics
 916) MasPar Computer
*****************************************************************************

Copyright 1995 HPCwire. To receive the weekly HPCwire at no charge, send e-mail without text to "trial@hpcwire.tgc.com".

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: kamath@cs.umass.edu (Mohan U.
Kamath)
Newsgroups: comp.databases
Subject: *NEW* Web Site to Search DATABASE BIBLIOGRAPHIES
Date: 5 Sep 1995 17:35:13 GMT
Keywords: SEARCH, DATABASE, BIBLIOGRAPHIES

A new web page has been established for quickly searching database research bibliographies. The URL for the web page is:

    http://www-ccs.cs.umass.edu/db/bib-search.html

BRIEF DESCRIPTION:
------------------
* Search is performed using a fast indexing/search engine built locally. It uses inverted lists, multilevel indexing and other optimizations to perform competitively with other bibliography sites.
* The search engine indexes close to 10 MB (20,000+ entries) of database research bibliography entries.
* Search supports single and multiple keywords (ANDed), case-(in)sensitive matching, and partial or full match.
* The number of entries returned can be selected from 10, 50, 100 and 200.
* Approximate response times for retrieving 100 entries that match 1, 2 and 3 keywords are 3, 6 and 8 seconds respectively, though *actual* response time may vary depending upon the keyword frequency, network and server traffic.

Check it out!
--
Mohan Kamath | Department of Computer Science

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
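[Moderator's note: the UMass engine's code is not included in the post above; as a minimal sketch of the inverted-list idea it mentions, here is a toy version. The BibIndex class and the sample bibliography entries are invented, multiple keywords are ANDed and matching is case-insensitive, but only full-word match is shown (the real site also supports partial match).]

```python
from collections import defaultdict

class BibIndex:
    """Minimal inverted index: maps each word to the set of entry ids."""

    def __init__(self):
        self.entries = []
        self.postings = defaultdict(set)

    def add(self, text):
        eid = len(self.entries)
        self.entries.append(text)
        for word in text.lower().split():
            self.postings[word].add(eid)

    def search(self, *keywords):
        # Multiple keywords are ANDed; lookup is case-insensitive.
        sets = [self.postings.get(k.lower(), set()) for k in keywords]
        hits = set.intersection(*sets) if sets else set()
        return [self.entries[i] for i in sorted(hits)]

idx = BibIndex()
idx.add("Fayyad Smyth automated cataloging of sky surveys")
idx.add("Agrawal Srikant fast algorithms for mining association rules")
idx.add("Piatetsky-Shapiro knowledge discovery in databases overview")

# Only the second entry contains both 'mining' and 'association'.
print(idx.search("mining", "association"))
```

Because each keyword lookup touches only its own postings list, query time depends on keyword frequency rather than collection size, which matches the response-time behavior described in the post.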