KDD Nuggets 95:22, e-mailed 95-09-11

Contents:
* GPS, IEEE Expert, Aug 1995, article on Data-Mining the Cosmos at JPL
* D. Grube, Opinion: re Nuggets 95:21, "Expert Systems and KDD"
* GPS, Test Datasets for Data Mining
  http://info.gte.com/~kdd/datasets.html
* Radford Neal, Software for Bayesian Learning For Neural Networks
  http://www.cs.toronto.edu/~radford
* GPS, European Privacy Laws Strengthened
* E. Rigdon, CFP: Advanced Research Techniques Forum
* B. Prior, HPCwire: Data Mining and Forecasting at J.P. MORGAN
* M. Kamath, Web Site to Search Database Bibliographies
  http://www-ccs.cs.umass.edu/db/bib-search.html

The KDD Nuggets is a moderated mailing list for information relevant to Data Mining and Knowledge Discovery in Databases (KDD). Please include a DESCRIPTIVE subject line and a URL, when available, in your submission. Nuggets frequency is approximately weekly.

Back issues of Nuggets, a catalog of S*i*ftware (data mining tools), references, FAQ, and other KDD-related information are available at Knowledge Discovery Mine, URL http://info.gte.com/~kdd
by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README

E-mail add/delete requests to kdd-request@gte.com
E-mail contributions to kdd@gte.com

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories)   *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A term for waiting for your Web Browser: World Wide Wait
    Nahum Gershon (MITRE), at SIGGRAPH-95

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 8 Sep 1995 16:32:42 -0400
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: IEEE Expert, Aug 1995, article on Data-Mining the Cosmos at JPL

A very nice article by Dick Price,
entitled "Starlight, Star Bright: Data Mining the Cosmos", profiling the work of Usama Fayyad, Padhraic Smyth (both of JPL) and their colleagues at Caltech, was published in IEEE Expert, Aug 1995.

SKICAT, the Sky Image Cataloging and Analysis Tool, developed by Fayyad and others, uses a decision-tree approach to classify the millions of sky objects, and does it much faster and more accurately than any human could. SKICAT was first presented at ML-93 and KDD-93 and has since appeared in several publications, including the forthcoming Advances in Knowledge Discovery and Data Mining (AAAI/MIT Press). To date, SKICAT has cataloged hundreds of millions of sky objects and has recently helped discover 10 new high red-shift quasars.

Another JPL system, JARtool (JPL Adaptive Recognition Tool), developed by Usama and Padhraic in collaboration with Burl and Perona at Caltech-EE, has been applied to the analysis of Venus SAR images, and enables the cataloging of small volcanoes on the surface. JARtool has also been described at KDD-93 and KDD-94, and in the forthcoming Advances in KDDM book.

The IEEE Expert article is interesting (and complements the technical articles on SKICAT and JARtool) because it contains numerous quotes from the scientist's (user's) perspective illustrating the challenges and potential of KDD in science applications. There's room for many more applications! Congrats again to the entire JPL/Caltech team!

For more information on JPL work see http://www-aig.jpl.nasa.gov/

GPS

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 7 Sep 1995 13:28:32 -0400
From: dmg@cartoon.lc.att.com (Grube David)
To: kdd@gte.com
Subject: Opinion: In response to issue 95:21, "Expert Systems and KDD"

Mr. Gregory Piatetsky-Shapiro, KDD Aficionados,

Having worked with database and knowledge-base systems for several years, I recently have become interested in KDD, and am an avid reader of KDD Nuggets.
I'm doing a bit of research into a KDD system for use at AT&T; in particular, I'm pursuing an approach to KDD that provides for a reasonably autonomous discovery process, yet restricts the problem space via automated application of domain knowledge. That, however, is not my reason for writing to Nuggets.

In the 95:21 issue of Nuggets, a Nuggets member sent in an editorial expressing a concern over the database component versus the expert system component of KDD. What follows is not a response aimed at this particular individual, but rather at the computer science "industry" (academics and otherwise) at large.

At the risk of starting an all-out offensive, which is NOT what KDD Nuggets should be concerned with or waste space on, I posit that KDD is, if nothing else, an inter/intra-disciplinary endeavor. It involves DBMSs, AI, knowledge-base systems, expert systems, statistics, etc. Each component serves an important function in the overall system, and rather than pit the disciplines against one another by over- or under-emphasizing the relative merits of one versus the other (a boring effort that produces no tangible result), we should strive to educate: not only others, but also ourselves, in the perspectives of disciplines other than our own. If, for example, I am a database expert, I daresay that I have quite a bit to learn from other branches of computer science that will only help me in applying my database principles to KDD.

Over the years, the lines of demarcation between the various disciplines within computer science have become blurred. While definition is good (and at some point a line typically MUST be drawn between where/when component A of a KDD system takes over from component B), I prefer to look at the components as working together towards a common goal: producing knowledge that was not previously known.
I suspect that this is the intended result of KDD: not to find out which particular component of a KDD system is more important than the others. Let us put these irrelevant issues to rest and get on with the real work: researching the various disciplines and the branches within those disciplines, applying what we have learned, and ultimately producing useful KDD (and other) systems. In this manner we benefit ourselves, our employers, and the computer industry at large.

I will close with some humor. I recently attended a 2-day Knowledge Discovery in Databases conference held by my employer, AT&T. An interesting comment came out of this conference in reference to a particular software vendor. This vendor markets a reporting tool, and when KDD/data mining was not as popular a term, it was marketed as simply "an ad-hoc database reporting tool". Now, however, in an attempt to capitalize on the burgeoning KDD/data mining market, the same ad-hoc reporting tool is being billed as a "knowledge discovery and data mining tool". My friends, let us not fight amongst ourselves; let us unite to fight the common enemy: the software vendors. :-)

David M. Grube
AT&T Bell Laboratories
Warren, NJ

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 7 Sep 1995 11:20:15 -0400
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: Datasets for Data Mining

I have added a new top-level entry at KD Mine with datasets useful for testing data mining. Here it is below. Any additions or updates are most welcome (in particular, the current pointer to STATLOG datasets does not work -- does anybody know a more recent one?)

-- GPS

----------------
Datasets for Data Mining


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Subject: Software for Bayesian learning available
From: Radford Neal
Date: Thu, 10 Aug 1995 13:04:42 -0400

Announcing Software for

    BAYESIAN LEARNING FOR NEURAL NETWORKS

    Radford Neal, University of Toronto

Software for Bayesian learning of models based on multilayer perceptron networks, using Markov chain Monte Carlo methods, is now available by ftp. This software implements the methods described in my Ph.D. thesis, "Bayesian Learning for Neural Networks". Use of the software is free for research and educational purposes.

The software supports models for regression and classification problems based on networks with any number of hidden layers, using a wide variety of prior distributions for network parameters and hyperparameters. The advantages of Bayesian learning include the automatic determination of "regularization" parameters, without the need for a validation set, avoidance of overfitting when using large networks, and quantification of the uncertainty in predictions. The software implements the Automatic Relevance Determination (ARD) approach to handling inputs that may turn out to be irrelevant (developed with David MacKay).

For problems and networks of moderate size (eg, 200 training cases, 10 inputs, 20 hidden units), full training (to the point where one can be reasonably sure that the correct Bayesian answer has been found) typically takes several hours to a day on our SGI machine. However, quite good results, competitive with other methods, are often obtained after training for under an hour. (Of course, your machine may not be as fast as ours!)

The software is written in ANSI C, and has been tested on SGI and Sun machines. Full source code is included.

Both the software and my thesis can be obtained by anonymous ftp, or via the World Wide Web, starting at my home page. It is essential for you to have read the thesis before trying to use the software.
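[Moderator's note: the package itself should be obtained from the addresses below. Purely as an illustration of the core idea, Markov chain Monte Carlo sampling from a posterior, here is a minimal Metropolis sampler for a one-parameter linear model rather than a neural network. The toy data, noise level and prior width are invented for the example; this is a sketch of the general technique, not of Radford's software.]

```python
import math
import random

random.seed(0)

# Toy data: y is roughly 2*x plus noise.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.1, 2.1, 2.9, 4.2, 4.9, 6.1]

SIGMA = 0.3      # assumed known noise standard deviation
PRIOR_SD = 10.0  # wide Gaussian prior on the single weight w

def log_posterior(w):
    # log prior (Gaussian, mean 0) plus Gaussian log likelihood
    lp = -0.5 * (w / PRIOR_SD) ** 2
    for x, y in zip(xs, ys):
        lp += -0.5 * ((y - w * x) / SIGMA) ** 2
    return lp

# Metropolis: random-walk proposal, accept with probability min(1, ratio).
w = 0.0
samples = []
for step in range(20000):
    w_new = w + random.gauss(0.0, 0.2)
    if random.random() < math.exp(min(0.0, log_posterior(w_new) - log_posterior(w))):
        w = w_new
    if step >= 5000:          # discard burn-in
        samples.append(w)

mean_w = sum(samples) / len(samples)
sd_w = (sum((s - mean_w) ** 2 for s in samples) / len(samples)) ** 0.5
print(f"posterior for w: mean {mean_w:.2f}, sd {sd_w:.2f}")
```

The posterior standard deviation is the point of the exercise: unlike a single fitted weight, the sample spread quantifies the uncertainty in predictions, which is one of the advantages of Bayesian learning listed above.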
The URL of my home page is http://www.cs.toronto.edu/~radford. If for some reason this doesn't work, you can get to the same place using the URL ftp://ftp.cs.toronto.edu/pub/radford/www/homepage.html. From the home page, you will be able to get both the thesis and the software.

To get the thesis and the software by anonymous ftp, use the host name ftp.cs.toronto.edu, or one of the addresses 128.100.3.6 or 128.100.1.105. After logging in as "anonymous", with your e-mail address as the password, change to directory pub/radford, make sure you are in "binary" mode, and get the files thesis.ps.Z and bnn.tar.Z. The file bnn.doc contains just the documentation for the software; since this is included in bnn.tar.Z, you will need it separately only if you need to read how to unpack a tar archive, or don't want to transfer the whole thing. The files ending in .Z should be uncompressed with the "uncompress" command. The thesis may be printed on a Postscript printer, or viewed with ghostview.

If you have any problems obtaining the thesis or the software, please contact me at one of the addresses below.

---------------------------------------------------------------------------
Radford M. Neal                                      radford@cs.toronto.edu
Dept. of Statistics and Dept. of Computer Science  radford@stat.toronto.edu
University of Toronto                    http://www.cs.toronto.edu/~radford

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 31 Aug 1995 15:22:38 -0400
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: European Privacy Laws Strengthened

Communications of the ACM Newstrack (Sep 1995, 38:9, p. 11) reports that the European Union approved a privacy directive that requires far stronger protection of personal data. This will require changes in the way U.S. and multi-national companies handle data about European employees and customers.
Provisions of the directive include prohibiting the export of personal data to countries with "weak" privacy laws, notifying individuals of the intended use of personal data, and prohibiting the release of data without the individual's consent.

[See also the IEEE Expert mini-symposium on KDD vs. Privacy, April 1995, and KDD Nuggets 95:15 for more detailed discussion of privacy issues.]

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 31 Aug 1995 17:25:04 -0500
From: ED RIGDON
Subject: Advanced Research Techniques Forum

Every year the American Marketing Association sponsors an "Advanced Research Techniques Forum" (aka the ART Forum) where academics and marketing practitioners come together to discuss the latest research methods. The focus is on practical applications, and the audience is usually 2 practitioners (from marketing and commercial research firms) to 1 academic, with attendance limited to about 200. It is also designed to encourage networking.

The 1996 ART Forum will be held in early June, in Beaver Creek, Colorado. The call for papers will go out in early November 1995, and initial papers are due by early January. The short lead time involved is why I wanted to pass this note along to KDD subscribers who might be interested in presenting.

The emphasis at the ART Forum is on practical applications solving marketing problems. The goal is to have presentations with solid "take-aways" for the blue-chip audience.

For more information, contact Mark Garratt at (414) 931-2425, fax (414) 931-3187. He would be happy to put people on the mailing list for the official call for papers.

--Ed Rigdon

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 01 Sep 95 12:21:36
From: prior@MIT.EDU (Bob Prior - MIT Press)
Subject: [more@newsmaster.tgc.com: 5858 CONVEX DATA MINING AND FORECASTING EMPLOYED BY J.P.
MORGAN HPCwire]

Gregory, I don't know if you see material from HPCwire, an electronic newsletter for the High-Performance Computing community, but here is an article from this week's issue. --Bob

------- Forwarded Message
Date: Fri, 1 Sep 95 08:02:02 -0700
From: more@newsmaster.tgc.com
To: prior@MIT.EDU
Subject: 5858 CONVEX DATA MINING AND FORECASTING EMPLOYED BY J.P. MORGAN HPCwire

CONVEX DATA MINING AND FORECASTING EMPLOYED BY J.P. MORGAN
HPCwire COMMERCIAL NEWS                                     Sept. 1, 1995
==========================================================================

Richardson, Texas -- Financial giant J.P. Morgan is one of the first companies to opt for Convex Computer Corp.'s complex data mining and forecasting applications -- Information Harvester software on the Convex Exemplar and C Series. Released for general availability earlier this week, this high-performance solution can help market researchers in a range of industries from retail and insurance to financial and telecommunications, a Convex spokesperson noted.

"One of the premier justifications for implementing a data warehousing solution is having a data mining tool in place that can access the data within it," said Charles Bonomo, vice president of advanced technology for J.P. Morgan. "The promise of data mining tools like Information Harvester is that they are able to quickly wade through massive amounts of data to identify relationships or trending information that would not have been available without the tool."

The flexibility of the Information Harvesting induction algorithm enables it to adapt to any system. The data can be in the form of numbers, dates, codes, categories, text or any combination thereof. Information Harvester is designed to handle faulty, missing and noisy data. Large variations in the values of an individual field do not hamper the analysis. Information Harvester has unique abilities to recognize and ignore irrelevant data fields when searching for patterns.
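[Moderator's note: Information Harvester's algorithm is proprietary, but the general idea of recognizing an irrelevant field during induction can be illustrated with a standard information-gain calculation. The toy customer table, field names and labels below are invented for the example.]

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, field):
    # Reduction in label entropy obtained by splitting on one field.
    total = entropy(labels)
    n = len(rows)
    by_value = {}
    for row, lab in zip(rows, labels):
        by_value.setdefault(row[field], []).append(lab)
    return total - sum(len(g) / n * entropy(g) for g in by_value.values())

# Toy table: 'usage' perfectly predicts the label; 'id_parity' is noise.
rows = [
    {"usage": "high", "id_parity": "even"},
    {"usage": "high", "id_parity": "odd"},
    {"usage": "low",  "id_parity": "even"},
    {"usage": "low",  "id_parity": "odd"},
]
labels = ["churn", "churn", "stay", "stay"]

for field in ("usage", "id_parity"):
    # gain is 1.0 for 'usage' and 0.0 for 'id_parity', so an induction
    # algorithm splits on 'usage' and ignores the irrelevant field
    print(field, info_gain(rows, labels, field))
```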
In full-scale parallel-processing versions, Information Harvester can handle millions of rows and thousands of variables.

The Exemplar series, based on HP's high-performance PA-RISC processors, is the first supercomputer-class family of systems to track the price/performance development cycle of the desktop. Since the product's introduction just over a year ago, more than 100 Exemplar systems have been installed at customer and Independent Software Vendor (ISV) sites worldwide. They are being used for a range of applications including automotive, tire and aircraft design, petroleum research and exploration, seismic processing, and university, scientific and biomedical research.

"Convex systems are widely recognized for excellent performance in applications that require advanced data and file management, computation, and analytics," said Jay H. Atlas, Convex Vice President and General Manager, Worldwide Sales and Marketing. "These hardware and software system capabilities, combined with Information Harvester's unique modeling and forecasting features, provide users the most powerful data mining solution in the market."

*****************************************************************************
                     H P C w i r e   S P O N S O R S
       Product specifications and company information in this section
          are available to both subscribers and non-subscribers.

 912) Avalon Computer        915) Genias Software       905) MAXIMUM STRATEGY
 934) Convex Computer Corp.  930) HNSX Supercomputers   906) nCUBE
 921) Cray Research Inc.     902) IBM Corp.             932) Portland Group
 909) Fujitsu America        904) Intel Corp.           935) Silicon Graphics
 916) MasPar Computer
*****************************************************************************

Copyright 1995 HPCwire. To receive the weekly HPCwire at no charge, send e-mail without text to "trial@hpcwire.tgc.com".

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: kamath@cs.umass.edu (Mohan U.
Kamath)
Newsgroups: comp.databases
Subject: *NEW* Web Site to Search DATABASE BIBLIOGRAPHIES
Date: 5 Sep 1995 17:35:13 GMT
Keywords: SEARCH, DATABASE, BIBLIOGRAPHIES

A new web page has been established for quickly searching database research bibliographies. The URL for the web page is:

    http://www-ccs.cs.umass.edu/db/bib-search.html

BRIEF DESCRIPTION:
------------------
* Search is performed using a fast indexing/search engine built locally. It uses inverted lists, multilevel indexing and other optimizations to perform competitively with other bibliography sites.
* The search engine indexes close to 10 MB (20,000+ entries) of database research bibliography entries.
* Search supports single and multiple keywords (ANDed), case-(in)sensitive matching, and partial or full match.
* The number of entries returned can be selected from 10, 50, 100 and 200.
* Approximate response times for retrieving 100 entries that match 1, 2 and 3 keywords are 3, 6 and 8 seconds respectively, though *actual* response time may vary depending upon the keyword frequency, network and server traffic.

Check it out!
--
Mohan Kamath | Department of Computer Science

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
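[Moderator's note: the UMass engine's code is not included in the post above; as a minimal sketch of the inverted-list idea it mentions, here is a toy version. The BibIndex class and the sample bibliography entries are invented, multiple keywords are ANDed and matching is case-insensitive, but only full-word match is shown (the real site also supports partial match).]

```python
from collections import defaultdict

class BibIndex:
    """Minimal inverted index: maps each word to the set of entry ids."""

    def __init__(self):
        self.entries = []
        self.postings = defaultdict(set)

    def add(self, text):
        eid = len(self.entries)
        self.entries.append(text)
        for word in text.lower().split():
            self.postings[word].add(eid)

    def search(self, *keywords):
        # Multiple keywords are ANDed; lookup is case-insensitive.
        sets = [self.postings.get(k.lower(), set()) for k in keywords]
        hits = set.intersection(*sets) if sets else set()
        return [self.entries[i] for i in sorted(hits)]

idx = BibIndex()
idx.add("Fayyad Smyth automated cataloging of sky surveys")
idx.add("Agrawal Srikant fast algorithms for mining association rules")
idx.add("Piatetsky-Shapiro knowledge discovery in databases overview")

# Only the second entry contains both 'mining' and 'association'.
print(idx.search("mining", "association"))
```

Because each keyword lookup touches only its own postings list, query time depends on keyword frequency rather than collection size, which matches the response-time behavior described in the post.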