KDD Nuggets Index

To KD Mine: main site for Data Mining and Knowledge Discovery.

To subscribe to KDD Nuggets, email to kdd-request

Past Issues: 1996 Nuggets, 1995 Nuggets, 1994 Nuggets, 1993 Nuggets

Data Mining and Knowledge Discovery Nuggets 96:25, e-mailed 96-08-06

News:

* GPS, KDD-96 Conference -- Second Wind for KDD

* R. Uthurusamy, KDD-97 CFP (draft), http://www-aig.jpl.nasa.gov/kdd97/

* S. Vere, Data Mining Repository for Very Large Datasets

* J. Fuernkranz, Summary of ICML-96 Workshop on ILP for KDD,

http://www.ai.univie.ac.at/ilp_kdd/

* J. Johnson, Growth of databases and the largest DB today?

Siftware:

* D. Lin, Data mining services and tools available via Internet

* S. Tate, The Data Mining Suite from Information Discovery, Inc.

Meetings:

* R. Zicari, Object World Frankfurt 96 and Internet Forum Europe 96,

http://www.ltt.de

--
Nuggets is a newsletter for the Data Mining and Knowledge
Discovery community, focusing on the latest research and applications.

Contributions are most welcome and should be emailed,
with a DESCRIPTIVE subject line (and a URL, when available) to (kdd@gte.com).
E-mail add/delete requests to (kdd-request@gte.com).

Nuggets frequency is approximately weekly.
Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
and a wealth of other information on Data Mining and Knowledge Discovery
are available at the Knowledge Discovery Mine site, URL http://info.gte.com/~kdd.

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories) *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from the comic strip 'Frank and Ernest'
Frank: Hey, look! This article says scientists have invented a robot
so advanced it's almost human!

Ernest: What makes it 'almost human?'

Frank: It has started blaming its mistakes on the other robots.
Thanks to Susan Tafolla


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 6 Aug 1996 15:07:39 -0400
From: Gregory Piatetsky-Shapiro (gps0%eureka@gte.com)
Subject: KDD-96 Conference -- Second Wind

The Second International Conference on Knowledge Discovery and Data
Mining, just held in Portland, OR, Aug 2-4, 1996, was a great
success. It was attended by over 500 researchers, developers, and
practitioners. There were so many new research ideas
that I cannot summarize them all here, but conference proceedings are
available from AAAI -- see
http://www.aaai.org/Publications/Press/Catalog/KDD/han.html

Many exciting business and scientific applications were also
presented, ranging from detecting cellular phone fraud to finding
earthquake changes from space to analysis of genetic structure.

The demo and poster sessions were full of participants (so full that
many people found it difficult to see the demos) and the bulletin
board outside was full of job ads.

I think that this conference represents a second wind -- an
acceleration of pace of work in Data Mining and Knowledge Discovery (DMKD).
Now most people in the industry have heard of the benefits of
DMKD and the expectations are high.
It is up to this community to deliver the expected results.


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: samy@ru.cs.gmr.com (R. Uthurusamy)
Subject: KDD-97 CFP
Here is the first version of the KDD-97 CFP fyi.
This will be available at KDD-96.
A KDD-97 Poster also will be distributed at KDD-96.

The next version will include the list of PC members and more details.

Any and all suggestions are welcome for us to include in the next CFP.

Please visit the KDD-97 website www-aig.jpl.nasa.gov/kdd97.

A new feature we have included in this site is the HyperNews facility
for everyone to interact with.

Looking forward to your ideas.

Thanks.
sam
----------------------------------------------------------------------------
The Third International Conference on
Knowledge Discovery and Data Mining (KDD-97)

August 14-17, 1997
Newport Beach, California, U.S.A.

Sponsored by the American Association for Artificial Intelligence
----------------------------------------------------------------------------

Call for Papers

The rapid growth of data and information has created a need and
an opportunity for extracting knowledge from databases, and both
researchers and application developers have been responding to that need.
Knowledge discovery in databases (KDD), also referred to as data mining, is
an area of common interest to researchers in machine discovery, statistics,
databases, knowledge acquisition, machine learning, data visualization, high
performance computing, and knowledge-based systems. KDD applications have
been developed for astronomy, biology, finance, insurance, marketing,
medicine, and many other fields.

The third international conference on knowledge discovery and
data mining (KDD-97) will follow up the success of KDD-95 and KDD-96
by bringing together researchers and application developers from
different areas focusing on unifying themes.

Suggested Topics

The topics of interest include, but are not limited to:

Theory and Foundational Issues in KDD

* Data and knowledge representation for KDD
* Probabilistic modeling and uncertainty management in KDD
* Modeling of structured, unstructured and multimedia data
* Fundamental advances in search, retrieval, and discovery methods
* Definitions, formalisms, and theoretical issues in KDD

Data Mining Methods and Algorithms

* Algorithmic complexity, efficiency and scalability issues in data
mining
* Probabilistic and statistical models and methods
* Using prior domain knowledge and re-use of discovered knowledge
* Parallel and distributed data mining techniques
* High dimensional datasets and data preprocessing
* Unsupervised discovery and predictive modeling

KDD Process and Human Interaction

* Models of the KDD process
* Methods for evaluating subjective relevance and utility
* Data and knowledge visualization
* Interactive data exploration and discovery
* Privacy and security

Applications

* Data mining systems and data mining tools
* Application of KDD in business, science, medicine and engineering
* Application of KDD methods for mining knowledge in text, image, audio,
sensor, numeric, categorical or mixed format data
* Resource and knowledge discovery using the Internet

This list of topics is not intended to be exhaustive but an indication of
typical topics of interest. Prospective authors are encouraged to submit
papers on any topics of relevance to knowledge discovery and data mining.

Demonstration Sessions

KDD-97 also invites working demonstrations of discovery systems.
Contact information for details is provided below.

Submission and Review Criteria

Both research and applications papers are solicited. All submitted papers
will be reviewed on the basis of technical quality, relevance to KDD,
novelty, significance, and clarity. Authors are encouraged to make their
work accessible to readers from other disciplines by including a carefully
written introduction. Papers should clearly state their relevance to KDD.

Please submit 7 hardcopies of a short paper (a maximum of 9 single-spaced
pages not including cover page and bibliography, 1 inch margins,
and 12pt font) to be received by March 10, 1997. A cover page must include
the full address and email of the author(s), the paper title, a 200-word abstract, and up
to 5 keywords. This cover page must accompany the paper. In addition, an
ascii version of the cover page should be sent electronically via email to
kdd97pgm@aig.jpl.nasa.gov by March 3, 1997 (preferably earlier).
For the electronic title page, authors are required to use the template
available at http://www-aig.jpl.nasa.gov/kdd97/.

Please mail the 7 hardcopies of the full papers to:

AAAI (KDD-97)
445 Burgess Drive
Menlo Park, CA 94025-3496 USA
Phone: (+1 415) 328-3123
Fax: (+1 415) 321-4457
Email: kdd@aaai.org
Web Site: http://www.aaai.org.

Important Dates

* Submissions Due: March 10, 1997
* Acceptance Notice: April 28, 1997
* Camera-ready paper due: May 26, 1997

KDD-97 Organization
-------------------

General Conference Chair

Ramasamy Uthurusamy (General Motors Corporation, USA)

Program Co-Chairs

David Heckerman (Microsoft Research, USA)
Heikki Mannila (University of Helsinki, Finland)
Daryl Pregibon (AT&T Research, USA)

Publicity Chair

Paul Stolorz (Jet Propulsion Laboratory, USA)

Demo and Poster Sessions Chair

Tej Anand (NCR Corporation, USA)

Contact Information
-------------------

For further information, send inquiries regarding

* submission logistics to AAAI at kdd@aaai.org
Phone: (+1 415) 328-3123
Fax: (+1 415) 321-4457

* KDD-97 sponsorship and industry participation to
Ramasamy Uthurusamy samy@gmr.com
Phone: 810-696-0669
Fax: 810-696-0580

* technical program and content to kdd97pgm@aig.jpl.nasa.gov

* demo and poster sessions to tanand@winhitc.atlantaga.ncr.com

* general and publicity issues to kdd97@aig.jpl.nasa.gov


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Subject: Virtual Data Mining Repository
Date: Tue, 6 Aug 1996 13:20:27 -0700 (PDT)
From: vere@ultragem.com

At the KDD conference which was just held, it was observed that we don't
presently have a good set of large databases available to serve as benchmarks.
The UC Irvine repository is the best we have, but it was assembled with
the needs of machine learning in mind, and the databases tend to be small,
in the kilobyte range, with only a few hundred or a thousand records.

I would like to propose that we create a virtual repository of LARGE,
industrial strength databases to serve as data mining benchmarks.
These benchmark databases would be valuable for data mining
researchers as well as for vendors and users. Data mining researchers would
be able to compare the performance of their new algorithms with
existing algorithms in a scientific, quantitative way. Vendors would be
able to quantify the performance of their systems, so that potential
purchasers of software or services could make informed decisions about
how to spend their money. Right now business users are confused by the
array of alternatives, and decisions are being made based more on the size
of a company's advertising budget than on the quality of its offerings.
It seems that every vendor is claiming in their marketing brochures that
they have the top technology. Until we have better standards, these claims
are not well founded. At Ultragem we have posted performance figures for
some of the Statlog databases, but as noted these are all at the small end of
the size spectrum.

There are a number of problems to be overcome in establishing a set of
benchmarks for data mining. First, databases in the megabyte and gigabyte
range would impose a greater financial burden on the host computer. The
solution to this is to make the repository a virtual one, with contributors
hosting the data on their own computer. The virtual repository
would simply be a web page of pointers to the files, together with brief
descriptions. Also, for the largest databases it may be necessary to suffer the
inconvenience of transmitting the data by mailing tapes, instead of using
the Internet.

There may also be political resistance from established vendors and even
from established research groups for an objective metric for comparing
algorithms and products. However, I would argue that such resistance
is short-sighted. For the long-run health and longevity of the data mining
field, objective standards will benefit everyone: researchers, vendors, and
users.

There may be some reluctance on the part of business people in making
company data publicly available. Certainly there will be corporate
databases that must be kept confidential. However, the inducement for
businesses to contribute databases is that they may well have researchers
and vendors all scrambling to mine their data for free.

The precise breakpoint between the data mining repository and the Irvine
ML repository might be around the one megabyte threshold. I am contacting
Irvine people to get their opinions and reactions. The actual location of the
new virtual repository is yet to be determined.

If no one else volunteers for the chore, I would be glad to
set up the page at our Ultragem site. The important thing is that it be done
right away. It doesn't have to be perfect on the first iteration.

If you have a large database which you would like to contribute to the new
virtual data mining repository, please contact me.
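
As a rough illustration of the proposal above, a virtual repository could be
kept as a small catalog of pointers to contributor-hosted files, with a size
cutoff that separates it from a small-dataset collection. The sketch below is
only one possible reading: the Entry fields, the example.com URLs, the dataset
names and the decimal definition of a megabyte are all assumptions made for
illustration, not part of the proposal.

```python
from dataclasses import dataclass

# Assumption: a decimal megabyte. The post only says "around the one megabyte threshold".
MEGABYTE = 1_000_000


@dataclass
class Entry:
    """One pointer in the virtual repository (all field names are hypothetical)."""
    name: str
    url: str          # location of the file on the contributor's own host
    description: str  # brief description shown on the index page
    size_bytes: int


def belongs_in_mining_repository(entry, threshold=MEGABYTE):
    """Large datasets go to the data mining repository, small ones elsewhere."""
    return entry.size_bytes >= threshold


def render_index(entries):
    """Build a plain-text index page: name and size, then pointer and description."""
    lines = []
    for e in sorted(entries, key=lambda e: e.size_bytes, reverse=True):
        lines.append(f"{e.name} ({e.size_bytes / MEGABYTE:.1f} MB)")
        lines.append(f"  {e.url}")
        lines.append(f"  {e.description}")
    return "\n".join(lines)


# Made-up example entries; no real datasets are implied.
catalog = [
    Entry("call-records", "http://example.com/data/calls.dat",
          "Hypothetical telephone call records", 2_500_000_000),
    Entry("small-sample", "http://example.com/data/sample.dat",
          "Hypothetical small sample", 40_000),
]

large = [e for e in catalog if belongs_in_mining_repository(e)]
index_page = render_index(large)
print(index_page)
```

Only the index is stored centrally in this sketch; the data files stay with
their contributors, which is the point of making the repository virtual.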

_____________________________________________________________________
__ __ __ __ __ __ __ __ __ __ __
|__| |__| |__| |__| |__| |__| |__| |__| |__| |__| |__| |

Dr. Steven Vere (408) 338-3302
Ultragem Data Mining fax: (408) 338-7503
450 Wildberry Drive e-mail: vere@ultragem.com
Boulder Creek, CA 95006 web: http://www.ultragem.com
__ __ __ __ __ __ __ __ __ __ __
|__| |__| |__| |__| |__| |__| |__| |__| |__| |__| |__| |
_____________________________________________________________________

[At KDD-96 conference, many people including me supported the idea
of having large (mega and giga-byte size) test sets.
Perhaps interested people should contact Dr. Vere at the address above
and let's go with it! GPS]


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 02 Aug 1996 15:28:51 +0200
From: Johannes Fuernkranz (juffi@ai.univie.ac.at)
Subject: ICML-96 WS on ILP for KDD
Hi,

you wanted to have some information on the ILP for KDD workshop.
At http://www.ai.univie.ac.at/ilp_kdd/ you can find

Aim & Scope (Call for Papers)
Schedule and Papers
Workshop Summary
Registered Participants

Cheers,

Bernhard & Juffi

--
Johannes Fuernkranz
Austrian Research Inst. for Artificial Intelligence +43-1-5336112-17(Tel)
Schottengasse 3, A-1010 Vienna, Austria, Europe +43-1-5320652(Fax)
E-mail: juffi@ai.univie.ac.at WWW: http://www.ai.univie.ac.at/~juffi

[I have included the workshop summary here. GPS]

Summary of the Workshop on ILP for KDD

by Bernhard Pfahringer and Johannes Fuernkranz

The MLNet Familiarization Workshop on 'Data Mining with Inductive Logic
Programming (ILP for KDD)' took place on July 2nd in Bari in conjunction
with ICML-96. It focused on the potential of ILP as a tool for data
mining and its shortcomings. Many standard methods for Knowledge Discovery
in Databases (KDD) are constrained to processing a single relational table,
whereas many real-world databases are structured into several tables
containing interrelated information. As Inductive Logic Programming (ILP)
algorithms explicitly aim at exploiting structured information, KDD should
be a fruitful research and application area for ILP. Yet, ILP algorithms
seem to be rarely used in KDD. The papers presented were grouped into
applications, algorithmic developments, and database interface issues. The
papers and other information on the workshop can be downloaded from
http://www.ai.univie.ac.at/ilp_kdd/.
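
To make the single-table limitation mentioned above concrete, here is a
minimal sketch, not taken from any workshop paper. It uses a made-up
two-table database (customers and their orders) and contrasts a relational
test, which looks at individual related rows, with a flattened one-row-per-
customer table of the kind a single-table learner would see. All table names,
fields and values are hypothetical.

```python
# Hypothetical two-table database; every name and value is made up.
customers = [
    {"id": 1, "region": "north"},
    {"id": 2, "region": "south"},
    {"id": 3, "region": "north"},
]
orders = [
    {"customer_id": 1, "amount": 250},
    {"customer_id": 1, "amount": 40},
    {"customer_id": 2, "amount": 30},
]


def has_large_order(customer_id, threshold=100):
    """Relational test: does at least one related order exceed the threshold?"""
    return any(
        o["customer_id"] == customer_id and o["amount"] > threshold
        for o in orders
    )


def flatten(customers, orders):
    """Collapse the one-to-many relation into one row per customer.

    Individual orders disappear; only aggregates chosen in advance survive.
    """
    rows = []
    for c in customers:
        amounts = [o["amount"] for o in orders if o["customer_id"] == c["id"]]
        rows.append({
            "id": c["id"],
            "region": c["region"],
            "n_orders": len(amounts),
            "total": sum(amounts),
        })
    return rows


flat = flatten(customers, orders)
relational = {c["id"]: has_large_order(c["id"]) for c in customers}
print(flat)
print(relational)
```

The flattened table keeps counts and totals, but the question "is there a
single order above 100?" can no longer be answered from it unless that
aggregate was anticipated, whereas a relational learner can refer to the
order rows directly.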

The applications section consisted of three papers. Uros Pompe presented a
paper which demonstrated a successful application of learning a two-voice
counterpoint from a musical database with a stochastic ILP algorithm. Saso
Dzeroski gave an interesting talk on a chemical application, where the
relational learning algorithm FOIL performed significantly worse than
propositional learning systems. The reason was FOIL's tendency to overfit
the data even though it learned simpler rules. The most promising results on
this task were achieved by a relational instance-based learning system. The
paper by David Lorenzo, who unfortunately was not able to attend the
workshop, describes a technique for applying ILP algorithms to a wide class
of temporal databases. He demonstrated his approach with an application
where he used CLAUDIEN on a clinical database.

The algorithms section contained four presentations. The first talk argued
that conventional ILP algorithms might be too inefficient for KDD and
suggested instead the use of propositional inverse resolution for
restructuring databases. In the next talk Wim Van Laer presented techniques
for extending ICL to handle numeric data and multi-class learning
problems. Markus Wiese suggested that FOIL-like algorithms can gain
efficiency and avoid myopia by using a bi-directional search starting from
random clauses. Finally Gianni Semeraro defined a refinement operator that
is ideal for a subclass of Datalog clauses under theta-subsumption and
demonstrated its effectiveness in an application of electronic document
classification.

The last section was on integrating ILP algorithms with database systems. It
contained two complementary presentations, in which it was shown how to
interface the ILP algorithms RDT and CLAUDIEN with real databases. The
authors of both papers agreed that the restriction of most ILP systems to
have all relevant background knowledge in main memory has to be overcome in
order to tackle larger KDD tasks. They also found that a one-to-one mapping
from relational tables in a database to PROLOG relations is impractical and
suggested several alternatives.

The general discussion concluding the workshop was dominated by the quest of
defining data mining itself. A pragmatic definition based on the size of the
input description was dismissed by the majority of the attendees, and
the somewhat ad-hoc definition of data mining as an interactive process also
proved controversial. Interestingly, nobody tried to define data mining or
Knowledge Discovery in Databases (KDD) by its presumed result, i.e. whether we
are able to (re)discover some interesting knowledge from the given data. The
least common denominator agreed to seemed to be 'ILP might serve well as one
of the tools needed for data mining' (the practitioner's view as expressed
by Gholamreza Nakhaeizadeh).

The remainder of the discussion centered on the question of what sort of tools
ILP can provide for KDD along the dimensions propositional vs. relational
learning, optimizing predictive accuracy vs. discovery of explicit
knowledge, classification vs. general discovery, extensional vs. intensional
background knowledge, relational databases vs. relational learning, and
efficiency vs. complexity. The general conclusion seemed to be that one of
the main advantages of ILP is its flexibility in incorporating various forms
of background knowledge. In particular the ability of many ILP systems to
use strong language biases could prove invaluable for large KDD tasks.


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(if you have info requested below, please also cc to kdd@gte.com and
I will summarize. GPS).

Date: Tue, 30 Jul 1996 12:10:45 -0500
From: 'J. T. Johnson' (jjohnson@sfsu.edu)
Subject: DB Size

For a project on the history of computing, I am trying to gather
statistics that would illuminate the growth curve of data base size, 'size'
in Kbytes.
To date, I have been unsuccessful in finding many reliable, constant
sources for such data, and I would appreciate any pointers from the folks on
this list.
Also, I am trying to find information on what today is the largest
single data base in use. I have heard -- and this might well fall into the
category of contemporary folklore -- that either the Mormon church (which
maintains a huge file of genealogical records) or the government of Thailand
have DBs well into the terabyte range, both of which surpass the U.S.
Internal Revenue Service. Any leads here?
Many thanks, Tom Johnson
*****************************************************************
* J. T. Johnson San Francisco State Univ. *
* Dept. of Journalism *
* Voice: 505-473-9646 E-mail: jjohnson@sfsu.edu *
*****************************************************************


>~~~Siftware:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 1 Aug 1996 09:34:47 -0700
From: datamine@ix.netcom.com (IDI)
Subject: Data mining services and tools available in Internet

We would like to announce that a new data mining service is available
through the Internet. Information Discovery, Inc. has just announced
two Internet data mining applications for the retail and banking industries.

Customer Retail News (TM): performs enterprise data analysis and creates
an Internet homepage communication channel for retail businesses.

Intra-Knowledge (TM): analyzes financial data and presents the
information from data mining, queries, etc. in Internet newspaper
formats, so that banks can distribute information globally.

Additional information can be obtained from our web site at
http://www.datamining.com

Thank you
Best regards,
Diana Lin
Information Discovery, Inc.


>~~~Siftware:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 30 Jul 1996 19:18:29 GMT
Subject: IDIS update
From: minedata@usa.pipeline.com (data mining1)

Gregory,

Here is the updated information. It's actually not IDIS anymore, but 'The
Data Mining Suite' and we have another product called Intra/knowledge.
Let me know if you have any questions.

Stephanie Tate

For Siftware

URL: http://www.datamining.com

Name: The Data Mining Suite(tm)

Description: The Data Mining Suite(tm) currently consists of the following
products:
IDIS: The Information Discovery System(tm) for knowledge discovery from large
databases. IDIS automatically discovers rules, patterns and anomalies from
databases. It determines the right questions to ask by itself, with or
without a pre-specified hypothesis supplied by the user and discovers
patterns hidden in large databases. This discovery may be guided and
influenced by the user, or IDIS may be set free to roam the database by
itself, discovering totally unexpected relationships. The Mark Twain
Edition of IDIS automatically generates narrative sentences describing the
influence relationships in the data. The reports are automatically
formatted as English text.
IDIS: Predictive Modeler(tm) for prediction and forecasting. The IDIS-PM
makes predictions and forecasts by using the rules and patterns which IDIS
generates. IDIS-PM performs pattern matching to make predictions based on
the application of these rules.
The Map Discovery System(tm) for the discovery of geographic patterns in
databases. Map/IDIS automatically analyzes data associated with maps,
discovering interesting and significant geographic patterns by itself.
Map/IDIS uses Information Discovery's Geographic Reasoning Engine to
automatically analyze large amounts of geographically oriented data. It
applies knowledge of geography to database analysis. Map/IDIS works in
conjunction with a Geographic Information System (GIS). The GIS plots what
Map/IDIS discovers. While IDIS performs pattern discovery to find
interesting rules, Map/IDIS performs pattern discovery to find interesting
maps.

Discovery Tasks: Classification, Clustering, Summarization, Deviation
Detection, Dependency analysis, Geographic pattern discovery, Prediction,
Text generation.

Comments: IDIS accesses large SQL data bases directly without flat files
or small extracts, it is more powerful than decision trees, IDIS:PM gives
fully restrained and explainable predictive modeling, uses system
initiative in the data mining process, not sensitive to noise, can deal
with data sets of 50 to 100 million records on parallel machines

Platforms: The Data Mining Suite has a three tiered client server
architecture which allows MS-Windows and/or NetScape clients to access SQL
database servers on platforms such as Hewlett Packard's HP 9000, IBM
RS/6000, SUN UltraSparc, and Unisys servers.

Contact: Information Discovery, Inc
703B Pier Avenue, Suite 169
Hermosa Beach, CA 90254
(310) 937-3600 (phone)
(310) 937-0967 (fax)
datamining@ix.netcom.com

Status: Commercial Product
Source of Information: Vendor

For Siftware

URL: http://www.datamining.com

Name: INTRA/Knowledge(tm)

Description: INTRA/Knowledge, targeted toward the financial industries,
uses an Intranet server that supports datamining and delivers readable
English text describing the status of a bank or financial institution's
business. Information on market segment differentiation and influence
summaries are included. This text is specifically tailored to the needs of
each group of users.

INTRA/Knowledge combines data analysis with the internet, and is formatted
as a customized dynamic newspaper -- one that is automatically generated
from the database. It reads through the vast data generated daily by
financial institutions, discovering key trends contained in that
information base. It then converts it into easy-to-understand English
text, delivered over the company's Intranet. Each user can access
up-to-the-minute 'custom-tailored' information that is most relevant at any
hour of the day or night. There is no required learning for this system
other than knowledge of the day to day e-mail system.

This vertical application fully operates on the corporate INTRAnet and uses
data mining to generate fully readable text solutions thus merging
structured data in SQL repositories with text in HTML format.

Discovery Tasks: Classification, Clustering, Summarization, Deviation
Detection, Text generation

Comments: Accesses the SQL data base directly, fully restrained and
explainable predictive modeling, uses system initiative in the data mining
process, not sensitive to noise, can deal with data sets of 50 to 100
million records on parallel machines, automatic text generation on the
inter/intranet, clearly explaining the contents of a database following a
data mining process

Platforms: The Data Mining Suite has a three tiered client server
architecture which allows MS-Windows and/or NetScape clients to access SQL
database servers on platforms such as Hewlett Packard's HP 9000, IBM
RS/6000, SUN UltraSparc, and Unisys servers.

Contact: Information Discovery, Inc. and Unisys Corp.
703B Pier Avenue, Suite 169
Hermosa Beach, CA 90254
(310) 937-3600 (phone)
(310) 937-0967 (fax)
datamining@ix.netcom.com

Status: Commercial Product

Source of Information: Vendor


>~~~Meetings:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: zicari@informatik.uni-frankfurt.de
Subject: Web address
Date: Tue, 30 Jul 1996 18:43:20 +0200 (METDST)

The full conference programs of
Object World Frankfurt 96 and Internet Forum Europe 96
-October 9-11, Frankfurt/Main-

are on line at:

http://www.ltt.de

For any inquiry, please e-mail: roberto_zicari@omg.org

Regards

Roberto Zicari

OWF/IFE Chair


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~