Knowledge Discovery Nuggets 97:08

Knowledge Discovery Nuggets 97:08, e-mailed 97-02-28

* GPS, New Location for KD Mine and KD Nuggets
* W. Kloesgen, KDD-97: Second Call For Panel Proposals
* P. Maiste, Price Waterhouse announces new data mining services
* T. Denecke, Query: Data Mining and Workflow Management ?
* D. Throop, Query: Finding approximately duplicate records ?
* P. Stolorz, CFP: DMKD special issue on scalable computing

* G6G, Intelligent Software Web Site
* W. Buntine, summer students and scientist positions in
autonomous data analysis
* B. Masand, KDD Job at GTE Laboratories, Waltham, Ma
* S. Wrobel, Two positions in Machine Learning/Data Mining at GMD

Knowledge Discovery Nuggets is a free electronic newsletter for the Data Mining and Knowledge Discovery community, focusing on the latest research and applications.

Submissions are most welcome and should be emailed, with a DESCRIPTIVE subject line (and a URL) to gps. Please keep meeting announcements short and put all the details on the meeting web page !

To subscribe, see

KD Nuggets frequency is 3-4 times a month. Back issues of KD Nuggets, a catalog of data mining tools ("Siftware"), and a wealth of other information on Data Mining and Knowledge Discovery are available at the Knowledge Discovery Mine site

-- Gregory Piatetsky-Shapiro (editor)
********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or KD Nuggets) *

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
'An experimental science is supposed to do experiments
that find generalities. It's not just supposed to
tally up a long list of individual cases and their
unique life stories. That's butterfly collecting.'
Richard C. Lewontin, biology professor at Harvard University
Thanks to Yolanda Gil

Previous  1 Next   Top
Date: Fri, 28 Feb 1997 09:41:10 -0500 (EST)
From: Gregory Piatetsky
Subject: New Location of KD Mine --

I have set up a new location for Knowledge Discovery Mine web site
-- --
which is operational today, Feb 28, 1997.

I will continue to maintain and improve that site in my new job --

The GTE location at will remain for some time,
but I will not be updating it.

I will also continue to edit and email Knowledge Discovery Nuggets
(I have dropped the second D to emphasize the more general focus).
It will be gradually transitioned to site,
but in the meantime will continue to be distributed from GTE.
The changeover should be transparent to all subscribers.

Gregory Piatetsky-Shapiro

please address KD Nuggets related email to gps

Previous  2 Next   Top
Date: Wed, 26 Feb 1997 14:41:27 +0100
From: (Willi Kloesgen)
Subject: KDD-97 organization -- call for panels

As in previous KDD conferences, the KDD-97 program will include panel
discussions. A great panel requires an interesting topic, good
speakers, and proper preparation. To facilitate all three we solicit
early suggestions. Please submit suggestions for topics and preferably also
for panelists who could represent diverse positions on or approaches to the
topic. Suggested topics should relate to any of the main KDD-97 topics (see
The panel topics should be of general interest for a
large part of the KDD audience and allow several (controversial) approaches
to be discussed.

Please email informal suggestions by April 2, 1997 (earlier if possible) to:

Willi Kloesgen

Previous  3 Next   Top
Date: Fri, 21 Feb 1997 13:34:22 +0100
From: Tom Denecke (
Subject: Data Mining and Workflow Management

I am a student of Business Science working on a research project,
'controlling of workflow processes'.

My idea is to use data mining techniques to evaluate the control data
of workflow systems. My problem is that I am not very familiar with the
technical terms, so it would be great to get a hint about which methodologies
would fit this application domain.

Here is a short description of the kind of information that is available:

There are several process instances of each process type (for example
After the execution of 100 instances, there exists a lot of data for this
process type, which can be explored.

- processing and idle time
- who executed the process (employee, role, orga. unit)
- which kind of workflow
- which activities were executed
- data about the process object (which customer, article ...)
- which other processes are running
- metrics concerning quality and cost of a process/activity
- ...

We would like to generate rules about the process performance
(bottleneck detection, when does a process perform well, ...).

I would be very grateful for a little information on whether there are similar
problems that have been solved by data mining techniques, or just literature

Thank you very much

Tom Denecke
- MBA -

WWU Muenster
Rudolf-Harbig-Weg 24
48149 Muenster
PHONE + 49 251 89 75 65
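
A first descriptive step toward the bottleneck question above can be
sketched in a few lines of Python. This is only an illustration, not rule
induction: the record fields (activity, processing_time, idle_time), the
function names, and the 0.5 threshold are all assumptions made for the
example, not part of any workflow system's log format. The sketch ranks
activities by the average share of their elapsed time that was spent idle.

```python
from collections import defaultdict
from statistics import mean


def idle_share_by_activity(records):
    """Return the mean idle-time share per activity.

    `records` is an iterable of dicts with the hypothetical keys
    'activity', 'processing_time' and 'idle_time' (same time unit).
    """
    shares = defaultdict(list)
    for rec in records:
        total = rec["processing_time"] + rec["idle_time"]
        if total > 0:  # skip records with no elapsed time
            shares[rec["activity"]].append(rec["idle_time"] / total)
    return {activity: mean(values) for activity, values in shares.items()}


def bottleneck_candidates(records, min_idle_share=0.5):
    """List activities whose mean idle share reaches the threshold,
    most idle first. The threshold is an arbitrary illustrative value."""
    shares = idle_share_by_activity(records)
    flagged = [a for a, s in shares.items() if s >= min_idle_share]
    return sorted(flagged, key=lambda a: shares[a], reverse=True)


records = [
    {"activity": "approve", "processing_time": 1, "idle_time": 9},
    {"activity": "approve", "processing_time": 2, "idle_time": 8},
    {"activity": "enter", "processing_time": 5, "idle_time": 1},
]
print(bottleneck_candidates(records))
```

Grouping by employee, role, or process object instead of activity would
follow the same pattern; learning actual rules from such tables would need
a rule or tree induction method on top of this kind of summary.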

Previous  4 Next   Top
[The following is a commercial announcement. GPS]

Date: Fri, 28 Feb 97 08:09:07 EST
Subject: Press Release: Opening of a Knowledge Discovery Center


Price Waterhouse Management Consulting in New York:
Jan Butler
212- 819-4838,
Liza Kurtz
212-995-5680, ext. 210,

New York, NY - February 26 - Price Waterhouse Management Consulting, a
recognized leader in delivering data warehouse services to global companies,
introduces Data Mining Services for helping clients achieve strategic value
from the mounds of data often accumulated in the course of business. An
integrated offering of Price Waterhouse's Global Data Warehouse Practice, the
Data Mining Services range from introductory seminars on data mining and
knowledge discovery to full data mining system implementations. To support
these offerings, Price Waterhouse has opened the Knowledge Discovery Center in
Bethesda, Maryland.
'Data mining has recently moved to the forefront of business executives'
strategic data warehouse initiatives, driven by a significant growth in the
amount of data that companies collect on their customers, processes, and
finances,' said Mike Schroeck, Global Data Warehouse Practice Leader for Price
Waterhouse. Data mining technologies use sophisticated, automated algorithms to
discover hidden patterns, correlations, and interacting relationships among the
hundreds of strategic data elements collected by an organization. The impact of
data mining on a company's bottom line, whether through increased revenues or
decreased costs, is often enormous.
A leader in data mining knowledge and research, Price Waterhouse has performed
a comprehensive, hands-on evaluation of many of the leading data mining tools
currently available on the market, and has spoken at a variety of conferences
and trade shows on the subject. With years of analytical modeling and data
analysis experience, Price Waterhouse can help clients get the greatest return
on their data mining investment. 'We are dedicated to offering value-added data
mining analyses to our clients. The time for businesses to take advantage of
these tools and algorithms has never been better,' says Dr. Glenn Galfond,
Partner in charge of Price Waterhouse's Management Analytics practice, which is
spearheading the firm's Data Mining Services.
The Data Mining Services offered by Price Waterhouse include Data Mining 101,
Data Mining Proof, Data Mining Service, and Data Mining Solutions. Data Mining
101 is a half-day beginner's course in data mining. The course provides an
overview of the technology, examples of how it has been successfully used, and
a demonstration of the leading data mining tools. Data Mining Proof is a short
proof of concept project, in which Price Waterhouse mines a small extract of a
client's data for quick, but rewarding results. This allows the client to see
data mining's potential in a hands-on environment. Clients also receive a copy
of PW's comprehensive Data Mining Tool Evaluation report.
For companies that are ready to delve more deeply into data mining but do not
have the necessary in-house resources, Data Mining Service offers a full range
of data mining outsourcing options, including data extraction, data cleansing,
and data mining. For companies that wish to implement enterprise-wide data
mining systems, Data Mining Solutions offers Price Waterhouse's proven data
mining and data warehousing methodology and full-scale systems implementation
The Knowledge Discovery Center will be used to support these services and to
provide an environment for demonstrating the latest data mining tools and training
clients in their use. Price Waterhouse has equipped the Center with many of the
leading data mining tools. The technologies and algorithms available in the
Center encompass the full breadth of data mining capabilities. Galfond adds,
'Price Waterhouse has invested heavily in the research and evaluation of the
leading data mining tools. Our clients can take advantage of this investment
while reaping the benefits that data mining brings to their companies.'
Price Waterhouse Management Consulting delivers enterprise-wide solutions to
large multinational clients through integrated Information Technology and
Change Integration services. With in-depth knowledge of selected industries
and business process expertise, Price Waterhouse Management Consulting works
with clients worldwide, from strategy through implementation, to help them
improve business performance. Price Waterhouse Management Consulting services
are provided in the U.S. by Price Waterhouse LLC.

Previous  5 Next   Top
[Please cc responses to the
since the problem is of general interest. GPS]

From: 'Throop, David R' (
Subject: Looking for phrase matching tool
Date: Tue, 25 Feb 1997 10:03:30 -0600

Dr. Piatetsky-Shapiro,

Thank you for your excellent website on data mining. I'm hoping you
might help me, or point me towards someone who can.

I'm looking for a piece of commercial software that may or may not
exist. I couldn't find it on your pages, but your stuff is the closest
I've found. So I'm asking you for any pointers.

We have several databases which have lists of components (pieces of the
International Space Station.) These databases have no common key. They
do, however, have English-language descriptions of the components (on
the order of 20 - 50 characters long.) However, these descriptions are
not identical. For instance, a certain power switch is known by two
different names:
RPCM N1-3B-C Switch14 and N1-3B-RPCM-C-RPC-14
As you see, the order of the identifiers is different, one set uses the
term 'switch' where another uses 'RPC', and the '14' is concatenated
with no space on one side.

Anyway, I'm looking for a piece of software that could go through the
databases, (armed with a dictionary, list of abbreviations, synonyms
etc) and come up with a set of best guesses about which items match.

Do you know of such a tool, either as a commercial product or a research

David Throop
281 212 9369
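
One simple way to approach this kind of matching is to normalize each
description into a set of tokens and compare the sets. The Python sketch
below is an illustration under stated assumptions, not a pointer to any
existing product: the synonym table holds only the SWITCH/RPC pair from the
example above, and the function names and the 0.6 threshold are
hypothetical. Splitting letters from digits handles the 'Switch14' case,
and using sets makes the comparison independent of identifier order.

```python
import re

# Hypothetical synonym table; only SWITCH -> RPC is taken from the query.
SYNONYMS = {"SWITCH": "RPC"}


def tokens(description, synonyms=SYNONYMS):
    """Split a description into upper-case letter runs and digit runs,
    then map each token through the synonym table."""
    parts = re.findall(r"[A-Za-z]+|\d+", description.upper())
    return frozenset(synonyms.get(part, part) for part in parts)


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_matches(left, right, threshold=0.6):
    """For each description in `left`, return the most similar description
    in `right` if its score reaches the (arbitrary) threshold."""
    right_tokens = [(desc, tokens(desc)) for desc in right]
    result = []
    for desc in left:
        desc_tokens = tokens(desc)
        scored = [(jaccard(desc_tokens, rt), r) for r, rt in right_tokens]
        score, candidate = max(scored, default=(0.0, None))
        if candidate is not None and score >= threshold:
            result.append((desc, candidate, score))
    return result


left = ["RPCM N1-3B-C Switch14"]
right = ["N1-3B-RPCM-C-RPC-14", "N1-3B-RPCM-C-RPC-15"]
for match in best_matches(left, right):
    print(match)
```

Set comparison ignores repeated tokens and treats every token as equally
important, so a real tool would likely weight rare identifiers more heavily
and present near-ties for manual review.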

Previous  6 Next   Top
Date: Thu, 27 Feb 1997 22:43:45 -0800 (PST)
Subject: CFP for DMKD special issue on scalable computing



Special Issue on
Scalable High-Performance Computing for KDD

Guest editors: Paul Stolorz and Ron Musick

Traditional computational techniques and computer architectures are
routinely overwhelmed by the sheer volume and complexity of information
generated from data-gathering instruments, computational and
experimental methodologies, and business operations. The fundamental
problem of extracting knowledge and insight from massive databases and
datasets is shared across a wide range of fields in business,
academia and government. The new field of Data Mining and Knowledge
Discovery in Databases (KDD) has arisen as an interdisciplinary response
to this situation, merging ideas drawn from disciplines such as statistics,
pattern recognition, machine learning, databases, visualization and
high performance computing.

This special issue of Data Mining and Knowledge Discovery is devoted
to the challenge of applying data mining and knowledge discovery methods
to large, complex datasets. Implementation of data mining ideas in
high-performance computing environments is crucial for coping with
large-scale data. In particular, parallel and distributed systems are
needed to ensure system scalability as datasets grow inexorably in size
and scope. These environments include dedicated massively parallel
supercomputers, super-servers built from clusters of commodity
workstations and high-speed network interfaces, and heterogeneous
networks distributed over regional, national and global scales.
High-performance and parallel computing holds the promise of scaling
to large data sets, allowing the data mining component to search a much
larger set of patterns and models than traditional computational platforms
and algorithms would allow. In addition, it promises to render the KDD
process much more interactive by allowing fast response times for
difficult search and model fitting problems.

Data Mining and Knowledge Discovery, published by Kluwer Academic
Publishers, is the flagship publication in the rapidly growing area of
KDD. In this special issue we solicit the most dramatic new
developments in high performance large-scale KDD applications, highlighting
the promise of the technology and identifying the main challenges for
the future. Technically innovative papers that describe new theoretical
developments, or tackle the application of practical data mining
approaches to real problems and datasets on parallel and distributed
architectures, are solicited. Topics of interest include, but are
not limited to, the intersection of KDD with the following fields:

Parallel implementations of datamining & KDD methods:
  Classification and regression: e.g. decision trees, neural nets
  Pattern recognition
  Belief nets and other Bayesian approaches
  Genetic programming
  Association rules
  Statistical inference
  Similarity detection and measurement
  Clustering and density estimation
  Text retrieval
  Content-based indexing
  Data visualization
  Trend Analysis

Integration of KDD techniques with scalable I/O systems:
  Data warehouses & federated databases
  Parallel file systems
  High-performance network interfaces
  Intelligent data layout
  Out-of-core algorithms
  Parallel relational querying
  High performance storage systems
  Hierarchical and distributed storage

Methods to control complexity:
  Random sampling
  Anytime algorithms applied to datamining techniques
  New complex data-type algorithms (e.g. not based on feature vectors)
  Domain simplification techniques
  Inference error/confidence characterization

Parallel, clustered and/or distributed applications:
  Datamining on commodity-based clusters and networks
  Web-oriented datamining
  Novel applications and case studies
  Knowledge discovery systems and tools

Electronic submissions are STRONGLY ENCOURAGED. Postscript copies
of papers may be emailed to Latex style
files and related instructions can be obtained at the web site



Enquiries about the submission process and scope of the special issue
may be sent to

Previous  8 Next   Top
[The following is a commercial announcement. GPS]

Date: Mon, 24 Feb 1997 22:47:04 -0500
Subject: SAIC and G6G Develop an Intelligent Software Web Site

'SAIC and G6G Develop an Intelligent Software Web Site'

NEW Web-Site Address is:

Science Applications International Corporation's (SAIC) Asset Source for
Software Engineering Technology (ASSET) Division has teamed up with G6G
Consulting Group (G6G) and co-developed a groundbreaking new World Wide
Web (Web) site focused on 'intelligent software.'

The new site contains the entire content of 'The G6G Directory of
Intelligent Software,' a publication that contains over 750 abstracts
covering 15 advanced technology corridors.

'The G6G Directory of Intelligent Software' contains product abstracts in
Expert (Knowledge-Based) Systems, Fuzzy Logic, Hypermedia, Hypertext and
Multimedia, Intelligent Software Tools, Neural Networks, Object-Oriented
Programming, Virtual Reality, Voice & Speech Systems, and other areas.
The directory is further categorized by over 140 sub-categories of 'what'
the product can be used for or 'what it is' such as:

- Data Mining                   - Manufacturing Systems
- Diagnostic Systems            - Modeling
- Help Desk Systems             - Network Systems
- Help Authoring Systems        - Stock Market
- Knowledge Management          - Software/Hardware
- Lending and Learning Systems  - Software Development
- Customer Support Systems      - and many others.

The directory content on this Web site will be updated on a weekly
basis. The combination of G6G's directory and ASSET's on-line free and
commercial product inventory will present a powerful complement of
information on the Web. Knowledge engineers, software engineers,
developers and other users of intelligent software products will find
the site extremely useful.

This valuable free resource will help create a sense of community in the
world of intelligent software by providing an on-line source of
searchable information about intelligent software products and vendors.

The G6G Directory of Intelligent Software
SAIC/ASSET        G6G Consulting Group
(304) 284-9000    (310) 458-4187

Previous  10 Next   Top
Date: Tue, 18 Feb 1997 14:39:31 -0800
From: Wray Buntine (
Subject: summer students and scientist positions in autonomous data analysis

Please note the two sets of positions below:
* Research scientist
* 2 summer students, or longer term support for PhD
The summer student position could be transferred into
longer term support for focussed PhD research if the
interest is right.

Wray Buntine

======================= Scientist

NASA's Center of Excellence in Information Technology at
Ames Research Center invites candidates to apply for a position as
Research Scientist in Information Technology:

Position description:

* We seek applicants to join a small team of space scientists and
computer scientists in developing NASA's next generation smart spacecraft
with on-board, autonomous data analysis systems. The group includes
leading space scientists (Ted Roush, Virginia Gulick) and leading data
analysts (Wray Buntine, Peter Cheeseman), and their counterparts at JPL.
* The team is doing the research and development required for
the task, and has a multi-year program with deliverables
planned. This is not a pure research position, and requires
dedication in seeing completion of the R&D milestones.
* The applicant will be responsible for the information technology side
of R&D, with guidance from senior space scientists on the project.
* The research has strong links with on-going work at the Center of
Excellence and is an integral part of NASA's long term goals.

Candidate requirements:

* Strong interest in demonstrating autonomous analysis systems to
enhance science understanding in operational tests, with the ultimate
goal of putting such systems in space.
* Ph.D. degree in Computer Science, Electrical Engineering, or related
field, and applied experience, possibly within the PhD. In
exceptional cases, an M.S. degree with relevant work experience will
* Knowledge of neural or probabilistic networks, machine learning,
statistical pattern recognition, image processing, science data
processing, probabilistic algorithms, or related topics is essential.
* Strong communication and organizational skills with the ability to lead
a small team and interact with scientists.
* Strong C programming and Unix skills (experimental, not
necessarily production), with experience in programming mathematical
algorithms: C++, Java, MatLab, IDL.

Application deadline:

* March 15th, 1997 (hardcopy required -- see below).

Please send any questions by e-mail to the addresses below, and type
'PI for Autonomous data analysis' as your header line.

Dr. Ted Roush:
Dr. Wray Buntine:

Full applications (which must include a resume and the names and addresses
of at least two people familiar with your work) should be sent by surface
mail (no e-mail, ftp or html applications will be accepted) to:

Dr. Steve Lesh
Attn: PI for Autonomous data analysis
Mail Stop 269-1
NASA Ames Research Center
Moffett Field, CA, 94035-1000

============================== Summer students or Student Assistantship

NASA's Center of Excellence in Information Technology at
Ames Research Center invites current PhD students to apply for
a summer position (possibly two available).

Position description:

* We seek applicants to join a small team of space scientists and
computer scientists in developing NASA's next generation of smart
spacecraft with on-board, autonomous data analysis systems. The group
includes leading space scientists (Ted Roush, Virginia Gulick) and
leading data analysts (Wray Buntine, Peter Cheeseman).
* We are working with spectrometers and a CCD camera, and are
building resource-bounded autonomous classification systems,
and trainable object recognizers.
* The successful student will have considerable flexibility
within the goals of the project to contribute.
* An ideal summer project would produce demonstration software together
with a conference paper.

Candidate requirements:

* Knowledge of neural or probabilistic networks, machine learning,
statistical pattern recognition, image processing, science data
processing, probabilistic algorithms, or related topics is essential.
* Strong C programming and Unix skills (experimental, not
necessarily production), with experience in programming mathematical
algorithms: C++, Java, MatLab, IDL.
* Interest in revisiting the project at a later date.

Application deadline:

* We will accept applications on a continuing basis until
the beginning of summer, and will take good applicants as they apply.

Please send any questions by e-mail to the addresses below, and type
'PI for Autonomous data analysis' as your header line.

Dr. Ted Roush:
Dr. Wray Buntine:

Full applications (which must include a resume and the names and addresses
of at least two people familiar with your work) should be sent by surface
mail (no e-mail, ftp or html applications will be accepted) to:

Dr. Steve Lesh
Attn: summer student for Autonomous data analysis
Mail Stop 269-1
NASA Ames Research Center
Moffett Field, CA, 94035-1000

Previous  11 Next   Top
Date: Fri, 21 Feb 1997 14:11:12 -0500
From: (Brij Masand)
Subject: KDD Job at GTE Laboratories, Waltham, Ma

**** An Outstanding Applied Researcher/Developer needed for the **********
**** Knowledge Discovery in Databases project at GTE Laboratories **********

Description: Participate in the design and development of
state-of-the-art systems for data mining and knowledge discovery. The
focus of the job is on applied research in KDD, including development
of prototypes to demonstrate innovative business applications of KDD.

The candidate will join one of the leading R&D teams in the
area of data mining and knowledge discovery. Our current projects
include predictive customer modeling for GTE's cellular telephone
markets. We are applying multiple learning and discovery methods to
very large, high-dimensional real-world databases, involving millions
of records and Gbytes of data and have created KDD-based solutions
that are being deployed in the field.

The ideal candidate will have a Ph.D. in Machine Learning or
related fields and 2-3 years of experience, or an M.S. with equivalent
experience. The candidate should have experience with machine
learning algorithms, be familiar with statistical theory, have
practical experience with databases, and be proficient with
Web/Internet tools. Excellent coding skills in a C/Unix environment and
an ability to quickly pick up new systems and languages are needed. Good
communication skills, the ability to work in a team, and good coding
and system maintenance practices are very desirable.

GTE Laboratories Incorporated, located in Waltham, MA, is the central
research facility for GTE. GTE is among the largest local
exchange telephone carriers and the second largest mobile service
provider in the United States. Our research facility is located in a
quiet 50-acre campus-like setting in Waltham, MA, 20 minutes from
downtown Boston. Our salaries are competitive, and our outstanding
benefits include medical/life/dental insurance, saving
and investment plans, and an on-site fitness center.

Please send a resume and a cover letter
(preferably by e-mail, in ASCII) to:

or by fax to 617.466.3342 (Attn: Brij Masand)

I will be travelling till Mar 12th and will reply to email responses
after that. thanks! -- Brij Masand (

Previous  12 Next   Top
Subject: Two positions in Machine Learning/Data Mining at GMD
Date: Fri, 28 Feb 97 13:55:06 +0100

Two positions in Machine Learning/Data Mining at GMD

GMD's FIT.KI department (the AI research division of the
Institute for Applied Computer Science) is looking to
fill two scientist positions (M.S./Diplom or postdoc level) in the area of

Machine Learning/Data Mining.

We are looking for excellent people with a strong background in one
or both of these areas, preferably combining both theoretical/scientific
and application/software-engineering skills. Applications at both the
postdoctoral and the M.S. level are welcome.

You will be working as a research scientist in one of our current
ML/DM projects, KESO or ILP2, and will be part of FIT's data mining
group, which currently consists of 4 people. Scientific work, writing and
presentation of papers, and application and software work will all be
part of your job. M.S. level applicants will be given time to complete their
Ph.D.s while at GMD.

Both positions are to be filled as soon as possible, for a period of initially
two or three years, renewable for up to five years. Salary is according to
the BAT IIa tariff, in the range of approx. DEM 50.000 to DEM 80.000 depending
on age, qualifications, and marital status. For more information about FIT.KI, see, for more information about the ML/data mining group, see

If you are interested in such a position, please send your application
material to
Dr. Stefan Wrobel
Schloss Birlinghoven
53754 Sankt Augustin
to be received no later than March 23, 1997 (preferably by paper mail,
but E-Mail is o.k. if otherwise you cannot meet the deadline). Please
include at least a brief curriculum vitae, description of your qualifications,
research experience and future research interests, degree/grade information
(if relevant) and if applicable, a selection of three of your best publications
(full text copy). We are looking forward to your application!

Dr. Stefan Wrobel
GMD -- German Natl. Research Center for Information Technology
FIT.KI, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
Tel.: +49/2241/14-0, Fax: -2889 E-Mail:
Secr.: D. Boethgen Tel. -2731, E-Mail:
