Positions:
* Gregory Piatetsky-Shapiro, Data Mining Company looking for
  experts in decision trees and/or bayesian networks
* Donal Lyons, Data Mining Research Position in Ireland
* Yike Guo, Data Mining Job at Fujitsu (Japan)
Meetings:
* Pavel Brazdil, The Workshop on 'Extraction of Knowledge from Data Bases' (EKBD'97), Coimbra, Portugal, October 6-9, 1997
--
Data Mining and Knowledge Discovery community, focusing on the
latest research and applications.
Submissions are most welcome and should be emailed, with a
DESCRIPTIVE subject line (and a URL) to gps.
Please keep CFP and meetings announcements short and provide
a URL for details.
KD Nuggets frequency is 3-4 times a month.
Back issues of KD Nuggets, a catalog of data mining tools
('Siftware'), pointers to Data Mining Companies, Relevant Websites,
Meetings, and more are available at the Knowledge Discovery Mine site
at
********************* Official disclaimer ***************************
All opinions expressed herein are those of the contributors and not
necessarily of their respective employers (or of KD Nuggets)
*********************************************************************
~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
About the Deep Blue -- Kasparov match,
'I just think we should look at this as a chess match,' he said, 'between the
world's greatest chess player and Garry Kasparov.'
Louis Gerstner, IBM Chairman
Date: Thu, 8 May 1997 09:41:10 -0500 (EST)
From: GPS (gps)
Subject: First Issue of DMKD journal
The first issue of DMKD journal has finally been published!
see
The beautiful black and white cover shows an Escher-inspired picture
of several robots inside a mysterious structure (a data mine?), and
contents include
an editorial by Usama Fayyad, 4 excellent technical papers,
* Statistical Themes and Lessons for Data Mining
Clark Glymour, David Madigan, Daryl Pregibon, Padhraic Smyth
* Data Cube: A Relational Aggregation Operator Generalizing Group-by,
Cross-Tab, and Sub Totals
Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh
* On Bias, Variance, 0/1 - loss, and the Curse-of-Dimensionality
Jerome H. Friedman
* Bayesian Networks for Data Mining, David Heckerman
and a brief application summary:
* Advanced Scout: Data Mining and Knowledge Discovery in NBA data,
Inderpal Bhandari, Ed Colet, Jennifer Parker, Zachary Pines, Rajiv Pratap, Krishnakumar Ramanujam
Sample copies of first issue will be mailed soon.
Date: Wed, 30 Apr 1997 11:09:50 +0200 (MET DST)
From: Gerhard Widmer (gerhard@ai.univie.ac.at)
Subject: CfP: MLJ Special Issue on Context Sensitivity and Concept Drift
Machine Learning Journal
Special Issue on Context Sensitivity and Concept Drift
Miroslav Kubat and Gerhard Widmer, Guest Editors
MOTIVATION AND RESEARCH ISSUES
In many machine learning applications, the features given to the
learning program do not capture all aspects of the application problem.
This is a limitation shared with all forms of modeling -- even the
person who formulates the learning problem may not be aware of all of
the relevant context. Examples from the history of machine learning
and pattern recognition include omitting illumination features in
computer vision and omitting language accents in speech recognition
systems. A similar problem arises when the relevant features are
included, but the training examples do not provide enough variation
of those features to permit the learning algorithm to detect their
relevance. For example, if foreign accent features are included in a
speech recognition system, but all training examples are from native
speakers, then the foreign accent features will be ignored by the
learning system.
Relevant context may also change with time, so that a classifier
trained on one set of training examples (where a contextual feature
was absent or held constant) may suddenly begin to perform badly when
the context changes. Gradual or abrupt changes in context often
become apparent in the form of 'concept drift'. For situations
where a concept gradually evolves over time in a certain general
direction (such as the concept 'computer'), the term 'concept
evolution' has sometimes been used. Tracking concept drift on-line
requires a learner to continually monitor its performance and adjust
its hypotheses if necessary. It might also require the learner to
'forget' old, outdated information.
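As a rough illustration of the on-line tracking and 'forgetting' described
above, here is a minimal Python sketch. It is not taken from the call for
papers: the class name, the window size, and the toy label stream are all
assumptions made for the example. The learner simply predicts the majority
label among the most recent examples, so old labels drop out of memory as
new ones arrive.

```python
from collections import Counter, deque


class WindowedMajorityLearner:
    """Hypothetical learner: predicts the majority label in a recent window."""

    def __init__(self, window=20):
        # A bounded deque discards the oldest label automatically; this is
        # the 'forgetting' step. window=None keeps everything (no forgetting).
        self.recent = deque(maxlen=window)

    def predict(self):
        if not self.recent:
            return None
        return Counter(self.recent).most_common(1)[0][0]

    def update(self, label):
        self.recent.append(label)


def run_stream(labels, window=20):
    """Predict each label before seeing it; return a list of hit/miss flags."""
    learner = WindowedMajorityLearner(window)
    hits = []
    for y in labels:
        hits.append(learner.predict() == y)
        learner.update(y)
    return hits


# Toy stream (an assumption): the concept is 'a' for 100 steps, then 'b'.
stream = ["a"] * 100 + ["b"] * 100

with_forgetting = run_stream(stream, window=20)
without_forgetting = run_stream(stream, window=None)

# Errors made after the abrupt change at step 100.
print(with_forgetting[100:].count(False))     # bounded by the window size
print(without_forgetting[100:].count(False))  # never recovers in this stream
```

The fixed window is only one possible design; a learner that monitors its
own error rate and shrinks the window when performance drops would follow
the text more closely, at the cost of a longer example.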
In batch learning, problems may arise if the training data were
collected in batches from different contexts, or if the training
data were gathered in one setting but the test data are drawn from
a different setting. Again, effective learning requires the recognition
of such discontinuities and the ability to adapt hypotheses to
different conditions.
This special issue is devoted to theoretical and empirical studies
of methods for detecting missing context, tracking concept drift,
adapting learned knowledge to new contexts, and identifying and
reasoning about contextual effects and concept changes in learning.
We encourage submissions addressing one or more of the following
research issues:
. on-line tracking of concept drift and concept evolution
. theoretical results concerning concept drift and contextual influences
. formal definitions of context and its effects on concept learning
. real-world applications involving context changes and/or concept drift
. representation of context-sensitive concepts
. representation of context
. recognition of context and reasoning about context
. adaptation of learned knowledge to new contexts
Both theoretical and more practically oriented papers are welcome,
but we do encourage papers that provide real-world examples of context
sensitivity and concept drift and compare multiple ways of addressing
the problems that arise.
SUBMISSION INFORMATION:
The expected length is 8000-12000 words for a full paper, or 2000-4000
words for a Research Note (full-page figures count for 400 words).
Electronic submission via e-mail is STRONGLY ENCOURAGED. Postscript
files (compressed or gzipped, uuencoded) should be sent to
gerhard@ai.univie.ac.at.
For hardcopy submissions, please send 5 copies of the manuscript to:
Gerhard Widmer
Austrian Research Institute for Artificial Intelligence
Schottengasse 3
A-1010 Vienna
Austria
Tel: +43-1-53532810
Fax: +43-1-5320652
e-mail: gerhard@ai.univie.ac.at
The special issue is scheduled to appear in the summer of 1998.
Date: Mon, 28 Apr 1997 13:38:14 +0200
To: gps
From: Aleksander Oehrn (Aleksander.Oehrn@idi.ntnu.no)
Subject: Rosetta availability
===================================================
Rosetta -- A Rough Set Toolkit for Analysis of Data
===================================================
Rosetta is a toolkit for analyzing tabular data within the framework of
rough set theory, and consists of a computational kernel and a GUI
front-end. The Rosetta GUI reflects the contents of the kernel, and runs on
PCs operating under Windows NT or Windows 95.
A limited version of Rosetta is made publicly available for non-commercial
use. The downloadable program is limited in the sense that algorithms from
the embedded RSES library are not applicable to decision tables larger than
some predetermined size (currently 500 objects and 20 attributes).
The software (including documentation) is provided 'as is' without warranty
of any kind.
Kernel architecture and front-end designed and implemented at the Knowledge
Systems Group, Dept. of Computer and Information Science, Norwegian
University of Science and Technology, Norway. Sections of the computational
kernel (RSES) developed at the Logic Group, Inst. of Mathematics,
University of Warsaw, Poland.
Rosetta is designed to support the overall KDD process; from initial
browsing and preprocessing of the data, via reduct computation and rule
generation, to validation and analysis of the extracted rules.
Some of the features currently offered by the computational kernel include
amongst others:
- Completion of decision tables with missing values
according to various completion strategies.
- Computation of partitions and rough set approximations
within the variable precision model.
- Sampling of subtables for validation purposes.
- Discretization of numerical attributes with various
discretization algorithms.
- Computation of reducts (both in the standard sense as well
as object-related ones). Various approximation algorithms
(e.g. genetic algorithms) are offered, as well as exhaustive
computation via discernibility matrices. Dynamic reducts can
be computed.
- Generation of propositional rules.
- Shortening and pruning of sets of reducts and rules.
- Exporting of rules, reducts and tables, e.g. to Prolog.
- Application of synthesized rules to unseen examples by means
of various classification strategies, e.g. voting.
- Generation of confusion matrices.
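
For readers new to rough sets, the following Python sketch shows what two of
the kernel features listed above mean on a toy decision table. It does not
use Rosetta or RSES and says nothing about how they are implemented: the
function names and the table are assumptions, the approximations are the
classical ones (not the variable precision model), and reducts are found by
brute-force subset search rather than via discernibility matrices.

```python
from itertools import combinations


def partition(rows, attrs):
    """Group object indices that are indiscernible on the given attributes."""
    classes = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, set()).add(i)
    return list(classes.values())


def approximations(rows, attrs, target):
    """Return the (lower, upper) approximations of the object set `target`."""
    lower, upper = set(), set()
    for block in partition(rows, attrs):
        if block <= target:      # block lies entirely inside the target
            lower |= block
        if block & target:       # block overlaps the target
            upper |= block
    return lower, upper


def reducts(rows, attrs):
    """Exhaustively find minimal attribute subsets preserving the partition."""
    def canon(p):
        return sorted(sorted(b) for b in p)

    full = canon(partition(rows, attrs))
    found = []
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            # Skip supersets of a reduct already found: they are not minimal.
            if any(set(r) <= set(subset) for r in found):
                continue
            if canon(partition(rows, subset)) == full:
                found.append(subset)
    return found


# Toy decision table (hypothetical data); 'flu' is the decision attribute.
table = [
    {"headache": "yes", "temp": "high",   "cough": "yes", "flu": "yes"},
    {"headache": "yes", "temp": "high",   "cough": "yes", "flu": "no"},
    {"headache": "no",  "temp": "high",   "cough": "no",  "flu": "yes"},
    {"headache": "no",  "temp": "normal", "cough": "no",  "flu": "no"},
    {"headache": "yes", "temp": "normal", "cough": "yes", "flu": "no"},
]
conditions = ["headache", "temp", "cough"]
flu_cases = {i for i, row in enumerate(table) if row["flu"] == "yes"}

lower, upper = approximations(table, conditions, flu_cases)
print(lower)   # {2}: object 2 is certainly a flu case
print(upper)   # {0, 1, 2}: objects 0 and 1 are indiscernible but disagree
print(reducts(table, conditions))
# [('headache', 'temp'), ('temp', 'cough')]
```

Exhaustive search is exponential in the number of attributes, which is
presumably why the announcement also mentions approximation algorithms.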
Some of the features currently offered by the Rosetta GUI include amongst
others:
- Full Windows GUI conformance.
- Organization of project items in a tree-structure in order to
retain data-navigational abilities.
- Viewing of all structures in intuitive grid environments, using
terms from the modelling domain.
- Context-sensitive menus.
- Drag and drop functionality.
- Masking of attributes, enabling one to work with 'virtual'
tables.
- Automatic generation of annotations, thus documenting the
modelling session.
- A prototype environment for interactive classification and guidance
on the basis of incomplete information, using a selected set of
synthesized rules.
- On-line help.
Date: Wed, 7 May 1997 17:37:13 -0400
From: Larry Bouchie (lbouchie@lnscom.com)
Cognos' Scenario data mining product was released
last month. Cognos' main Web page is at
COGNOS UNVEILS SCENARIO FOR DATA MINING
-- New Data Mining Software Joins Cognos' Market-Leading Business
Intelligence Tools, PowerPlay For OLAP And Impromptu For Query &
Reporting --
BURLINGTON, MA, March 3, 1997 -- Cognos (NASDAQ:COGNF; TSE:CSN) today
announced its newest business intelligence tool, Scenario, for
enterprise-wide guided data analysis and data mining. Scenario extends the
industry's most comprehensive business intelligence product family, joining
Cognos' market-leading PowerPlay, the universal online analytical
processing (OLAP) client, and the award-winning Impromptu query and
reporting tool.
Designed for spotting patterns and exceptions in business data that might
otherwise be missed, Scenario's sophisticated interface allows users to
readily visualize the business information being uncovered. It automates
the discovery and ranking of critical factors impacting a business, exposes
hidden relationships between factors and establishes thresholds and
benchmarks. An intuitive, cost-effective desktop tool, Scenario liberates
data mining from what is typically an expensive and time-consuming process.
Insights derived using Scenario are achieved directly by those best
positioned to use the knowledge and effect rapid change.
Designed to support faster business decision-making, Scenario:
* makes data mining immediately accessible to decision makers;
* simplifies business data analysis by filtering out insignificant business
variables and relationships;
* validates business hypotheses by showing and ranking critical factors and
relationships;
* leads to new business insights by automating information discovery; and
* integrates with Impromptu and PowerPlay as best-of-breed components in
the Cognos enterprise business intelligence solution.
'With Scenario, Cognos is delivering a very important technology to
business analysts,' said George Azrak, national director of IS development
at Domino's Pizza. Domino's Pizza has been working with early versions of
Scenario, and has provided Cognos with valuable input from an end user's
point of view.
'Accessible data mining is the long-awaited third wave in the data
warehousing revolution,' said Alan Rottenberg, Cognos' senior vice
president, Business Intelligence Tools. 'First query and reporting brought
data to the desktop, then OLAP technologies enabled the convenient
navigation of massive data warehouses. Data mining is the technological
leap that automates the information discovery process.'
Rottenberg continued, 'Impromptu gives access to the numbers and data on
which a business runs. PowerPlay lets individual managers explore that
data without an army of programmers. Scenario works alongside both of
those products to refine business data to distinguish what really matters.
Drawing a straight line to the bottom line, this product completes the
spectrum of business intelligence tools that can arm knowledge workers with
the insight to truly understand the data that drives a business -- and to
reap the competitive rewards.'
Scenario uses statistical methods that go beyond 'tree' analysis. For
example, one such method is a data segmentation capability based on CHAID
(Chi-Squared Automatic Interaction Detection) technology. CHAID allows
users to find statistically relevant relationships and trends within large
repositories of business data by 'refining' it down to the most useful
nuggets that have the greatest effect on the results being tracked.
Subsequent releases of Scenario will include neural-network modeling and
forecasting capabilities, using technologies from recently acquired Right
Information Systems.
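
To give a feel for the chi-squared idea behind CHAID mentioned above, here
is a small Python sketch. It is not Cognos code and not a full CHAID
implementation: it only computes the Pearson chi-squared statistic between
each candidate predictor and the outcome and ranks the predictors by it.
Real CHAID also merges categories and compares adjusted p-values, which
matters when predictors have different numbers of categories. The function
names and the customer records are hypothetical.

```python
from collections import Counter


def chi_squared(xs, ys):
    """Pearson chi-squared statistic for two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    x_tot = Counter(xs)            # row totals
    y_tot = Counter(ys)            # column totals
    stat = 0.0
    for x in x_tot:
        for y in y_tot:
            expected = x_tot[x] * y_tot[y] / n
            observed = joint[(x, y)]
            stat += (observed - expected) ** 2 / expected
    return stat


def rank_predictors(records, predictors, target):
    """Sort predictors by their chi-squared association with the target."""
    ys = [r[target] for r in records]
    scores = {p: chi_squared([r[p] for r in records], ys) for p in predictors}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical records: 'plan' determines churn, 'region' is unrelated.
records = [
    {"plan": "basic",   "region": "east", "churned": "yes"},
    {"plan": "basic",   "region": "west", "churned": "yes"},
    {"plan": "basic",   "region": "east", "churned": "yes"},
    {"plan": "basic",   "region": "west", "churned": "yes"},
    {"plan": "premium", "region": "east", "churned": "no"},
    {"plan": "premium", "region": "west", "churned": "no"},
    {"plan": "premium", "region": "east", "churned": "no"},
    {"plan": "premium", "region": "west", "churned": "no"},
]

print(rank_predictors(records, ["region", "plan"], "churned"))
# [('plan', 8.0), ('region', 0.0)]
```

A tree-style segmentation would split on the top-ranked predictor and then
repeat the ranking inside each resulting segment.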
Pricing and Availability
Available from Cognos for $695, Scenario 1.0 for Windows 95 or Windows NT
requires an IBM-compatible 486 PC and 8 MB of RAM.
Date: Thu, 8 May 1997 10:40:10 -0500 (EST)
From: Gregory Piatetsky-Shapiro (gps@genevecon.com)
Subject: Looking for experts in decision trees and/or bayesian networks
** Data Mining Consulting and Integration Company is looking for
experts in decision trees and/or bayesian networks **
TASK: Participate in the design, development, and deployment of leading
edge integrated data mining and customer modeling systems, primarily in
the financial area. Perform quick data mining studies using a variety of
different approaches and tools.
The candidates will join a team of world-class experts in data
warehousing, data mining and knowledge discovery.
Ideal candidates will have a Ph.D. in Machine Learning, Statistics,
or related fields and 2-3 years of experience, or an M.S. with
equivalent experience. The candidates should have expertise with
different modeling approaches, but primarily
with decision trees/rules or with bayesian belief networks.
The candidates should be familiar with statistical theory and have practical
experience with databases.
Excellent coding skills in C/Java/Unix environment along with
good system maintenance practices and the ability to
quickly pick up new systems and languages are needed.
The candidates should also have good communication skills, be
able to work in a team, and be able to enjoy the exciting atmosphere of
a start-up company.
Most of all, candidates should have the passion for developing and
applying innovative methods for solving practical problems.
We offer very competitive salaries, and our outstanding benefits include
profit sharing, stock options, medical/dental insurance, and a 401(k)
plan.
The data mining branch of the company is conveniently located in the
Cambridge area, easily accessible by public transportation.
Proper work authorization required.
Please email your resume and a cover letter (in plain ASCII, please) to:
Gregory Piatetsky-Shapiro, Ph.D.
Director of Applied Research
Geneve Consulting Group
545 Concord Ave
Cambridge MA 02138
email: gps@genevecon.com
tel: 617-661-1358
fax: 617-491-4936
URL:
Subject: Data Mining Research Position possibility.
Date: Sat, 26 Apr 1997 11:57:24 +0100
From: Donal Lyons (dlyons@stats.tcd.ie)
Currently there is EU funding available for experienced researchers to
spend a year in countries such as Ireland. I wish to explore the
possibility of using this funding to help develop a Data Mining Interest
Group within the School of Systems and Data Studies in Trinity College,
Dublin.
I'd like to discuss this further with any experienced EU researchers who
are at least tentatively interested.
Regards,
Donal.
Donal Lyons, Phone (1000-1700 GMT) +353 1 608 1919
Lecturer (Information Systems) Phone Messages +353 1 608 1767
School of Systems & Data Studies
Trinity College, Dublin 2, FAX on request
Ireland.
Date: Mon, 5 May 97 11:48 BST
From: Yike Guo (yg@doc.ic.ac.uk)
Subject: Job in Japan
A Fujitsu subsidiary company which is developing OLAP and datamining tools
is now looking for a foreign engineer who is interested in working in Japan.
Career opportunity for a programming engineer in Japan
Duties
Designing and programming data mining products which include
a visualizing OLAP client.
Requirements
- BS or MS degree related to computer science
- C programming skill (VC++ on NT background is best)
- Familiarity with datamining, visualization, or OLAP
- Native English speaker
Contact
Fujitsu SWE, Manager Mr. Katoh
E-mail: hiromi@swe.fujitsu.co.jp
The Workshop on 'Extraction of Knowledge from Data Bases' (EKBD'97)
Under the auspices of the
Portuguese Conference on Artificial Intelligence (EPIA'97) Coimbra,
Portugal, October 6-9, 1997
October, 7-8, 1997
Coimbra University Physics Building
Aims of the Workshop
This workshop is in the area of Extraction (or Discovery) of Knowledge from
Data Bases and Data Mining, which are rather recent but expanding
rapidly. The objective of the workshop is to discuss methods for non-trivial
extraction of information which is implicit in the existing data and which
can be represented in a high-level language so as to facilitate interpretation.
EKBD'97 welcomes original papers in English on the following topics:
- Machine Learning methods useful in KDD and Data Mining,
(decision tree /rule induction, relational learning (ILP) etc.)
- Statistical methods useful in KDD and Data Mining,
(multivariate analysis, principal components, clustering, regression
methods etc.),
- Reduction of complexity through preprocessing,
(identification of relevant attributes, data sampling, clustering, etc.),
- Data summarization and consolidation,
- Languages useful in describing user's hypotheses,
- Applications of KDD and Data Mining,
- other related areas of interest.
Workshop Format and Attendance Requirements:
The workshop will include invited talks, paper presentations and a panel
discussion. The workshop will last 1-2 days.
Papers in English of no more than 15 pages are welcome.
Attendees should be registered for the main EPIA conference.
(see
Submit 3 copies of the full paper to the address below:
Pavel Brazdil
LIACC, Universidade do Porto,
R. Campo Alegre, 823,
4150 PORTO, PORTUGAL
Text format should follow Springer Verlag Lecture Notes Series.
English is the official language of the workshop.
Important dates:
June, 16: submissions due
July, 15: notifications sent
September, 8: final versions due
Programme Committee:
Pavel Brazdil, Univ.Porto (chair)
Arlindo Oliveira, IST
Carlos Bento, U. Coimbra
Ernesto Costa, U. Coimbra
Fernando Moura-Pires, UNL-FCT
Fernando Nicolau, UNL-FCT
Helena Bacelar Nicolau, UNL-FCT
Joaquim Pinto da Costa, Univ. Porto
Paulo Azevedo, Univ. Minho
Paula Brito, Univ. Porto
Paulo Gomes, INE, Porto
Organizing Committee:
Pavel Brazdil (chair)
LIACC, Universidade do Porto, R. Campo Alegre, 823,
4150 PORTO, PORTUGAL
email: pbrazdil@ncc.up.pt
Tel.: (02) 600 1672, Fax: (02) 600 3654
Fernando Moura-Pires
UNL-FCT, Dept. Informatica, Quinta da Torre
2825 Monte da Caparica, PORTUGAL
email: fmp@fct.unl.pt
Tel.: (01) 295 4464, Fax: (01) 295 5641
Subject: IDA Call for Participation
Date: Thu, 8 May 1997 17:43:12 +0200
From: Michael Berthold (berthold@ira.uka.de)
CALL FOR PARTICIPATION
The Second International Symposium on Intelligent Data Analysis (IDA-97)
Birkbeck College, University of London
4th-6th August 1997
In Cooperation with
AAAI, ACM SIGART, BCS SGES, IEEE SMC, and SSAISB
You are invited to participate in IDA-97, to be held in the heart of London.
IDA-97 will be a single-track conference consisting of oral and poster
presentations, invited speakers, demonstrations and exhibitions. The
conference Call for Papers introduced a theme, 'Reasoning About Data',
and many papers complement this theme, but other, exciting topics have emerged,
including exploratory data analysis, data quality, knowledge discovery and
data-analysis tools, as well as the perennial technologies of classification
and soft computing. A new and exciting theme involves analyzing time series
data from physical systems, such as medical instruments, environmental data
and industrial processes.
Information regarding registration as well as the preliminary technical
program can be found on the IDA-97 web page (address listed above). Please
note that there are reduced rates for early registration (before 2nd June).
Also there are still a limited number of spaces available for exhibition,
and potential exhibitors are encouraged to book early (the application
deadline is 2nd June).
From: 'Staal Vinterbo' (pkdd97@idi.ntnu.no)
Date: Tue, 6 May 1997 18:05:56 +0200
To: kdd@gte.com
Subject: PKDD'97 Call for participation
Dear Sir.
I am asking on behalf of Prof. Komorowski that the following call for
participation be distributed via the kdd nuggets mailing list.
Thank you.
PKDD'97 -- Call For Participation
1st European Symposium on Principles of
Data Mining and Knowledge Discovery
Trondheim, Norway
June 24-27, 1997
Tutorials: June 24-25
Symposium: June 26-27
Data Mining and Knowledge Discovery (KDD) have recently emerged from a
combination of many research areas: databases, statistics, machine
learning, automated scientific discovery, inductive programming,
artificial intelligence, visualization, decision science, and high
performance computing.
While each of these areas can contribute in specific ways, KDD focuses on
the value that is added by creative combination of the contributing areas.
The goal of PKDD'97 is to provide a European-based forum for interaction
among all theoreticians and practitioners interested in data mining.
Fostering an interdisciplinary collaboration is one desired outcome, but
the main long-term focus is on theoretical principles for the emerging
discipline of KDD, especially those new principles that go beyond each of
the contributing areas.
From: tibs@utstat.toronto.edu
Date: Sun, 4 May 97 12:10 EDT
Subject: Modern Regression and Classification course - New York
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ +++
+++ Modern Regression and Classification: +++
+++ +++
+++ Statistical prediction methods for finance +++
+++ and marketing +++
+++ +++
+++ +++
+++ New York City: June 23-24, 1997 +++
+++ +++
+++ Trevor Hastie, Stanford University +++
+++ Rob Tibshirani, University of Toronto +++
+++ +++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
This two-day course will give a detailed overview of statistical models
for regression and classification. Known as machine-learning in
computer science and artificial intelligence, and pattern recognition
in engineering, this is a hot field with powerful applications in
finance, science and industry.
This course covers a wide range of models from linear regression
through various classes of more flexible models to fully nonparametric
regression models, both for the regression problem and for
classification.
This special version of our popular MRC course is tailored to financial
and marketing professionals.
Although a firm theoretical motivation will be presented, the emphasis
will be on practical applications and implementations, especially in
the finance and marketing areas. The course will include many examples
and case studies, and participants should leave the course well-armed
to tackle real problems with realistic tools. The instructors are at
the forefront in research in this area.
After a brief overview of linear regression tools, methods for
one-dimensional and multi-dimensional smoothing are presented, as well
as techniques that assume a specific structure for the regression
function. These include splines, wavelets, additive models, MARS
(multivariate adaptive regression splines), projection pursuit
regression, neural networks and regression trees. All of these can be
adapted to the time-series framework for predicting future trends from
the past.
The same hierarchy of techniques is available for classification
problems. Classical tools such as linear discriminant analysis and
logistic regression can be enriched to account for nonlinearities and
interactions. Generalized additive models and flexible discriminant
analysis, neural networks and radial basis functions, classification
trees and kernel estimates are all such generalizations. Other
specialized techniques for classification including nearest-neighbor
rules and learning vector quantization will also be covered.
Apart from describing these techniques and their applications to a wide
range of problems, the course will also cover model selection
techniques, such as cross-validation and the bootstrap, and diagnostic
techniques for model assessment.
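
As a small illustration of the two model selection techniques named above,
here is a Python sketch of k-fold cross-validation and a bootstrap standard
error. It is not taken from the course materials: the function names, the
straight-line model, and the noise-free toy data are assumptions chosen to
keep the example short.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def cv_mse(xs, ys, k, rng):
    """Mean squared prediction error, each point predicted while held out."""
    total = 0.0
    for fold in k_fold_indices(len(xs), k, rng):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold)
    return total / len(xs)


def bootstrap_se(values, n_boot, rng):
    """Bootstrap estimate of the standard error of the sample mean."""
    n = len(values)
    means = []
    for _ in range(n_boot):
        # Resample n values with replacement and record the mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    grand = sum(means) / n_boot
    return (sum((m - grand) ** 2 for m in means) / (n_boot - 1)) ** 0.5


rng = random.Random(0)                    # fixed seed for repeatability
xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x for x in xs]          # noise-free line (toy data)

print(cv_mse(xs, ys, k=5, rng=rng))       # essentially zero for exact data
print(bootstrap_se(ys, n_boot=200, rng=rng))
```

With noisy data, the cross-validated error of competing models (for example
a line versus a more flexible smoother) can be compared to choose between
them.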
Software for these techniques will be illustrated, and a comprehensive
set of course notes will be provided to each attendee.
Additional information is available at the Website:
We see signs today that the Web is moving toward an environment where
new social and collaborative interactions are being realized. Rather
than continuing to evolve as a single-user environment, the Web is
beginning to be regarded as an environment where reciprocity and
awareness of others' activities have an important function. Software
agents can help develop and support the process of reciprocity by
helping people find others with similar interests, and helping match
knowledge to the right people. Agents can also help people collectively
construct knowledge, shaped around their needs.
This full-day workshop is intended for designers and researchers from
academia and industry to discuss the role of agents in dealing with
social information. How can social agents be integrated into
collaborative relationships so that information and expertise can be
distributed and matched to the right people, where appropriate
relationships can be developed, and where collective knowledge can be
established?
Participation requires the submission of an input paper (3-6 pages) that
should try to address the points described above, from any of the
following aspects:
-experiences with agent use in collaboration
-design of agent systems
-application areas
-interface design
The paper should be sent for review by June 15 to:
Thomas Kreifelts
GMD-FIT.CSCW
D-53754 Sankt Augustin
Germany
Email: kreifelts@gmd.de
Fax: +49-2241-142084
Electronic submission is encouraged, HTML being the preferred format.
The selection of participants will be based on the input papers.
Accepted participants will be notified before the end of June so that
they can take advantage of early registration by July 1. For those who
are interested in submitting a paper to the workshop, but are not able
to meet the June 15 deadline, please contact the organizers as soon as
possible expressing your interest to participate in the workshop. The
accepted input papers will be distributed electronically in advance to
the workshop participants. The workshop will be structured around the
presentation of selected input papers to stimulate the discussion. Note
that participation in the workshop requires participation in the ECSCW
97 conference.
Important Dates:
----------------
June 15, 1997 - Deadline for submissions
end of June - Notification of acceptance
July 1, 1997 - Early registration deadline for the ECSCW '97
conference