

Data Mining and Knowledge Discovery Nuggets 96:26, e-mailed 96-08-13

News:
* S. Hedberg, End users of KDD applications -- Reply ASAP please!
* GPS, PR NewsWire: DCI Data Warehouse World, Data Mining Gold Rush,...
* M. Smyth, Intensive Tutorial in Learning Methods ...,
http://www.ai.mit.edu/projects/cbcl/web-pis/jordan/course/index.html
Publications:
* GPS, SYLLOGIC on Support for data mining algorithms in a relational
environment, http://www.syllogic.nl/art0001.html
* D. Aubrey, Mining for Dollars
* V. Raghavan, CFP: J. ASIS Special Issue on Data Mining,
http://www.usl.edu/~raghavan/JASIS97.html
Siftware:
* J. Brown, Complexity and Predictions,
http://www.hal-pc.org/~jpbrown/hmpg16.html
* N. Smith, Data preprocessing software for neural network users,
http://www.jurikres.com
Positions:
* K. Ali, Data Mining positions at IBM San Jose
--
Nuggets is a newsletter for the Data Mining and Knowledge Discovery community,
focusing on the latest research and applications.

Contributions are most welcome and should be emailed,
with a DESCRIPTIVE subject line (and a URL, when available) to (kdd@gte.com).
E-mail add/delete requests to (kdd-request@gte.com).

Nuggets frequency is approximately weekly.
Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
and a wealth of other information on Data Mining and Knowledge Discovery
are available at the Knowledge Discovery Mine site, URL http://info.gte.com/~kdd.

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories) *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Have you ever stopped to think and forgotten to start again?
Anonymous

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Mon, 12 Aug 1996 16:28:25 -0700 (PDT)
From: Sara Hedberg (hedberg@halcyon.com)
Subject: Looking for KDD end users

WANTED: End users of KDD applications -- ASAP please!

I am working on an article for the upcoming data mining issue of IEEE's
Intelligent Systems (formerly Expert) magazine. My article will focus on
users of KDD applications to learn more about their experiences with this
technology. (Here I mean the real end users, not developers who use KDD
tools.) Ideally, these users would have had previous experience with
statistical tools as a point of comparison, although this is by no means a
requirement.

My deadline for the article is August 23rd, so I'm anxious to get moving
on this.

If you have suggestions, please send me the following information:

Name of User:
Contact Information (phone, fax, email):
Brief Description of the system and user's role (if you have time):

Thank you in advance,
Sara Hedberg


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 13 Aug 1996 17:00:16 -0400
From: Gregory Piatetsky-Shapiro (gps0@gte.com)
Organization: GTE Laboratories
Subject: PR NewsWire: DCI Data Warehouse World, Data Mining Gold Rush,...

This article comes from Ziff-Davis Personal View, but I could not
find a direct http pointer. -- GPS



Data Mining Gold Rush, New Products, Press Briefings Will Highlight DCI Data...


Received: August 12, 1996 03:47pm EDT From: PR Newswire

Data Mining Gold Rush, New Products, Press Briefings
Will Highlight DCI Data Warehouse World

New York Show is Premier Event in Data Warehousing Industry

Note to IT Business, Financial, Data Warehouse, Internet/Intranet and
Horizontal Editors: The following release contains details of the Data Mining
Gold Rush at the upcoming DCI Data Warehouse World in New York, August 13-15,
as well as a listing of companies announcing new products at the show:


The Second Data Mining Gold Rush(TM), as well as a number of new product
announcements and press briefings, will be among the highlights of the DCI
Data Warehouse World at the New York Hilton and Towers, August 13-15, 1996.
Conceived by DCI and META Group, the Gold Rush -- the first one of which
was held at the recent DCI Data Warehouse World in Santa Clara, California --
is an exercise in data mining, one of the hottest areas of data warehousing.
As in the initial Gold Rush (the first-ever data mining exercise held during a
trade show), information surveys will be completed by several hundred top-level
executives attending the show. The information will then be compiled by META
Group and Market Perspectives, who will subsequently turn it over to three
companies -- Datamind, SAS Institute, Inc., and Silicon Graphics -- for
analysis. The automated knowledge discovery performed on this data is expected
to reveal extensive demographic and psychographic information about conference
attendees.
With data mining poised to become such a widely utilized technology, this
data mining exercise is particularly timely, according to Aaron Zornes,
Executive Vice President of META Group and DCI Data Warehouse World Chairman.
'Data mining is becoming a huge competitive advantage for businesses in
all markets,' he said. 'Companies that implement successful data mining
strategies will be at the forefront of their industries.'
The show will also serve as a forum for the announcement of at least six
new data warehousing products. Scheduled to make announcements are Planning
Sciences, Inc., Platinum Technology, Postalsoft, SAS Institute, Inc., Silicon
Graphics, and ZYGA CORPORATION.
In addition, a total of eight press briefings will be held on a variety of
subjects related to data warehousing, each led by leading industry experts.
Members of the media interested in registering for the show should call
Dave Costello in the New York Hilton Press Room at 212-261-6180.
Digital Consulting, Inc., located in Andover, Massachusetts, is a world
leader in high-technology education, trade shows, and management consulting.
META Group, Inc., headquartered in Stamford, Connecticut, is an
independent market assessment company in information technology.
In addition to this show, DCI and META Group co-sponsored the Santa Clara
version of this event and will co-sponsor DCI's Data Warehousing Conference,
September 17-19, 1996, to be held at the Phoenix Civic Plaza.

CONTACT: Mike Littlewood of Gray & Rice PR, 212-261-6180.

SOURCE Digital Consulting, Inc.


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: marney@ai.mit.edu (Marney Smyth)
Subject: Intensive Tutorial in Learning Methods for Prediction, Classification...
Date: Tue, 6 Aug 1996 20:01:00 -0400 (EDT)

**************************************************************
***                                                        ***
***    Learning Methods for Prediction, Classification,    ***
***      Novelty Detection and Time Series Analysis       ***
***                                                        ***
***         Cambridge, MA, September 20-21, 1996          ***
***         Los Angeles, CA, December 14-15, 1996         ***
***                                                        ***
***        Geoffrey Hinton, University of Toronto         ***
***     Michael Jordan, Massachusetts Inst. of Tech.      ***
***                                                        ***
**************************************************************


A two-day intensive Tutorial on Advanced Learning Methods will be held
on September 20 and 21, 1996, at the Royal Sonesta Hotel, Cambridge, MA,
and on December 14 and 15, 1996, at Loews Hotel, Santa Monica, CA.
Space is available for up to 50 participants for each course.

The course will provide an in-depth discussion of the large collection
of new tools that have become available in recent years for developing
autonomous learning systems and for aiding in the analysis of complex
multivariate data. These tools include neural networks, hidden Markov
models, belief networks, decision trees, memory-based methods, as well
as increasingly sophisticated combinations of these architectures.
Applications include prediction, classification, fault detection,
time series analysis, diagnosis, optimization, system identification
and control, exploratory data analysis and many other problems in
statistics, machine learning and data mining.

The course will be devoted equally to the conceptual foundations of
recent developments in machine learning and to the deployment of these
tools in applied settings. Case studies will be described to show how
learning systems can be developed in real-world settings. Architectures
and algorithms will be presented in some detail, but with a minimum of
mathematical formalism and with a focus on intuitive understanding.
Emphasis will be placed on using machine learning methods as tools that
can be combined to solve the problem at hand.

WHO SHOULD ATTEND THIS COURSE?

The course is intended for engineers, data analysts, scientists,
managers and others who would like to understand the basic principles
underlying learning systems. The focus will be on neural network models
and related graphical models such as mixture models, hidden Markov
models, Kalman filters and belief networks. No previous exposure to
machine learning algorithms is necessary although a degree in engineering
or science (or equivalent experience) is desirable. Those attending
can expect to gain an understanding of the current state-of-the-art
in machine learning and be in a position to make informed decisions
about whether this technology is relevant to specific problems in
their area of interest.

COURSE OUTLINE

Overview of learning systems; LMS, perceptrons and support vectors;
generalized linear models; multilayer networks; recurrent networks;
weight decay, regularization and committees; optimization methods;
active learning; applications to prediction, classification and control

Graphical models: Markov random fields and Bayesian belief networks;
junction trees and probabilistic message passing; calculating most
probable configurations; Boltzmann machines; influence diagrams;
structure learning algorithms; applications to diagnosis, density
estimation, novelty detection and sensitivity analysis

Clustering; mixture models; mixtures of experts models; the EM
algorithm; decision trees; hidden Markov models; variations on
hidden Markov models; applications to prediction, classification
and time series modeling

Subspace methods; mixtures of principal component modules; factor
analysis and its relation to PCA; Kalman filtering; switching
mixtures of Kalman filters; tree-structured Kalman filters;
applications to novelty detection and system identification

Approximate methods: sampling methods, variational methods;
graphical models with sigmoid units and noisy-OR units; factorial
HMMs; the Helmholtz machine; computationally efficient upper
and lower bounds for graphical models

REGISTRATION

Standard Registration: $700

Student Registration: $400

Registration fee includes course materials, breakfast, coffee breaks,
and lunch on Saturday.

Those interested in participating should return the completed
Registration Form and Fee as soon as possible, as the total number of
places is limited by the size of the venue.


ADDITIONAL INFORMATION

A registration form is available from the course's WWW page at

http://www.ai.mit.edu/projects/cbcl/web-pis/jordan/course/index.html

Marney Smyth
CBCL at MIT
E10-034D
77 Massachusetts Avenue
Cambridge, MA 02139
USA

Phone: 617 258-8928
Fax: 617 253-2964
E-mail: marney@ai.mit.edu



>~~~Publications:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 06 Aug 1996 13:02:46 -0400
From: Gregory Piatetsky-Shapiro (gps@gte.com)
Subject: An article by SYLLOGIC on
Support for data mining algorithms in a relational environment

(see http://www.syllogic.nl/art0001.html)

I enclose the following interesting article I found. I do not necessarily
agree with all of it, but it can stimulate interesting ideas.

-- Gregory
P.S. Many parts, however, cannot be read without an HTML browser.

Support for data mining algorithms in a relational environment (1)



From: Data Warehouse Report, Volume 6, Summer 1996.

Published with kind permission from Data Warehouse Network

PO Box 7, Skibbereen, Cork, Ireland.

Tel +353 28 38483.

Fax +353 28 38485.

© Data Warehouse Network, 1996, Ireland.


Written by:

WOUTER SENF and PIETER ADRIAANS



The creation of large data warehouses and data mining environments will have a deep influence on database technology.
Data warehousing and data mining involve a shift from viewing database systems as plain administrative records to viewing
them as a production factor. Databases are seen as potential sources of new information. Knowledge discovery in databases
(KDD) is defined as the extraction of implicit, previously unknown, and potentially useful information from databases.
This article shows how bit-mapped indexes can speed up data mining search algorithms, in particular for building decision
trees and finding association rules.



In order to fulfil the needs of a KDD environment, adaptation of traditional relational technology is necessary. Contrary
to common belief, data mining is first and foremost a server technology. Current-generation data mining tools
are mainly client tools, with attractive graphical user interfaces, that use flat files or relational databases as input.
The performance of these tools on truly large data sets is poor, and will not improve unless the underlying database
technology is adapted.


The current relational technology, with its stress on efficiency of updates and exact identification of records, is ill
suited for the support of data mining algorithms. Pattern recognition algorithms ask for efficient sampling methods,
storage of intermediate results during execution of queries, flexible user-defined functions, bit-mapped operations,
and geometric indexes that allow users to search in the neighbourhood of a specific record. All these functions can in
principle be implemented on top of existing database management systems, since they are based on generalisations of
the same mathematical structure: first order logic and relational algebra.



The actual pattern discovery stage of the KDD process is currently called data mining. At this moment, most data mining
tools are developed on top of traditional database platforms, and use SQL as a query language. There are indications,
however, that this situation is far from ideal:




Data mining



Traditional decision support tools are used to verify hypotheses that have been posed by business experts. In contrast,
data mining tools are used to generate hypotheses. Such hypotheses will be verified against the data. As a result, a set
of rules that were previously unknown may emerge.



Data mining is not a single technique: any technique that will help to get more information out of data is useful.
Therefore, data mining techniques form a heterogeneous group. Different techniques are used for different purposes. Some
of the more interesting data mining techniques are:





For this article, we use a sample database consisting of records containing subscription data for magazines. It is a
selection of the operational data from the invoicing system of a publisher. It contains information about people that have
subscribed to a magazine. After coding and cleaning, the records consist of a client number, age, income, credit,
information concerning car and house ownership, area code, and five binary values indicating the type of magazines to
which the customer has subscribed.



Decision trees



Classification and prediction (for example, using a table of data on customer behaviour) are closely related. Predicting a
certain customer's behaviour implies the assumption that the customer belongs to a customer group with typical behaviour.



Our database contains attributes like age, income, and credit. If we want to predict customer behaviour, we might
investigate which of these attributes provides the most information. If we want to predict who will buy a car magazine,
what would help us more: information about a person's age, or information about their income? It could be that age is
more important. This would mean that we may be able to predict whether or not a person will buy a car magazine based on
knowledge of age alone.



If this is the case, we can split this attribute in two. That is, we must investigate whether there is a certain age
threshold that separates car magazine buyers from non-car magazine buyers. The split-function determines this threshold
value. In this way, we could start with the first attribute, find a certain threshold, go on to the next one, find a
certain threshold, and repeat this process until we have made a correct classification for our customers, thus creating a
decision tree for our database.



There are many algorithms that build such decision trees automatically. They are very efficient, since they have
O(n log n) complexity.
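To make the split-function concrete, here is a minimal Python sketch (my illustration, not Syllogic's code; all names
and data are invented) that finds the threshold on a numeric attribute that best separates a binary target, scoring each
candidate split by weighted Gini impurity. The sort over attribute values is what gives the O(n log n) behaviour; the
sweep over candidate thresholds is linear.

    # Minimal threshold-split sketch (illustrative only).
    # Scores each candidate threshold by the weighted Gini impurity
    # of the two sub-sets it creates, and returns the best one.

    def best_threshold(values, labels):
        """values: numeric attribute; labels: 0/1 target. O(n log n) overall."""
        pairs = sorted(zip(values, labels))        # dominant cost: the sort
        n = len(pairs)
        total_pos = sum(label for _, label in pairs)
        left_pos = 0
        best_t, best_score = None, float("inf")
        for i in range(1, n):                      # single linear sweep
            left_pos += pairs[i - 1][1]
            if pairs[i - 1][0] == pairs[i][0]:
                continue                           # no threshold between ties
            n_left, n_right = i, n - i
            p_left = left_pos / n_left
            p_right = (total_pos - left_pos) / n_right
            gini_left = 2 * p_left * (1 - p_left)
            gini_right = 2 * p_right * (1 - p_right)
            score = (n_left * gini_left + n_right * gini_right) / n
            if score < best_score:
                best_t = (pairs[i - 1][0] + pairs[i][0]) / 2
                best_score = score
        return best_t, best_score

    # Toy data: (age, subscribes to the car magazine?)
    ages = [22, 25, 31, 38, 41, 47, 52, 60]
    buys = [1, 1, 1, 1, 0, 0, 0, 0]
    print(best_threshold(ages, buys))   # (39.5, 0.0): a perfect split

Run on the toy data, it recovers a threshold of 39.5 years, analogous to the 44.5-year split found in Figure 1.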



Figure 1: A simple decision tree for the car magazine
(target column = CAR_MAGAZINE and depth = 2)




Figure 1 shows the result of applying a tree induction algorithm to our data set. We are interested in a description of
the readers of our car magazine: we need to build a decision tree that tells us exactly what type of customers would be
interested in such a magazine. A tree of depth 1 gives the a priori chance that people buy car magazines, in this case
33%. For a tree of depth 2, age appears to be the most decisive attribute. The threshold lies at 44.5 years. Above this
age, only 1% of the people buy a car magazine; below it, 62% have a subscription.



Association rules



Marketing managers are fond of rules such as: 90% of women with red sports cars and small dogs wear Chanel No 5. Such
descriptions give them clear customer profiles from which to target their marketing actions. In data mining, this type of
relationship is called an association rule and there are many techniques for finding association rules.



Suppose we have a database with information on the gender of customers, the colour and type of their car, the type of pets
they have, and a number of products they are likely to buy. The rule that was mentioned above would reflect itself in such
a database in the following way:



In 90% of the records where gender is female, the car is a sports car, the colour of the car is red, and the pet is a
small dog, the perfume is Chanel No 5.



Association rules are always defined on binary attributes, like the ones we used in our sample database to represent
subscriptions to magazines. So we have to flatten the table mentioned above before we can execute an association algorithm.
This is illustrated in Figure 2: on the left the original table is shown, on the right the flattened version of the table
is shown.

Original table:                    Flattened table:

Customer  Area  Age-group          Customer  Area #1  Area #2  Area #3  Area #4  Young  Medium  Old
   1       1    young                 1         1        0        0        0       1      0      0
   2       4    old                   2         0        0        0        1       0      0      1
   3       2    old                   3         0        1        0        0       0      0      1
   4       3    young                 4         0        0        1        0       1      0      0
   5       3    medium                5         0        0        1        0       0      1      0
   6       2    old                   6         0        1        0        0       0      0      1
   7       1    young                 7         1        0        0        0       1      0      0

Figure 2: An example of flattening a table
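
As a concrete illustration of this flattening step (a sketch of my own, not code from the article), the following
Python function turns each categorical column into one binary indicator column per observed value, as in Figure 2:

    # Flatten categorical columns into binary indicator columns,
    # one new column per observed value.

    def flatten(rows, columns):
        """rows: list of dicts; columns: names of the categorical columns to flatten."""
        # Distinct values per column, in a stable order.
        values = {c: sorted({row[c] for row in rows}) for c in columns}
        flat = []
        for row in rows:
            out = {k: v for k, v in row.items() if k not in columns}
            for c in columns:
                for v in values[c]:
                    out[f"{c}={v}"] = int(row[c] == v)   # 1 if this value, else 0
            flat.append(out)
        return flat

    rows = [
        {"Customer": 1, "Area": 1, "Age-group": "young"},
        {"Customer": 2, "Area": 4, "Age-group": "old"},
        {"Customer": 5, "Area": 3, "Age-group": "medium"},
    ]
    for r in flatten(rows, ["Area", "Age-group"]):
        print(r)
    # e.g. {'Customer': 1, 'Area=1': 1, 'Area=3': 0, 'Area=4': 0,
    #       'Age-group=medium': 0, 'Age-group=old': 0, 'Age-group=young': 1}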



It is not very difficult to develop algorithms that will find this association in a large database. The problem, however,
is that such an algorithm will also find a lot of other associations that are of very little value. There are not many
women with red sports cars and small pets, so this is a very small sub-set of our customers. We will find such a small
sub-set only when we have a large database of clients at our disposal. Yet, the number of possible association rules that
we can find in such a database is almost infinite.



The problem with association rules is that one is bound to find so many associations that it will be very difficult to
separate valuable information from mere noise. It is therefore necessary to introduce some measures to distinguish
interesting associations from non-interesting ones. We will represent an association rule in the following format:



attribute_i (, attribute_j, ...) => target-attribute (confidence, support)



For example:



MUSIC_MAG, HOUSE_MAG => CAR_MAG (97%,9%)



Interesting associations are those with many examples in the database. We call this the support of an association rule. In
our case the support of the rule is: the percentage of records for which MUSIC_MAG, HOUSE_MAG and CAR_MAG all hold; that
is, all the people that read all three magazines.



Support in itself is not enough however. A considerable group of people may read all three magazines, but a much larger
group may read MUSIC_MAG and HOUSE_MAG, but not CAR_MAG. In this case the association is weak, although the support might
be relatively high. We need an additional measure: that is, confidence. In our case the confidence is the percentage of
records for which CAR_MAG holds, within the group of records for which MUSIC_MAG and HOUSE_MAG hold.
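
To pin down these two measures, here is a short Python sketch (my own illustration, not the authors' code) that
computes confidence and support for a rule such as MUSIC_MAG, HOUSE_MAG => CAR_MAG over a table of binary records:

    # Support and confidence of an association rule over binary records.
    # support    = fraction of all records where antecedent AND consequent hold
    # confidence = fraction of antecedent records where the consequent also holds

    def rule_stats(records, antecedent, consequent):
        matching_ante = [r for r in records if all(r[a] for a in antecedent)]
        matching_both = [r for r in matching_ante if r[consequent]]
        support = len(matching_both) / len(records)
        confidence = len(matching_both) / len(matching_ante) if matching_ante else 0.0
        return confidence, support

    # Toy subscription table with binary magazine columns.
    records = [
        {"MUSIC_MAG": 1, "HOUSE_MAG": 1, "CAR_MAG": 1},
        {"MUSIC_MAG": 1, "HOUSE_MAG": 1, "CAR_MAG": 1},
        {"MUSIC_MAG": 1, "HOUSE_MAG": 1, "CAR_MAG": 0},
        {"MUSIC_MAG": 0, "HOUSE_MAG": 1, "CAR_MAG": 0},
        {"MUSIC_MAG": 1, "HOUSE_MAG": 0, "CAR_MAG": 1},
    ]
    conf, supp = rule_stats(records, ["MUSIC_MAG", "HOUSE_MAG"], "CAR_MAG")
    print(f"MUSIC_MAG, HOUSE_MAG => CAR_MAG ({conf:.0%}, {supp:.0%})")  # (67%, 40%)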



In our example, with five attributes for the five magazines, there are 25 possible unary association rules, some of which
are trivial. This number grows rapidly if we allow multiple attribute associations. As in the case of decision trees, it
will be better to use an environment that enables us to zoom in on interesting sets of association rules interactively
using zoom scan algorithms.




Target: Car-magazine    Minimal confidence: 30%    Minimal support: 3%    A priori: 30%

Association          (confidence, support)
Sports-magazine      (36%, 45%)
Music-magazine       (96%, 15%)
Comic-magazine       (57%, 8%)

Figure 3: Binary associations for the car magazine



Figure 3 illustrates such an environment. We have selected CAR_MAG as our target attribute; that is, we are interested in
readers of the car magazine. The confidence and support levels are set at 33% and 3%. This means that we will not be
interested in sub-groups smaller than 3% of the database and that within these sub-groups we want to find associations
that hold for at least 33% of the records.



In the first stage of our investigation all the relevant attributes are analysed. Apparently with these confidence and
support levels the algorithm finds no association for the house magazine.



The association between music magazines and car magazines is the most interesting, since it has a high confidence level
(96%) with a fairly high support (15%).



The second part of this article, which discusses the implementation of decision trees and association rules using bitmaps,
will be published in the next issue.






About the authors:

WOUTER SENF is technical leader of the decision support group at Tandem's High Performance Research Centre where he is closely involved in large data warehousing projects in Europe and with new software developments at Tandem. Prior to joining the HPRC, he worked for 6 years as a consultant at Tandem Netherlands BV where he specialised in the areas of large databases and performance.



PIETER ADRIAANS has been active in research in the areas of artificial intelligence and relational database systems since 1984. He is a director at Syllogic, where he is responsible for the development of tools for the management of client-server systems and databases specialising in the integration of artificial intelligence techniques, machine learning, object orientation, and management systems.




>~~~Publications:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 06 Aug 1996 13:15:48 -0400
From: Gregory Piatetsky-Shapiro (gps0@gte.com)
Subject: David Aubrey 'Data Mining for Dollars'
I found this article using the Ziff-Davis Personal View service;
I think it was published in Computer Shopper, Aug 1996.

Mining for Dollars


Data mining's cutting-edge technologies revamp your databases--and bring new profit to your company


David Aubrey


While the practice of corporate data warehousing and data mining is receiving a lot of hype, there's considerable confusion about what it is and who should be using it. The vague, even proprietary terminology for describing this sophisticated approach to storing and retrieving database information hides the serious technology that underlies this practice. This leads to questions such as, "Isn't data warehousing just a glorified term for database management?" To give you the real answer, we'll first define data warehousing and data mining, then explore the advantages these methods bring to corporations.


In the most simplistic definition, data warehousing and data mining are complementary methods of applying various technologies to use the information stored in a company's existing database more effectively. In a broad sense, these technologies are the most dramatic utilization of a PC's raw computational power.


But there is controversy surrounding the effectiveness of data warehousing and data mining. Skeptics say that data warehousing is actually an expensive step backward, disguised as a step forward. They describe it as stuffing all the data contained in a company's many small databases into one large database, which is then managed by a trained staff and accessed by users through a friendly front end. The detractors see this as revisiting the age of mainframes and dumb terminals. They also say the sheer volume of database records being warehoused would inevitably drag down productivity; lost files and records would take up valuable storage space while important data would remain largely unused.


This is a glib view, and on closer examination, it's also inaccurate. Data warehousing is much more than simply dumping piles of files into a single database. Properly executed, it has the potential to be the most sophisticated example of client/server software architecture to date.



Clearing the Air


Let's start with an accurate definition of a data warehouse. A data-warehouse environment is a subject-oriented data collection used mainly to aid in organization-wide decision making. This is why many IS managers have adopted the term "decision support" in place of data warehousing, since a decision-support construct can be viewed as a specialized database that is maintained separately from your organization's operational databases.


But why isn't the traditional database technology, used for operational databases, sufficient? Well, that's the nub of the warehousing question and the reason the warehousing industry is spewing out new buzzwords at a ridiculous--and even careless--rate. Regular, or operational, databases weren't designed with data-warehousing applications in mind. Their primary function is to serve as a data-processing center for business support. All data stored in an operational database is based on the software process required to input and process that data. The database's innards are highly structured and repetitive, and accessing the data requires complicated data-entry procedures and batch-processing or online-transaction-processing (OLTP) queries.


This is fine for storing a fairly large number of records, with the ability to retrieve them for future use. What sets a data-warehousing operation apart is that it's designed to support not just the storage and retrieval of records, but also the actual manipulation and cross-fertilization of that data to help users make more-informed decisions. To do this, a decision-support system has to be built using an entirely different foundation than an operational database.



Alternative Architectures


A large number of the buzzwords claim to be the backbone of a data warehouse, but most simply describe tools used to get the most out of a decision-support system. On an architectural level, there are two primary technologies vying for attention: multidimensional database (MDD) technology and relational online analytical processing (ROLAP).


Where standard operational databases, often based on OLTP, store records in a two-dimensional architecture generally called tables, data warehousing involves much more. For example, an MDD-based construct arranges its records in an N-dimensional "cube." Essentially, the database performs a large number of precalculations for all the multidimensional views of its cube and stores them as part of the cube for later use, when users access or cross-reference the data. That way, when a user calls for data in one of these multidimensional views, it's retrieved much faster than from a two-dimensional system where the database would need to spend a considerable amount of time scanning its relational tables.
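
The precalculation idea can be sketched in a few lines of Python (a toy of my own devising; production MDD engines are
of course far more elaborate): aggregate the measure over every combination of dimensions up front, so that any
multidimensional view becomes a dictionary lookup instead of a table scan.

    # Precompute aggregates for every subset of dimensions (a crude "cube").
    from itertools import combinations

    def build_cube(rows, dimensions, measure):
        cube = {}
        for k in range(len(dimensions) + 1):
            for dims in combinations(dimensions, k):
                for row in rows:
                    key = (dims, tuple(row[d] for d in dims))
                    cube[key] = cube.get(key, 0) + row[measure]
        return cube

    sales = [
        {"region": "East", "product": "mag", "month": "Jan", "amount": 100},
        {"region": "West", "product": "mag", "month": "Jan", "amount": 80},
        {"region": "East", "product": "book", "month": "Feb", "amount": 50},
    ]
    cube = build_cube(sales, ["region", "product", "month"], "amount")
    # Any view is now a lookup, not a scan:
    print(cube[(("region",), ("East",))])                 # 150
    print(cube[(("region", "month"), ("East", "Jan"))])   # 100
    print(cube[((), ())])                                 # 230, grand total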


The ROLAP design, however, blends powerful querying tools with third-party optimization software, which is used in conjunction with your existing relational database-management systems. This creates a multilevel architecture that lets the ROLAP client see multidimensional views while keeping the database calculation engine, the "metadata"--the predefined elements of your data warehouse--and all the security code on the ROLAP server.


The object behind a data warehouse is to link all your company's data to a single, user-accessible front end. Naturally, there are a variety of ways to do this, but we will discuss the three main methods here: using your existing operational databases with third-party products, setting up a virtual data warehouse, and using a discrete data warehouse.


If your corporation's MIS equipment is built around a standard, well-supported system architecture and comes from one company, such as DEC, Hewlett-Packard, or IBM, then it's possible to create a fairly inexpensive solution using your existing operational databases. Via customized database query engines from vendors such as Oracle, you can create a decision-support environment that doesn't require a separate metadata repository. While this approach is inexpensive (since you'll need neither a new database nor data-duplication methods), you'll almost certainly encounter serious performance and flexibility problems when trying to run evolving decision-support queries.


This hybrid solution simply enhances your existing databases by adding data tables dedicated solely to decision support within the operational database. Operational data must then be separated from historical data and organized by subject in those special tables. This will greatly reduce locking conflicts within OLTP applications. Again, while this is certainly a feasible solution, it's best left for a departmental data warehouse, since an enterprise-wide solution would stress a single OLTP environment far too much.


Then there's the virtual data warehouse. This is an alternative for companies that want to expand the previous single-OLTP-based concept over several distributed databases. The virtual data warehouse depends on a piece of software nebulously known as middleware. This ever-changing category of software basically bridges end users' querying tools to the physical databases. In this situation, the user simultaneously accesses multiple databases on multiple systems, but to the user, it seems as if everything is functioning as a single data warehouse. Again, however, the potential drawback here is that the operational databases aren't optimized for decision-support querying. The lack of standardization among different database platforms is another obstacle.


Finally, the only true data-warehouse implementation is called the discrete data warehouse. This system is composed of a separate, discrete database dedicated to decision-support querying and traffic activity. It's populated only with data consistent with true data-warehousing criteria, which is discussed below. While this is certainly the most rewarding approach for an enterprise-wide decision-support platform, it's also the most involved to construct. Typically, building a discrete warehouse requires the involvement of all your users. They must identify the data they require in their daily business activities, help design an appropriate data model, and help create the extraction and cleansing routines.



The Way to Go


Selecting the appropriate data-warehouse architecture cannot be based on static performance measurements. You've got to compare performance characteristics with your business process. The real key here is understanding how your business works and what it requires of your data-storage engine to facilitate decision making. That leaves a major gray area wide-open. But within the chaos, you can cling to a few rules when designing an engine.


At a basic level, data warehouses share four fundamental characteristics: They are subject-oriented, integrated, time-variant, and nonvolatile. Regular operational databases, such as order processing and manufacturing, are organized around a single, static business application. This causes companies to store identical information in multiple locations, resulting not only in wasted time and storage space, but also in inaccurately updated information. By contrast, a data warehouse is subject-oriented, meaning it is organized around specific subjects. Subject organization presents the data in a format easier for end users to understand and manipulate more creatively.


Data-warehousing systems are also integrated. Data integration is perhaps one of the trickiest facets of the operation. It is accomplished by dictating complete consistency in how data is formatted, named, stored, manipulated, and more. The political ramifications alone are daunting to any IS manager facing a large and varied corporate milieu. But if you surmount these problems, your data-warehouse information will always be maintained and accessed in a consistent way--which is critical to its success.


Since data warehouses hold and maintain historical and current data, they are considered to be time-variant. Operational databases, however, hold only the most up-to-date data. On a historical scale, data warehouses contain data gleaned from a company's operational databases on a daily, weekly, or even monthly basis, which is then maintained for one to five years. This illustrates one of the major differences between the two database technologies: Unlike static data-processing environments, where only the latest record matters, historical information can be of high importance to corporate decision makers. It can be used to better understand business trends and relationships--a virtually impossible task for an operational database.


Finally, data warehouses are nonvolatile, which means that after the informational data is loaded into the warehouse, changes, inserts, or deletes are performed only rarely. Data loaded into the warehouse is actually "transformed" data that stems from the operational databases. The data warehouse reloads that data on a periodic basis and updates itself with transformed data from the operational databases. Apart from this loading process, the information contained in the data warehouse generally remains static. Nonvolatility lets a data warehouse be heavily optimized for query processing.



Digging in the Mine


Building and implementing a decision-support system is only the beginning. Holding countless interdepartmental meetings to discuss consistent data-storage criteria; choosing a multidimensional database engine; creating customized interdepartmental interfaces for that engine; configuring custom or third-party middleware packages to let that database engine communicate with your existing relational databases; and scheduling file transfers, updates, and backups simply get you off to a good start.


Once your system is all in place, it's got to actually do something. Now you must use an ever-growing list of software analysis tools to arrive at a synergy called data mining.


Data mining is designed to reach as deeply as possible into a data store, and the mining tools are designed to find patterns and infer rules from it. You can use those results to answer users' in-depth questions and perform forecasts. These tools can also help speed analysis by focusing users' attention on the most germane variables.


Generally, data mining can access five common information types: associations, classifications, clusters, forecasting, and sequences. Associations arise when occurrences are connected by a single event. Recognizing patterns that describe the group to which an item belongs is the basis of classification, probably the most common data-mining activity: a classifier examines a set of items that have already been classified and infers a set of rules from them. Clustering is related to classification, but there are no predefined groups. Through clustering, a data-mining tool can segment warehouse data into distinct groups, again using those groups to make predictions and comparisons.



Tools of the Trade


While these information types are the primary results of a data-mining operation, a variety of tools can access those information types. Which ones you choose depends on your performance requirements, database platforms, data types, and overall business scenario.


One of the most popular tools right now is a neural network. (See the March 1996 article "Brain Waves," p. 566.) A neural net is a software-based operation divided into a fairly large number of virtual nodes, each with its own inputs, outputs, and processing power. Underneath the initial input and output layers, a programmer can code a number of hidden processing layers. By comparing its output with a known outcome described by the net programmer, a neural net can adjust its processes to better adapt to its mission. But this is cutting-edge software development best left to the experts.


A decision tree is a different animal. Decision trees divide your warehouse data into groups, based on set values of their variables. They do this basically by creating a large hierarchy of if-then questions that serve to classify the data. While decision trees aren't as complex as a neural network, they're nevertheless sparking some interest. Their relative simplicity lets them be faster than neural nets on average, and they can be customized to specific business needs.


But they're not idiot-proof and can't work with all types of data. For instance, some decision trees have trouble dealing with data sets containing continuous data such as sales-over-time figures. They require that these be grouped into ranges before the decision tree can process them. So, yes, a decision tree can be much easier to implement, use, and understand than a neural net, but depending on the business circumstance, an if-then statement can get hairy.
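
For a sense of what that hierarchy of if-then questions looks like in practice, here is a toy Python sketch (the
thresholds, field names, and class labels are all invented for illustration) that first bins a continuous sales figure
into ranges and then classifies a customer:

    # A decision tree is ultimately a hierarchy of if-then tests.
    # Continuous inputs (e.g., sales over time) are first binned into ranges.

    def bin_sales(total_sales):
        """Group a continuous sales figure into a discrete range label."""
        if total_sales < 10_000:
            return "low"
        elif total_sales < 100_000:
            return "medium"
        return "high"

    def classify(customer):
        """Toy tree: thresholds are invented for illustration."""
        if bin_sales(customer["annual_sales"]) == "high":
            if customer["orders_per_year"] > 12:
                return "key account"
            return "growth account"
        if customer["years_active"] > 5:
            return "loyal small account"
        return "standard account"

    print(classify({"annual_sales": 250_000, "orders_per_year": 20,
                    "years_active": 3}))   # key account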



The Hardware Set


By now, it should be clear that data warehousing and mining comprise some of the most advanced software techniques available. As you might imagine, this kind of code won't run on just any hardware platform. In fact, warehousing and decision support are so demanding that they've rejuvenated the ultra-high-end server market.


Modular systems, or scaleable systems in the new lingo, are back in. With a scaleable system, once you buy the base platform, you can easily upgrade these machines via proprietary plug-in hardware modules, including processor cards, memory modules, and hard drives. These systems are also designed to let advanced client/server environments run with a minimal number of skilled or trained professional support people.


Data warehousing has also put symmetric multiprocessing (SMP) servers and even massively parallel processors (MPPs) back on the map. Basically, these machines have between two and four processors that use special operating systems such as Windows NT, OS/2 Warp, and NetWare SMP to divide up processing loads among themselves in the most efficient manner possible.


MPP systems extend the SMP paradigm, both in terms of the number of simultaneous processors and the degree to which they can communicate and share with one another. These systems are also usually based around high-end individual processor chips such as those found in the IBM RS/6000 SP, ICL Goldrush MegaServer, NCR 5100M, or Unisys OPUS.


Unfortunately, while SMP systems are handled by the operating system, so a database can run on them without special optimization, MPP systems require special versions of database software. Right now, IBM, Oracle, and Sybase are some of the vendors supporting this technology, but not every vendor is on board.


Data warehousing and mining will affect all aspects of the corporate IS department--from software considerations and hardware requirements to interpersonal issues between IS and the other business departments. Everything is touched by its implementation. Have no doubt: Building one of these systems is intense. But you can use the following tips to avoid problems: Have users pick only a select few data types on which they require decision support. Start small and perfect your system before attempting to grow. Remember that different departments will want different data types initially, and getting them all to agree on just a few can be a major hassle.


Overall design will depend on the most frequently accessed data elements and the most commonly required dimensional query (be it time, geography, or whatever). If any information has to be aggregated or summarized, it will have to be identified right away. Also at this stage, the metadata (criteria, rules, and so on) relating to the data you've selected to be contained in the warehouse must be defined; these elements will assist your users in understanding the warehouse.


Finally, there's warehouse maintenance. The procedures required to maintain your warehouse cannot be implemented on an as-needed basis. Change is inevitable, so the challenge for database managers is to come up with an effective maintenance plan right away that leaves enough room for flexible alterations to the system as time and technology march on.


Warehousing data and subsequently mining it for cutting-edge information are among the highest forms of corporate computing. This is what managers have wanted since computers first appeared in the business arena. A warehousing system is still expensive to set up and not always as easy to use as you might like. Remember, this is a complex undertaking, directed at large companies that require only the most-sophisticated technology, can staff the most-qualified personnel, and have the largest budgets. You'd be ill-advised to look at data warehousing and mining any other way. But the overall benefit is as rich as the collected data they hold.






>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 6 Aug 1996 16:55:32 -0400
From: Gregory Piatetsky-Shapiro (gps0@eureka)
To: kdd
Subject: [raghavan@cacs.usl.edu: help]

------- Start of forwarded message -------
Date: Thu, 1 Aug 1996 15:37:46 -0500
From: 'Dr. Raghavan' (raghavan@cacs.usl.edu)
Subject: help

Dear Dr. GPS,
I am hoping this reaches you before you leave for the KDD conference in
Portland. As you know, I will be co-guest editing a special issue on
data mining for the Journal of the American Society for Information Science.
I would like your help in making an announcement about it at the
conference (since I am unable to attend).

The following URL has the complete call for papers:

http://www.usl.edu/~raghavan/JASIS97.html

If some of the attendees would be willing to serve as referees, they could
get in touch with me by e-mail at raghavan@cacs.usl.edu.

With regards,
Vijay Raghavan


>~~~Siftware:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: 'jpbrown' (jpbrown@hal-pc.org)
Organization: Ultimate Resources
Date: Tue, 6 Aug 1996 22:41:50 -0006
Subject: Complexity and Predictions

Predictions, while resolving the complexity of a database, can
improve the coefficients of determination.

A new link on my website, http://www.hal-pc.org/~jpbrown/hmpg16.html,
shows the results of an iterative application of artificial neural
nets for prediction. The sub-sets produce a classification which can
provide causal explanations for the apparent complexity of the
original database.


>~~~Siftware:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: 'Norman F. Smith' (nfsmith@artnet.net)
Subject: A Related Website
Date: Thu, 8 Aug 1996 14:24:17 -0700

Check out our website at http://www.jurikres.com.

We provide data preprocessing software to neural network users.

Our site also contains an amusing page concerned with 'snake oil' in the financial markets.

Norman Smith
Jurik Research Software


>~~~Positions:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 6 Aug 1996 14:50:43 -0700
From: Kamal Ali (ali@almaden.ibm.com)
To: kdd@gte.com
Subject: Data Mining positions at IBM San Jose

IBM DATA MINING POSITIONS

Join the team! We are an entrepreneurial organization within IBM
developing Data Mining solutions. We have three groups: a consulting
services group, a research & development group, and a software application
development group. Currently, we are looking for qualified
individuals in the rapidly expanding consulting services group.

DATA MINING ANALYSTS/CONSULTANTS
Analysts will be responsible for performing consulting engagements in
any of the following areas: finance, insurance, retail,
telecommunications, media, and health care. There will also be
opportunities for teaching business data-mining classes and a few
opportunities for applied research on the kinds of problems that
arise from our data-mining engagements. Familiarity with databases,
statistics, data preparation, and high-end data mining techniques and
tools is required. Familiarity with SAS and previous experience in
applying data mining in a commercial context are big
pluses. Applicants must have advanced degrees in CS, Statistics, or
Mathematics - PhD preferred. Applicants should have good communication
skills, enjoy working with people in a team environment, be willing to
travel, and be application-oriented. These positions provide excellent
customer contact with high-level executives in FORTUNE 500 companies.

These positions will be located at our world-class research lab -
Almaden Research Center - in sunny San Jose, CA. Almaden is located
among beautiful rolling hills in Silicon Valley, affording
close contact with top universities such as Stanford University and
UC Berkeley.

For further information, please email your resume to me
(ali@almaden.ibm.com), preferably in ASCII or PostScript format.
I've been working as an Analyst in IBM's data mining group since December
and it's been a great experience. Feel free to contact me with questions at
(408) 927-1354.

Also check out our web pages, which give some detailed examples of how
we've used our tools to build and visualize models and give
information on previous engagements we have had.
http://www.almaden.ibm.com/stss (click on 'Data Mining')

==============================================================================
Kamal Mahmood Ali, Ph.D.               Phone: 408 927 1354
Consultant and data mining analyst,    Fax: 408 927 3025
Data Mining Solutions,                 Office: ARC D3-250
IBM                                    http://www.almaden.ibm.com/stss/
==============================================================================

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~