This lesson is a brief introduction to the field of Data
Mining (which is also sometimes called Knowledge Discovery). It is adapted
from Module 1: Introduction, Machine Learning and Data Mining Course.
1.1 Data Flood
Current technological trends inexorably lead to a data flood.
More data is generated from banking, telecom, and other
More data is generated from scientific experiments in
astronomy, space explorations, biology, high-energy physics, etc.
More data is created on the web, especially in text, image,
and other multimedia formats.
For example, Europe's Very Long Baseline Interferometry
(VLBI) has 16 telescopes, each of which produces 1 Gigabit/second (yes,
per second!) of astronomical data over a 25-day observation session.
This truly generates an "astronomical" amount of data.
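To get a feel for the scale, the figures above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes every telescope records continuously for the whole session and uses decimal units (1 Gigabit = 10^9 bits, 1 petabyte = 10^15 bytes), neither of which is stated in the lesson. The function name is made up for illustration.

```python
# Rough estimate of the VLBI data volume quoted above.
# Assumptions (not from the lesson): continuous recording for the whole
# session, and decimal units (1 Gigabit = 10**9 bits, 1 PB = 10**15 bytes).
TELESCOPES = 16
BITS_PER_SECOND = 10**9            # 1 Gigabit/second per telescope
SESSION_DAYS = 25
SECONDS_PER_DAY = 24 * 60 * 60


def vlbi_session_bytes(telescopes=TELESCOPES,
                       bits_per_second=BITS_PER_SECOND,
                       days=SESSION_DAYS):
    """Total bytes produced by all telescopes over one observation session."""
    total_bits = telescopes * bits_per_second * days * SECONDS_PER_DAY
    return total_bits // 8


total_bytes = vlbi_session_bytes()
print(f"{total_bytes / 10**15:.2f} petabytes per session")  # 4.32 petabytes per session
```

Under these assumptions one session comes to a few petabytes, which is far more than the largest databases of the time listed in this section.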
AT&T handles so many calls per day that it cannot store
all of the data - and data analysis has to be done "on the fly".
A UC Berkeley analysis by Profs. Peter Lyman and Hal R. Varian
estimated that 5 exabytes (5 million terabytes) of new data were created in
2002. Twice as much information was created in 2002 as in 1999 (~30% growth
rate). The US produces ~40% of new stored data worldwide.
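The link between "twice as much in three years" and a yearly growth rate can be checked with a short calculation. The sketch below assumes steady compound growth between 1999 and 2002, which is a simplification; the function name is invented for this example.

```python
def annual_growth_rate(growth_factor, years):
    """Constant yearly growth rate that yields `growth_factor` after `years`."""
    return growth_factor ** (1.0 / years) - 1.0


# Doubling between 1999 and 2002, i.e. over three years.
rate = annual_growth_rate(2.0, 3)
print(f"{rate:.1%} per year")  # 26.0% per year
```

Under that assumption, doubling in three years corresponds to about 26% per year, which is roughly in line with the rounded figure quoted above.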
As of 2003, according to the Winter Corp. Survey, France Telecom
had the largest decision-support DB, ~30 TB (terabytes); AT&T was in second
place with a 26 TB database.
Some of the largest databases on the Web, as of 2003, include:
- Alexa (www.alexa.com) internet archive: 7 years of data, 500 TB
- Internet Archive (www.archive.org): ~300 TB
- Google: over 4 billion pages, many, many TB
The amount of data grows very fast,
and very little of it will ever be looked at by a human.
Knowledge Discovery is NEEDED
to make sense and use of data.
1.2 Data Mining Application Examples
The areas where data mining has been applied recently include:
- drug discovery, ...
- customer modeling and CRM (Customer Relationship Management)
- fraud detection
- health care, ...
- telecom (telephone and communications),
- targeted marketing,
- search engines, bots, ...
- anti-terrorism efforts (we will discuss controversy over
- law enforcement,
- profiling tax cheaters
One of the most important and widespread business
applications of data mining is Customer Modeling, also called Predictive
Analytics. This includes tasks such as
- predicting attrition or churn, i.e. finding which customers
are likely to terminate service
- targeted marketing:
  - customer acquisition - find which prospects are likely
  to become customers
  - cross-sell - for a given customer and product, find which
  other product(s) they are likely to buy
- credit-risk - identify the risk that this customer will
not pay back the loan or credit card
- fraud detection - is this transaction fraudulent?
The largest users of Customer Analytics are industries such
as banking, telecom, and retail, where businesses with large numbers of
customers are making extensive use of these technologies.
1.2.1 Customer Attrition: Case Study
Let's consider a case study of a mobile phone company. The typical
attrition (also called churn) rate for mobile phone customers is around
25-30% a year!
The task is
- Given customer information for the past N (N can range
from 2 to 18 months), predict who is likely to attrite in next month or
- Also, estimate the customer's value and which cost-effective
offer should be made to this customer.
Verizon Wireless is the largest
wireless service provider in the United States with a customer base of 34.6
million subscribers as of 2003 (see http://www.kdnuggets.com/news/2003/n19/22i.html).
Verizon built a customer data warehouse that
- Identified potential attriters
- Developed multiple, regional models
- Targeted customers with high propensity to accept the
- Reduced attrition rate from over 2%/month to under
1.5%/month (huge impact over 34 million subscribers)
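The size of that impact can be estimated with simple arithmetic. The sketch below treats "over 2%" and "under 1.5%" as exactly 2% and 1.5%, and assumes a constant subscriber base and a constant monthly rate when annualizing; these are illustrative assumptions, not figures reported by Verizon.

```python
SUBSCRIBERS = 34_600_000
RATE_BEFORE = 0.02     # "over 2%/month", taken as exactly 2% for illustration
RATE_AFTER = 0.015     # "under 1.5%/month", taken as exactly 1.5%


def monthly_attriters(subscribers, monthly_rate):
    """Expected number of customers lost in one month."""
    return subscribers * monthly_rate


def annual_rate(monthly_rate):
    """Yearly attrition implied by a constant monthly rate (compounded)."""
    return 1.0 - (1.0 - monthly_rate) ** 12


retained = (monthly_attriters(SUBSCRIBERS, RATE_BEFORE)
            - monthly_attriters(SUBSCRIBERS, RATE_AFTER))
print(f"Additional customers retained per month: about {retained:,.0f}")
print(f"Annualized attrition: {annual_rate(RATE_BEFORE):.1%} "
      f"-> {annual_rate(RATE_AFTER):.1%}")
```

Half a percentage point per month on this customer base is roughly 173,000 customers retained every month.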
1.2.2 Assessing Credit Risk: Case Study
Let's consider a situation where a person applies for a loan.
Should a bank approve the loan?
Note: People who have the best
credit don't need the loans, and people with the worst credit are not likely to
repay. The bank's best customers are in the middle.
Banks develop credit models using
a variety of machine learning methods.
Mortgage and credit card
proliferation is the result of being able to successfully predict whether a person
is likely to default on a loan. Credit risk assessment is universally used in
the US and widely deployed in most developed countries.
1.2.3 Successful e-commerce - Case Study
Amazon.com is the largest on-line retailer, which started
with books and expanded into music, electronics, and other products.
Amazon.com has an active data mining group, which focuses on personalization.
Why personalization? Consider a person who buys a book (product) at Amazon.com.
Task: Recommend other books (and perhaps products) this
person is likely to buy.
Amazon's initial and quite successful effort used clustering
based on books bought.
For example, customers who bought "Advances in Knowledge
Discovery and Data Mining", by Fayyad, Piatetsky-Shapiro, Smyth, and
Uthurusamy, also bought "Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations", by Witten and Frank.
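The "customers who bought X also bought Y" idea can be illustrated with simple co-purchase counting. Note that this is a different and much simpler technique than the clustering mentioned above, and it is not a description of Amazon's actual system; the purchase histories and function names below are made up for the example.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how many customers bought each unordered pair of items."""
    pair_counts = Counter()
    for items in baskets:
        # sorted(set(...)) removes repeats and gives each pair a fixed order
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def also_bought(item, baskets, top_n=3):
    """Items most often bought by the customers who also bought `item`."""
    scores = Counter()
    for (a, b), n in co_purchase_counts(baskets).items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]


# Hypothetical purchase histories, one list per customer.
baskets = [
    ["Advances in KDD", "Data Mining (Witten and Frank)"],
    ["Advances in KDD", "Data Mining (Witten and Frank)", "Cookbook"],
    ["Cookbook", "Gardening Guide"],
]
print(also_bought("Advances in KDD", baskets))
# ['Data Mining (Witten and Frank)', 'Cookbook']
```

A production recommender would also have to deal with very popular items, sparse histories, and scale, none of which this sketch attempts.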
The recommendation program is quite successful, and more advanced
programs are being developed.
1.2.4 Unsuccessful e-commerce - Case Study (KDD Cup 2000)
Of course application of data mining is no guarantee of
success and during the Internet bubble of 1999-2000, we have seen plenty of
Consider the legwear and legcare e-tailer Gazelle.com, whose
clickstream and purchase data were the subject of the KDD Cup 2000 competition.
One of the questions was: Characterize visitors who spend
more than $12 on an average order at the site.
The data included a dataset of 3,465 purchases and 1,831 customers.
Very interesting and illuminating analysis was done by dozens of
Cup participants. The total time spent was thousands of hours, which would
have been equivalent to millions of dollars in consulting fees.
However, the total sales of Gazelle.com were only a few
thousand dollars, and no amount of data mining could help them. Not
surprisingly, Gazelle.com went out of business in Aug 2000.
1.2.5 Genomic Microarrays - Case Study
DNA Microarrays are a revolutionary new technology that
allows measurement of gene expression levels for many thousands of genes simultaneously
(more about Microarrays later). Microarrays have recently become a popular
application area for data mining (see, for example, SIGKDD Explorations
Special Issue on Microarray Data Mining, Dec 2003 (Vol. 5, Issue 2)).
One of the typical problems is, given microarray data for a
number of patients (samples), can we
- Accurately diagnose the disease?
- Predict outcome for given treatment?
- Recommend best treatment?
Consider a Leukemia data set [Go99], with 72 samples and about
7,000 genes. The samples belong to two classes, Acute Lymphoblastic Leukemia (ALL) and
Acute Myeloid Leukemia (AML), which look similar under a microscope but have very
different gene expression levels.
The best diagnostic model [PKR03] was learned on the
training set (38 samples) and applied to a test set (remaining 34 samples).
The results were: 33 of the 34 test samples were diagnosed correctly (97% accuracy).
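The quoted accuracy follows directly from the sample counts. The small sketch below only reproduces that arithmetic from the numbers given in the text; the helper function is introduced here for illustration.

```python
TOTAL_SAMPLES = 72
TRAINING_SAMPLES = 38
CORRECT_ON_TEST = 33


def accuracy(correct, total):
    """Fraction of samples classified correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


test_samples = TOTAL_SAMPLES - TRAINING_SAMPLES   # 34 held-out samples
print(f"Test accuracy: {accuracy(CORRECT_ON_TEST, test_samples):.1%}")
# Test accuracy: 97.1%
```

Measuring accuracy on samples that were not used for training is what makes the figure a fair estimate of how the model would do on new patients.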
Interestingly, the one misclassified sample was consistently misclassified by
most algorithms and is suspected to have been mislabeled by the pathologist. So this may
be one example where a computer-generated diagnosis is more accurate than the
1.2.6 Data Mining, Security and Fraud Detection
There are currently numerous applications of data mining for
security and fraud detection. One of the most common is Credit Card Fraud
Detection. Almost all credit card purchases are scanned by special algorithms
that identify suspicious transactions for further action. I recently
received such a call from my bank when I used a credit card to pay for a
journal published in England. This was an unusual transaction for me (first
purchase in the UK on this card) and the software flagged it.
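The kind of rule that might flag such a purchase can be sketched as a toy novelty check: a transaction is marked as unusual when it comes from a country never seen before on that card. This is an illustrative assumption only; real fraud detection systems combine many more signals, and the class and method names below are hypothetical.

```python
from collections import defaultdict


class NoveltyFlagger:
    """Flags a purchase when its country is new for that card (toy rule)."""

    def __init__(self):
        self._seen = defaultdict(set)   # card id -> countries seen so far

    def check(self, card_id, country):
        """Return True if the purchase looks unusual, then record it."""
        history = self._seen[card_id]
        # A card with no history yet is not flagged on its first purchase.
        unusual = bool(history) and country not in history
        history.add(country)
        return unusual


flagger = NoveltyFlagger()
print(flagger.check("card-1", "US"))  # False (first purchase on the card)
print(flagger.check("card-1", "US"))  # False (familiar country)
print(flagger.check("card-1", "UK"))  # True  (first purchase in the UK)
```

Because the purchase is recorded after the check, a second UK purchase on the same card would no longer be flagged.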
Other applications include detection of money laundering - a
notable system, called FAIS, was developed by Ted Senator for the US Treasury.
The National Association of Securities Dealers (NASD), which runs
NASDAQ, has developed a system called Sonar that uses data mining for monitoring
insider trading and fraud through misrepresentation (http://www.kdnuggets.com/news/2003/n18/13i.html).
Many telecom companies, including AT&T, Bell Atlantic, and British
Telecom/MCI, have developed systems for catching phone fraud.
Data mining and security were also very much in the headlines
in 2003 with US Government efforts to use data mining for terrorism
detection, as part of the ill-named and now closed Total Information Awareness
Program (TIA). However, the problem of terrorism is unlikely to go away soon,
and government efforts are continuing as part of other programs, such as CAPPS
II or MATRIX.
Less controversial is the use of data mining for bio-terrorism
detection, as was done at the Salt Lake Olympics in 2002 (the only thing that was
found was a small outbreak of tropical diseases). The system used there did a
very interesting analysis of unusual events - we will return to this topic
later in this course.
1.2.7 Problems Suitable for Data Mining
The previous case studies show some of the successful (and
unsuccessful) applications of data mining.
The areas where data mining applications are likely to be
successful have these characteristics:
- require knowledge-based decisions
- have a changing environment
- have sub-optimal current methods
- have accessible, sufficient, and relevant data
- provide high payoff for the right decisions
Also, if the problem involves people, then proper
consideration should be given to privacy -- otherwise, as the TIA example shows,
the result will be a failure, regardless of technical issues.
1.3 Knowledge Discovery
We define Knowledge Discovery in Data (KDD) as the non-trivial
process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining,
Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press
Knowledge Discovery is an interdisciplinary field, which
builds upon a foundation provided by databases and statistics and applies
methods from machine learning and visualization in order to find the useful
patterns. Other related fields also include information retrieval, artificial
intelligence, OLAP, etc.
Some people say that data mining is essentially a fancy name
for statistics. It is true that data mining has much in common with Statistics
and with Machine Learning. However, there are differences.
Statistics provides a solid theory for dealing with randomness
and tools for testing hypotheses. It does not study topics such as data
preprocessing or results visualization, which are part of data mining.
Machine learning has a more heuristic approach and is focused
on improving performance of a learning agent. It also has other subfields such
as real-time learning and robotics - which are not part of data mining. The Data
Mining and Knowledge Discovery field integrates theory and heuristics. It focuses
on the entire process of knowledge discovery, including data cleaning,
learning, and integration and visualization of results.
1.3.1 Knowledge Discovery Process
The key distinguishing feature of the Knowledge Discovery field is its emphasis
on the process. KDD is not a single-step solution of applying a machine
learning method to a dataset, but a continuous process with many loops and
feedbacks. This process has been formalized by an industry group called
CRISP-DM, which stands for CRoss Industry Standard Process for Data Mining. The
main steps in the process include:
1. Business (or Problem) Understanding
2. Data Understanding
3. Data Preparation (including all the data cleaning and preprocessing)
4. Modeling (applying machine learning and data mining algorithms)
5. Evaluation (checking the performance of these algorithms)
6. Deployment
Although not officially part of CRISP, we should also
consider a 7th step - Monitoring, which completes the circle.
for more information on CRISP-DM.
1.3.2 Historical Note: Many names of Data Mining
The Data Mining and Knowledge Discovery field has been called by
many names.
In the 1960s, statisticians used terms like "Data Fishing"
or "Data Dredging" to refer to what they considered the bad practice of analyzing
data without an a priori hypothesis.
The term "Data Mining" appeared around 1990 in the database
community. Briefly, there was a phrase "database mining"™, but it was
trademarked by HNC (now part of Fair, Isaac), and researchers turned to
"data mining". Other terms used include Data Archaeology, Information Harvesting,
Information Discovery, Knowledge Extraction, etc.
Gregory Piatetsky-Shapiro coined the term
"Knowledge Discovery in Databases" for the first workshop on the same topic (1989), and
this term became more popular in the AI and machine learning community.
However, the term data mining became more popular in the business
community and in the press. As of Jan 2004, a Google search for "data
mining" finds over 2,000,000 pages, while a search for "knowledge discovery"
finds only 300,000 pages.
In 2003, "data mining" acquired a bad image because of its
association with the US government's TIA (Total Information Awareness) program.
Headlines such as "Senate Kills Data Mining Program", ComputerWorld,
July 18, 2003, referring to the US
Senate decision to close down TIA, show how much data mining became associated
with TIA.
Currently, Data Mining and Knowledge Discovery are used interchangeably,
and we also use these terms as synonyms.
1.4 Data Mining Tasks
Data mining is about many different types of patterns, and
there are correspondingly many types of data mining tasks. Some of the most
- Classification: predicting an item class
- Clustering: finding clusters in data
- Associations: e.g. A & B & C occur
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Deviation Detection: finding changes
- Estimation: predicting a continuous value
- Link Analysis: finding relationships
Classification refers to learning a method for predicting the
instance class from pre-labeled (classified) instances. This is the most
popular task, and there are dozens of approaches, including statistics (logistic
regression), decision trees, neural networks, etc.
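As a minimal illustration of learning from pre-labeled instances, the sketch below trains a one-level decision tree (a "decision stump") on a single numeric feature. It is a toy example under stated assumptions: two classes only, one feature, and made-up data; the function name and the "months since last purchase" feature are hypothetical and do not come from the lesson.

```python
def train_stump(values, labels):
    """Learn a one-feature threshold rule (a one-level decision tree).

    Tries every midpoint between neighboring distinct values and keeps the
    threshold and class orientation with the fewest training errors.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    points = sorted(set(values))
    if len(points) < 2:
        raise ValueError("need at least two distinct feature values")
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]

    best = None
    for threshold in candidates:
        # Try both ways of assigning the two classes to the two sides.
        for low, high in (classes, classes[::-1]):
            errors = sum(
                (low if v <= threshold else high) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)

    _, threshold, low, high = best
    return lambda v: low if v <= threshold else high


# Hypothetical training data: months since last purchase -> outcome.
months = [1, 2, 3, 8, 9, 10]
outcome = ["stays", "stays", "stays", "churns", "churns", "churns"]

predict = train_stump(months, outcome)
print(predict(4), predict(7))  # stays churns
```

Real classifiers use many features and guard against overfitting, but the pattern is the same: fit a rule on labeled instances, then apply it to new ones.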
The module examples show the difference between classification,
where we are looking for a method that distinguishes pre-classified groups, and
clustering, where no classes are given and we want to find some "natural"
grouping of instances.
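For contrast with classification, here is a minimal clustering sketch: basic k-means on one-dimensional data, where no labels are supplied and the groups emerge from the values themselves. The choice of k-means, the deterministic starting centers, and the sample data are all assumptions made for illustration; the lesson does not prescribe a particular clustering algorithm.

```python
def assign(values, centers):
    """Put each value into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, k, iterations=20):
    """Group numbers into k clusters with basic k-means."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    step = (len(ordered) - 1) / max(k - 1, 1)
    centers = [float(ordered[round(i * step)]) for i in range(k)]

    for _ in range(iterations):
        clusters = assign(values, centers)
        # Move each center to the mean of its cluster (keep it if empty).
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(values, centers)


# Made-up, unlabeled data with two visible groups.
centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], k=2)
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```

Unlike the classification case, nothing tells the algorithm what the groups mean; interpreting the clusters is left to the analyst.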
- Technology trends lead to data flood
  - data mining is needed to make sense of data
- Data Mining has many applications, successful and not
- Data Mining and Knowledge Discovery
  - Knowledge Discovery Process
- Data Mining Tasks
  - classification, clustering, ...