This lesson is a brief introduction to the field of Data Mining (also sometimes called Knowledge Discovery). It is adapted from Module 1 (Introduction) of the Machine Learning and Data Mining course.
1.1 Data Flood
Current technological trends inexorably lead to a data flood. More data is generated from banking, telecom, and other business transactions. More data is generated from scientific experiments in astronomy, space exploration, biology, high-energy physics, etc. More data is created on the web, especially in text, image, and other multimedia formats.
For example, Europe's Very Long Baseline Interferometry (VLBI) network has 16 telescopes, each of which produces 1 Gigabit/second (yes, per second!) of astronomical data over a 25-day observation session.
This truly generates an "astronomical" amount of data.
AT&T handles so many calls per day that it cannot store all of the data - and data analysis has to be done "on the fly".
A UC Berkeley study by Profs. Peter Lyman and Hal R. Varian estimated that 5 exabytes (5 million terabytes) of new data were created in 2002. Twice as much information was created in 2002 as in 1999 (~30% annual growth rate). The US produces ~40% of the world's new stored data.
As of 2003, according to a Winter Corp. survey, France Telecom had the largest decision-support database, at ~30 TB (terabytes); AT&T was in second place with a 26 TB database.
Some of the largest databases on the Web, as of 2003, include:
- Alexa (www.alexa.com) internet archive: 7 years of data, 500 TB
- Internet Archive (www.archive.org), ~300 TB
- Google, over 4 Billion pages, many, many TB
Knowledge Discovery is NEEDED to make sense of this data and put it to use.
1.2 Data Mining Application Examples
The areas where data mining has been applied recently include:
- drug discovery
- customer modeling and CRM (Customer Relationship Management)
- fraud detection
- health care
- telecom (telephone and communications)
- targeted marketing
- search engines and bots
- anti-terrorism efforts (we will discuss the controversy over privacy later)
- law enforcement
- profiling tax cheaters
One of the most important and widespread business applications of data mining is Customer Modeling, also called Predictive Analytics. This includes tasks such as:
- predicting attrition or churn, i.e., finding which customers are likely to terminate service
- targeted marketing:
- customer acquisition - finding which prospects are likely to become customers
- cross-sell - for a given customer and product, finding which other product(s) they are likely to buy
- credit risk - identifying the risk that this customer will not pay back the loan or credit card
- fraud detection - is this transaction fraudulent?
The largest users of Customer Analytics are industries with large numbers of customers, such as banking, telecom, and retail, which are making extensive use of these technologies.
1.2.1 Customer Attrition: Case Study
Let's consider a case study of a mobile phone company. The typical attrition (also called churn) rate for mobile phone customers is around 25-30% a year!
The task is
- Given customer information for the past N months (N can range from 2 to 18), predict who is likely to attrite in the next month or two.
- Also, estimate the customer's value and determine the most cost-effective offer to make to this customer.
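As a toy illustration of this churn-prediction task, the sketch below scores hypothetical customers by a few usage signals and ranks them for retention offers. The feature names, thresholds, and weights are invented for illustration; in a real system they would be learned from historical records of customers who did or did not churn.

```python
# Toy churn-risk scoring sketch (illustrative only; the features and
# weights below are invented, not from any real carrier's model).

def churn_score(customer):
    """Score a customer's churn risk from simple usage signals."""
    score = 0.0
    if customer["calls_to_support"] >= 3:
        score += 0.4              # frequent complaints -> higher risk
    if customer["monthly_minutes"] < 100:
        score += 0.3              # low engagement -> higher risk
    if customer["contract_months_left"] <= 1:
        score += 0.3              # contract about to expire
    return score

customers = [
    {"id": 1, "calls_to_support": 4, "monthly_minutes": 50,
     "contract_months_left": 0},
    {"id": 2, "calls_to_support": 0, "monthly_minutes": 600,
     "contract_months_left": 12},
]

# Rank customers so retention offers target the riskiest first.
ranked = sorted(customers, key=churn_score, reverse=True)
for c in ranked:
    print(c["id"], round(churn_score(c), 2))
```

The scoring step is the easy part; the hard part, as the case study shows, is building and validating the model (often regional models) from past data.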
Verizon Wireless is the largest wireless service provider in the United States with a customer base of 34.6 million subscribers as of 2003 (see http://www.kdnuggets.com/news/2003/n19/22i.html). Verizon built a customer data warehouse that
- Identified potential attriters
- Developed multiple, regional models
- Targeted customers with high propensity to accept the offer
- Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact over 34 million subscribers)
1.2.2 Assessing Credit Risk: Case Study
Let's consider a situation where a person applies for a loan.
Should a bank approve the loan?
Note: people who have the best credit don't need loans, and people with the worst credit are not likely to repay; a bank's best customers are in the middle.
Banks develop credit models using a variety of machine learning methods.
The proliferation of mortgages and credit cards is the result of being able to successfully predict whether a person is likely to default on a loan. Credit risk assessment is universally used in the US and widely deployed in most developed countries.
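One common family of credit models is logistic regression, which turns applicant features into a probability of default. The sketch below is a minimal illustration with hypothetical, hand-set weights; a real model would learn the weights from historical repayment data.

```python
import math

# Illustrative credit-risk scoring sketch. The weights and bias are
# hypothetical; in practice they are fitted (e.g., by logistic
# regression) to historical repayment data.

WEIGHTS = {"debt_to_income": 4.0, "late_payments": 0.8, "years_employed": -0.3}
BIAS = -2.0

def default_probability(applicant):
    """Logistic model: probability the applicant defaults on the loan."""
    z = BIAS + sum(WEIGHTS[k] * applicant[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

applicant = {"debt_to_income": 0.6, "late_payments": 2, "years_employed": 5}
p = default_probability(applicant)
print(f"default probability: {p:.2f}")
print("decline or price for risk" if p > 0.5 else "approve")
```

Note that the decision threshold (0.5 here) is itself a business choice, trading off lost customers against bad loans.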
1.2.3 Successful e-commerce - Case Study
Amazon.com is the largest online retailer; it started with books and expanded into music, electronics, and other products. Amazon.com has an active data mining group, which focuses on personalization. Why personalization? Consider a person who buys a book (product) at Amazon.com.
Task: Recommend other books (and perhaps products) this person is likely to buy
Amazon's initial, and quite successful, effort used clustering based on books bought.
For example, customers who bought "Advances in Knowledge Discovery and Data Mining", by Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", by Witten and Frank.
The recommendation program is quite successful, and more advanced programs are being developed.
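The spirit of this approach can be sketched with simple co-purchase counting: recommend the items most often bought together with a given item. The purchase baskets below are made up for illustration.

```python
from collections import Counter
from itertools import permutations

# "Customers who bought X also bought Y" sketch, in the spirit of
# Amazon's early item-based recommendations. Baskets are synthetic.

baskets = [
    {"KDD Advances", "Data Mining with Java"},
    {"KDD Advances", "Data Mining with Java", "Pattern Classification"},
    {"KDD Advances", "Pattern Classification"},
    {"Databases 101"},
]

# Count how often each ordered pair of items appears in the same basket.
co_counts = Counter()
for basket in baskets:
    for a, b in permutations(basket, 2):
        co_counts[(a, b)] += 1

def recommend(item, k=2):
    """Return up to k items most often co-purchased with `item`."""
    scored = [(n, other) for (it, other), n in co_counts.items() if it == item]
    return [other for n, other in sorted(scored, reverse=True)[:k]]

print(recommend("KDD Advances"))
```

Real recommenders add normalization (popular items co-occur with everything) and scale to millions of items, but the co-occurrence idea is the core.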
1.2.4 Unsuccessful e-commerce - Case Study (KDD Cup 2000)
Of course, application of data mining is no guarantee of success, and the Internet bubble of 1999-2000 provided plenty of examples.
Consider the legwear and legcare e-tailer Gazelle.com, whose clickstream and purchase data was the subject of the KDD Cup 2000 competition (http://www.ecn.purdue.edu/KDDCUP/).
One of the questions was: characterize visitors who spend more than $12 on an average order at the site.
The data included 3,465 purchases by 1,831 customers.
Very interesting and illuminating analysis was done by dozens of Cup participants. The total time spent was thousands of hours, which would have been equivalent to millions of dollars in consulting fees.
However, the total sales of Gazelle.com were only a few thousand dollars, and no amount of data mining could help them. Not surprisingly, Gazelle.com went out of business in August 2000.
1.2.5 Genomic Microarrays - Case Study
DNA microarrays are a revolutionary new technology that allows measurement of gene expression levels for many thousands of genes simultaneously (more about microarrays later). Microarrays have recently become a popular application area for data mining (see, for example, the SIGKDD Explorations Special Issue on Microarray Data Mining, Dec 2003, Vol. 5, Issue 2, www.acm.org/sigkdd/explorations/).
One of the typical problems is: given microarray data for a number of patients (samples), can we
- accurately diagnose the disease?
- predict the outcome for a given treatment?
- recommend the best treatment?
Consider a leukemia data set [Go99], with 72 samples and about 7,000 genes. The samples belong to two classes, Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML), which look similar under a microscope but have very different gene expression levels.
The best diagnostic model [PKR03] was learned on the training set (38 samples) and applied to a test set (the remaining 34 samples). The result: 33 samples were diagnosed correctly (97% accuracy).
Interestingly, the one error was consistently mislabeled by most algorithms and is suspected to have been mislabeled by the pathologist. So this may be one example where a computer-generated diagnosis is more accurate than the human expert.
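One simple way such a diagnostic model can work is a nearest-centroid classifier: average the expression profiles of each class on the training set, then assign a new sample to the class with the closest profile. The sketch below uses tiny synthetic vectors (real data has ~7,000 genes per sample) and is only one possible approach, not the actual model of [PKR03].

```python
# Nearest-centroid diagnosis sketch on synthetic expression data.
# Each sample is a short vector of gene expression levels.

def centroid(samples):
    """Per-gene average of a list of expression vectors."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def distance(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Training set: expression vectors grouped by known class label.
train = {
    "ALL": [[5.0, 1.0, 0.5], [4.5, 1.2, 0.4]],
    "AML": [[1.0, 4.8, 2.0], [0.8, 5.1, 2.2]],
}
centroids = {label: centroid(samples) for label, samples in train.items()}

def diagnose(sample):
    """Assign the class whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: distance(sample, centroids[label]))

print(diagnose([4.8, 1.1, 0.6]))   # close to the ALL profile
print(diagnose([0.9, 5.0, 2.1]))   # close to the AML profile
```

In practice a key extra step is gene selection: with ~7,000 genes and only 38 training samples, the model must be restricted to a few informative genes to avoid overfitting.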
1.2.6 Data Mining, Security and Fraud Detection
There are currently numerous applications of data mining for security and fraud detection. One of the most common is Credit Card Fraud Detection. Almost all credit card purchases are scanned by special algorithms that identify suspicious transactions for further action. I have recently received such a call from my bank, when I used a credit card to pay for a journal published in England. This was an unusual transaction for me (first purchase in the UK on this card) and the software flagged it.
Other applications include detection of money laundering - a notable system, called FAIS, was developed by Ted Senator for the US Treasury [Se96].
The National Association of Securities Dealers (NASD), which runs NASDAQ, has developed a system called Sonar that uses data mining to monitor insider trading and fraud through misrepresentation (http://www.kdnuggets.com/news/2003/n18/13i.html).
Many telecom companies, including AT&T, Bell Atlantic, British Telecom/MCI have developed systems for catching phone fraud.
Data mining and security were also very much in the headlines in 2003, with US Government efforts to use data mining for terrorism detection as part of the ill-named and now-closed Total Information Awareness (TIA) program. However, the problem of terrorism is unlikely to go away soon, and government efforts are continuing as part of other programs, such as CAPPS II or MATRIX.
Less controversial is the use of data mining for bio-terrorism detection, as was done at the 2002 Salt Lake City Olympics (the only thing found was a small outbreak of tropical diseases). The system used there did a very interesting analysis of unusual events - we will return to this topic later in this course.
1.2.7 Problems Suitable for Data Mining
The previous case studies show some of the successful (and unsuccessful) applications of data mining.
The areas where data mining applications are likely to be successful have these characteristics:
- require knowledge-based decisions
- have a changing environment
- have sub-optimal current methods
- have accessible, sufficient, and relevant data
- provide a high payoff for the right decisions
Also, if the problem involves people, then proper consideration should be given to privacy -- otherwise, as the TIA example shows, the result will be a failure, regardless of technical merits.
1.3 Knowledge Discovery
We define Knowledge Discovery in Data (KDD) as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
- from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
Knowledge Discovery is an interdisciplinary field, which builds upon a foundation provided by databases and statistics, and applies methods from machine learning and visualization in order to find useful patterns. Other related fields include information retrieval, artificial intelligence, OLAP, etc.
Some people say that data mining is essentially a fancy name for statistics. It is true that data mining has much in common with Statistics and with Machine Learning. However, there are differences.
Statistics provides a solid theory for dealing with randomness and tools for testing hypotheses. It does not study topics such as data preprocessing or visualization of results, which are part of data mining.
Machine learning takes a more heuristic approach and is focused on improving the performance of a learning agent. It also has other subfields, such as real-time learning and robotics, which are not part of data mining. The Data Mining and Knowledge Discovery field integrates theory and heuristics. It focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results.
1.3.1 Knowledge Discovery Process
The key emphasis of the Knowledge Discovery field is on the process. KDD is not a single-step application of a machine learning method to a dataset, but a continuous process with many loops and feedbacks. This process has been formalized by an industry group as CRISP-DM, which stands for CRoss Industry Standard Process for Data Mining. The main steps in the process include:
1. Business (or Problem) Understanding
2. Data Understanding
3. Data Preparation (including all the data cleaning and preprocessing)
4. Modeling (applying machine learning and data mining algorithms)
5. Evaluation (checking the performance of these algorithms)
6. Deployment (putting the results to use)
Although not officially part of CRISP, we should also consider a 7th step - Monitoring, which completes the circle.
See www.crisp-dm.org for more information on CRISP-DM.
1.3.2 Historical Note: Many Names of Data Mining
The Data Mining and Knowledge Discovery field has been called by many names.
In the 1960s, statisticians used terms like "Data Fishing" or "Data Dredging" to refer to what they considered the bad practice of analyzing data without an a priori hypothesis.
The term "Data Mining" appeared around 1990 in the database community. Briefly, there was a phrase "database mining"™, but it was trademarked by HNC (now part of Fair, Isaac), and researchers turned to "data mining". Other terms used include Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, etc.
Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases" for the first workshop on this topic (1989), and this term became more popular in the AI and Machine Learning community.
However, the term "data mining" became more popular in the business community and in the press. As of Jan 2004, a Google search for "data mining" finds over 2,000,000 pages, while a search for "knowledge discovery" finds only 300,000 pages.
In 2003, "data mining" acquired a bad image because of its association with the US government's TIA (Total Information Awareness) program. Headlines such as "Senate Kills Data Mining Program" (ComputerWorld, July 18, 2003), referring to the US Senate decision to close down TIA, show how strongly data mining became associated with TIA.
Currently, Data Mining and Knowledge Discovery are used interchangeably, and we also use these terms as synonyms.
1.4 Data Mining Tasks
Data mining looks for many different types of patterns, and there are correspondingly many types of data mining tasks. Some of the most popular are:
- Classification: predicting an item class
- Clustering: finding clusters in data
- Associations: e.g. A & B & C occur frequently
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Deviation Detection: finding changes
- Estimation: predicting a continuous value
- Link Analysis: finding relationships
Classification refers to learning a method for predicting the instance class from pre-labeled (classified) instances. This is the most popular task, and there are dozens of approaches, including statistics (logistic regression), decision trees, neural networks, etc.
The module examples show the difference between classification, where we look for a method that distinguishes pre-classified groups, and clustering, where no classes are given and we want to find some "natural" grouping of instances.
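That contrast can be sketched on a toy one-dimensional dataset: classification learns from the labels we already have, while clustering (here, a naive two-center k-means) must discover the groups on its own. Data and labels below are synthetic.

```python
# Classification vs. clustering on toy 1-D data.

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]

# Classification: learn per-class means from labeled instances,
# then assign a new instance to the nearest class mean.
means = {}
for x, y in zip(points, labels):
    means.setdefault(y, []).append(x)
means = {y: sum(xs) / len(xs) for y, xs in means.items()}

def classify(x):
    return min(means, key=lambda y: abs(x - means[y]))

print(classify(1.1))   # -> "low"

# Clustering: no labels; 1-D k-means discovers two groups itself.
centers = [points[0], points[-1]]          # naive initialization
for _ in range(10):
    groups = [[], []]
    for x in points:
        groups[min((0, 1), key=lambda i: abs(x - centers[i]))].append(x)
    centers = [sum(g) / len(g) for g in groups if g]

print(sorted(round(c, 2) for c in centers))  # the two discovered group centers
```

Note that clustering recovers essentially the same group structure, but without names for the groups: interpreting the clusters is left to the analyst.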
1.5 Summary
- Technology trends lead to a data flood
- Data mining is needed to make sense of the data
- Data mining has many applications, successful and not
- Data Mining and Knowledge Discovery; the Knowledge Discovery Process
- Data mining tasks: classification, clustering, ...
For more information on Data Mining and Knowledge Discovery, see resources covering:
- news and publications
- software and solutions
- courses, meetings, and education
- websites and datasets
- companies and jobs