Lesson: Data Mining and Knowledge Discovery: An Introduction

This lesson is a brief introduction to the field of Data Mining (which is also sometimes called Knowledge Discovery). It is adapted from Module 1: Introduction, Machine Learning and Data Mining Course.

1.1 Data Flood

Current technological trends lead inexorably to a data flood. More data is generated from banking, telecom, and other business transactions. More data is generated from scientific experiments in astronomy, space exploration, biology, high-energy physics, etc. More data is created on the web, especially in text, image, and other multimedia formats.

For example, Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second (yes, per second!) of astronomical data over a 25-day observation session. This truly generates an “astronomical” amount of data.

AT&T handles so many calls per day that it cannot store all of the data, and data analysis has to be done “on the fly”.

A UC Berkeley analysis by Profs. Peter Lyman and Hal R. Varian estimated that 5 exabytes (5 million terabytes) of new data was created in 2002. Twice as much information was created in 2002 as in 1999 (~30% growth rate). The US produces ~40% of new stored data worldwide.

As of 2003, according to the Winter Corp. Survey, France Telecom had the largest decision-support DB, ~30 TB (terabytes); AT&T was in second place with a 26 TB database.

Some of the largest databases on the Web, as of 2003, include
Knowledge Discovery is NEEDED to make sense of this data and to put it to use.
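As a rough check on the scale of the figures in Section 1.1, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: it assumes decimal units and assumes that every VLBI telescope records continuously for the whole session, neither of which is stated in the lesson. The function names are made up for this sketch.

```python
# Back-of-the-envelope checks for the data volumes quoted in Section 1.1.
# Assumptions (not from the lesson): decimal units (1 gigabit = 1e9 bits,
# 1 TB = 1e12 bytes) and continuous recording for the whole session.

SECONDS_PER_DAY = 24 * 60 * 60


def vlbi_session_terabytes(telescopes=16, gigabits_per_second=1.0, days=25):
    """Total data for one observation session, in terabytes."""
    bits = telescopes * gigabits_per_second * 1e9 * days * SECONDS_PER_DAY
    return bits / 8 / 1e12


def annual_growth_rate(growth_factor, years):
    """Compound annual growth rate implied by a total growth factor."""
    return growth_factor ** (1.0 / years) - 1.0


vlbi_tb = vlbi_session_terabytes()
doubling_rate = annual_growth_rate(2.0, 3)  # data doubled from 1999 to 2002

print(f"VLBI session: about {vlbi_tb:,.0f} TB")
print(f"Doubling over 3 years: about {doubling_rate:.0%} per year")
```

Under these assumptions one VLBI session comes to roughly 4,300 TB, and a doubling over three years corresponds to about 26% compound growth per year, which is in line with the rounded ~30% figure quoted above.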
1.2 Data Mining Application Examples

The areas where data mining has been applied recently include:
One of the most important and widespread business applications of data mining is Customer Modeling, also called Predictive Analytics. This includes tasks such as
The largest users of Customer Analytics are industries such as banking, telecom, and retail, where businesses with large numbers of customers make extensive use of these technologies.
1.2.1 Customer Attrition: Case Study

Let's consider a case study of a mobile phone company. The typical attrition (also called churn) rate for mobile phone customers is around 25-30% a year! The task is
Verizon Wireless is the largest wireless service provider in the United States with a customer base of 34.6 million subscribers as of 2003 (see http://www.kdnuggets.com/news/2003/n19/22i.html). Verizon built a customer data warehouse that
1.2.2 Assessing Credit Risk: Case Study

Let's consider a situation where a person applies for a loan. Should a bank approve the loan? Note: People who have the best credit don't need the loans, and people with the worst credit are not likely to repay. A bank's best customers are in the middle.

Banks develop credit models using a variety of machine learning methods. The proliferation of mortgages and credit cards is a result of being able to successfully predict whether a person is likely to default on a loan. Credit risk assessment is universally used in the US and widely deployed in most developed countries.

1.2.3 Successful e-commerce – Case Study

Amazon.com is the largest on-line retailer, which started with books and expanded into music, electronics, and other products. Amazon.com has an active data mining group, which focuses on personalization. Why personalization? Consider a person who buys a book (product) at Amazon.com. Task: Recommend other books (and perhaps products) this person is likely to buy.

Amazon's initial and quite successful effort used clustering based on books bought. For example, customers who bought “Advances in Knowledge Discovery and Data Mining”, by Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, also bought “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations”, by Witten and Frank. The recommendation program is quite successful, and more advanced programs are being developed.
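The "customers who bought X also bought Y" idea can be illustrated with a minimal sketch. Note that this is plain co-purchase counting, a simpler stand-in for the clustering approach mentioned above, and it is not Amazon's actual method. The purchase histories, customer ids, and function names are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_copurchase_counts(baskets):
    """Map each item to a Counter of items bought by the same customer."""
    counts = defaultdict(Counter)
    for items in baskets.values():
        # Every ordered pair of distinct items in one customer's history.
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(counts, item, top_n=3):
    """Items most often bought together with `item`, most frequent first.
    Ties are broken alphabetically so the result is deterministic."""
    ranked = sorted(counts[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical purchase histories: customer id -> titles bought.
baskets = {
    "c1": ["Advances in KDD", "Practical Data Mining"],
    "c2": ["Advances in KDD", "Practical Data Mining", "Cookbook"],
    "c3": ["Advances in KDD", "Travel Guide"],
    "c4": ["Cookbook", "Travel Guide"],
}

counts = build_copurchase_counts(baskets)
suggestions = recommend(counts, "Advances in KDD")
print(suggestions)
# ['Practical Data Mining', 'Cookbook', 'Travel Guide']
```

In this toy data, two of the three customers who bought the first title also bought the second, so it is ranked first. A production recommender would also need to handle popularity bias and very large catalogs, which this sketch ignores.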
1.2.4 Unsuccessful e-commerce - Case Study (KDD Cup 2000)

Of course, applying data mining is no guarantee of success, and during the Internet bubble of 1999-2000 we saw plenty of examples. Consider the legwear and legcare e-tailer Gazelle.com, whose clickstream and purchase data was the subject of the KDD Cup 2000 competition (http://www.ecn.purdue.edu/KDDCUP/). One of the questions was: Characterize visitors who spend more than $12 on an average order at the site. The data included a dataset of 3,465 purchases and 1,831 customers.

Very interesting and illuminating analysis was done by dozens of Cup participants. The total time spent was thousands of hours, which would have been equivalent to millions of dollars in consulting fees. However, the total sales of Gazelle.com were only a few thousand dollars, and no amount of data mining could help them. Not surprisingly, Gazelle.com went out of business in Aug 2000.

1.2.5 Genomic Microarrays – Case Study

DNA Microarrays are a revolutionary new technology that allows measurement of gene expression levels for many thousands of genes simultaneously (more about Microarrays later). Microarrays have recently become a popular application area for data mining (see, for example, SIGKDD Explorations Special Issue on Microarray Data Mining, Dec 2003 (Vol. 5, Issue 2), www.acm.org/sigkdd/explorations/).

One of the typical problems is: given microarray data for a number of patients (samples), can we accurately diagnose the disease? Predict the outcome for a given treatment? Recommend the best treatment?

Consider a Leukemia data set [Go99], with 72 samples and about 7,000 genes. The samples belong to two classes, Acute Lymphoblastic (ALL) and Acute Myeloid (AML), which look similar under a microscope but have very different gene expression levels. The best diagnostic model [PKR03] was learned on the training set (38 samples) and applied to a test set (the remaining 34 samples).
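The hold-out protocol just described (fit on 38 samples, evaluate on the remaining 34) can be sketched as follows. The lesson does not describe the [PKR03] model, so a simple nearest-centroid classifier is used here purely as a stand-in. The data is synthetic, with 20 made-up "genes" and an arbitrary class split, and all function names are hypothetical.

```python
import random


def nearest_centroid_fit(rows, labels):
    """Compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        total = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def nearest_centroid_predict(centroids, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: dist(centroids[label]))


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Synthetic stand-in for expression data (NOT the real Leukemia data set):
# 72 samples, 20 features, classes alternating ALL / AML.
rng = random.Random(0)
labels = ["ALL" if i % 2 == 0 else "AML" for i in range(72)]
rows = [[rng.gauss(1.0 if lab == "ALL" else -1.0, 1.0) for _ in range(20)]
        for lab in labels]

# Fit only on the first 38 samples; hold out the remaining 34 for testing.
train_rows, test_rows = rows[:38], rows[38:]
train_labels, test_labels = labels[:38], labels[38:]

centroids = nearest_centroid_fit(train_rows, train_labels)
predictions = [nearest_centroid_predict(centroids, r) for r in test_rows]
print(f"test accuracy: {accuracy(predictions, test_labels):.1%}")
```

The key design point is that the test samples play no part in fitting the model, so the measured accuracy estimates performance on unseen patients. The `accuracy` function computes the same measure as the figure reported for the real test set.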
The results were: 33 samples were diagnosed correctly (97% accuracy). Interestingly, the one misclassified sample was consistently misclassified by most algorithms and is suspected to have been mislabeled by the pathologist. So this may be one example where a computer-generated diagnosis is more accurate than the human expert.

1.2.6 Data Mining, Security and Fraud Detection

There are currently numerous applications of data mining for security and fraud detection. One of the most common is Credit Card Fraud Detection. Almost all credit card purchases are scanned by special algorithms that identify suspicious transactions for further action. I recently received such a call from my bank, when I used a credit card to pay for a journal published in England. This was an unusual transaction for me (the first purchase in the UK on this card) and the software flagged it.

Other applications include detection of money laundering: a notable system, called FAIS, was developed by Ted Senator for the US Treasury [Se96].

The National Association of Securities Dealers (NASD), which runs NASDAQ, has developed a system called Sonar that uses data mining for monitoring insider trading and fraud through misrepresentation (http://www.kdnuggets.com/news/2003/n18/13i.html).

Many telecom companies, including AT&T, Bell Atlantic, and British Telecom/MCI, have developed systems for catching phone fraud.

Data mining and security was also very much in the headlines in 2003 with US Government efforts on using data mining for terrorism detection, as part of the ill-named and now closed Total Information Awareness Program (TIA). However, the problem of terrorism is unlikely to go away soon, and government efforts are continuing as part of other programs, such as CAPPS II or MATRIX.

Less controversial is the use of data mining for bio-terrorism detection, as was done at the Salt Lake Olympics in 2002 (the only thing that was found was a small outbreak of tropical diseases).
The system used there did a very interesting analysis of unusual events; we will return to this topic later in this course.

1.2.7 Problems Suitable for Data Mining

The previous case studies show some of the successful (and unsuccessful) applications of data mining. The areas where data mining applications are likely to be successful have these characteristics:
Also, if the problem involves people, then proper consideration should be given to privacy; otherwise, as the TIA example shows, the result will be a failure, regardless of technical issues.

1.3 Knowledge Discovery

We define Knowledge Discovery in Data (KDD) as the non-trivial process of identifying
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996

Knowledge Discovery is an interdisciplinary field, which builds upon a foundation provided by databases and statistics and applies methods from machine learning and visualization in order to find useful patterns. Other related fields include information retrieval, artificial intelligence, OLAP, etc.

Some people say that data mining is essentially a fancy name for statistics. It is true that data mining has much in common with Statistics and with Machine Learning. However, there are differences.

Statistics provides a solid theory for dealing with randomness and tools for testing hypotheses. It does not study topics such as data preprocessing or results visualization, which are part of data mining.

Machine learning has a more heuristic approach and is focused on improving the performance of a learning agent. It also has other subfields, such as real-time learning and robotics, which are not part of data mining.

The Data Mining and Knowledge Discovery field integrates theory and heuristics. It focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results.

1.3.1 Knowledge Discovery Process

The key distinguishing feature of the Knowledge Discovery field is its emphasis on the process. KDD is not a single-step solution of applying a machine learning method to a dataset, but a continuous process with many loops and feedbacks. This process has been formalized by an industry group called CRISP-DM, which stands for CRoss Industry Standard Process for Data Mining.

The main steps in the process include:
1. Business (or Problem) Understanding
2. Data Understanding
3. Data Preparation (including all the data cleaning and preprocessing)
4. Modeling (applying machine learning and data mining algorithms)
5. Evaluation (checking the performance of these algorithms)
6. Deployment

Although not officially part of CRISP, we should also consider a 7th step, Monitoring, which completes the circle. See www.crisp-dm.org for more information on CRISP-DM.

1.3.2 Historical Note: Many names of Data Mining

The Data Mining and Knowledge Discovery field has been called by many names. In the 1960s, statisticians used terms like “Data Fishing” or “Data Dredging” to refer to what they considered the bad practice of analyzing data without an a priori hypothesis. The term “Data Mining” appeared around 1990 in the database community. Briefly, there was a phrase “database mining”™, but it was trademarked by HNC (now part of Fair, Isaac), and researchers turned to “data mining”. Other terms used include Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, etc.

Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” for the first workshop on the same topic (1989), and this term became more popular in the AI and Machine Learning community. However, the term data mining became more popular in the business community and in the press. As of Jan 2004, a Google search for “data mining” finds over 2,000,000 pages, while a search for “knowledge discovery” finds only 300,000 pages.
In 2003, “data mining” acquired a bad image because of its association with the US government's Total Information Awareness (TIA) program. Headlines such as “Senate Kills Data Mining Program” (ComputerWorld, July 18, 2003), referring to the US Senate decision to close down TIA, show how closely data mining became associated with TIA.
Currently, Data Mining and Knowledge Discovery are used interchangeably, and we also use these terms as synonyms.

1.4 Data Mining Tasks

Data mining is about many different types of patterns, and there are correspondingly many types of data mining tasks. Some of the most popular are
Classification refers to learning a method for predicting the class of an instance from pre-labeled (classified) instances. This is the most popular task, and there are dozens of approaches, including statistics (logistic regression), decision trees, neural networks, etc. The module examples show the difference between classification, where we are looking for a method that distinguishes pre-classified groups, and clustering, where no classes are given and we want to find some “natural” grouping of instances.
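The contrast between the two tasks can be made concrete with a minimal sketch on toy 2-D points. The techniques used here (a nearest-centroid classifier and k-means clustering) are illustrative choices, not the module's own examples, and the data and function names are made up. The classifier is given class labels and returns class names; the clustering routine sees only the points and returns anonymous group ids.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def train_classifier(points, labels):
    """Classification: labels are given; learn one centroid per known class."""
    classes = sorted(set(labels))
    return {c: mean_point([p for p, l in zip(points, labels) if l == c])
            for c in classes}


def classify(model, point):
    """Predict the class whose centroid is nearest to `point`."""
    return min(model, key=lambda c: sq_dist(point, model[c]))


def kmeans(points, k=2, iterations=10):
    """Clustering: no labels; group points around k centers (Lloyd's algorithm).
    Centers start at evenly spaced points, a simplification for this sketch."""
    centers = [points[i * len(points) // k] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean_point(members)
    return assignment


points = [(1.0, 1.2), (0.8, 1.0), (1.3, 0.7), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["low", "low", "low", "high", "high", "high"]

model = train_classifier(points, labels)
print(classify(model, (4.5, 4.7)))  # high
print(kmeans(points, k=2))          # [0, 0, 0, 1, 1, 1]
```

Clustering recovers the same two groups here, but only as ids 0 and 1; it has no notion of "low" or "high" because no classes were supplied.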
1.5 Summary
Knowledge Discovery Process
For more information on Data Mining and Knowledge Discovery, including
visit www.KDnuggets.com.
Copyright © 2006 KDnuggets.