Data Mining Course Outline

Parts of this course are based on the textbook by Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 1999, and its 2nd Edition (2005), referred to below as W&E. The course uses Weka software, and the final project is a KDD-Cup-style competition to analyze DNA microarray data.

The course is organized as 19 modules (lectures) of 75 minutes each.
(*) marks more advanced topics which can be skipped for a less advanced course.


M1: Introduction: Machine Learning and Data Mining

  • Data Flood
  • Data Mining Application Examples
  • Data Mining and Knowledge Discovery
  • Data Mining Tasks
Study: Course Notes,
Introduction to KDD (AI Mag 1996)

M2: Machine Learning and Classification

  • Machine Learning and Classification
  • Examples
  • Learning as Search
  • Bias
  • Weka
Study: W&E, Chapter 1.

M3: Input: Concepts, instances, attributes

  • What is a concept?
  • What is an example?
  • What is an attribute?
  • Preparing the data
Study: W&E, Chapter 2.

M4: Output: Knowledge Representation

  • Decision tables
  • Decision trees
  • Decision rules
  • Rules involving relations
  • Instance-based representation
Study: W&E, Chapter 3.

M5: Classification - Basic methods

  • OneR
  • NaiveBayes
Study: W&E, Chapter 4

M6: Classification: Decision Trees

  • Top-Down Decision Trees
  • Choosing the Splitting Attribute
  • Information Gain and Gain ratio
Study: W&E, Chapter 4
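
Illustration (not part of the course materials): a minimal sketch of how information gain and gain ratio could be computed for one nominal attribute. The function names and the toy data are hypothetical, chosen only for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain_and_ratio(values, labels):
    """Return (information gain, gain ratio) for a split on one nominal attribute."""
    total = len(labels)
    # Group the class labels by attribute value.
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Expected entropy after the split, weighted by subset size.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute values themselves.
    split_info = entropy(values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Hypothetical toy data: one attribute value and one class label per instance.
outlook = ["sunny", "sunny", "rainy", "rainy", "overcast", "overcast", "sunny", "rainy"]
play = ["no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
gain, ratio = info_gain_and_ratio(outlook, play)
print(f"gain = {gain:.3f} bits, gain ratio = {ratio:.3f}")
```

Gain ratio divides the gain by the split information, which penalizes attributes that split the data into many small subsets.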

M7: Classification: C4.5

  • Handling Numeric Attributes
      Finding Best Split
  • Dealing with Missing Values
  • Pruning
      Pre-pruning, Post-Pruning, Estimating Error Rates
  • From Trees to Rules
Study: W&E, Chapter 5

M8: Classification: CART

  • CART Overview and Gymtutor Tutorial Example
  • Splitting Criteria
  • Handling Missing Values
  • Pruning
      Finding Optimal Tree
Study: CART Tutorial, CART Manual

M9: Classification: more methods

  • Rules
  • Regression
  • Instance-based (Nearest neighbor)
Study: W&E, Chapter 4

M10: Evaluation and Credibility

  • Introduction
  • Classification with Train, Test, and Validation sets
      Handling Unbalanced Data; Parameter Tuning
  • *Predicting Performance
  • Evaluation on "small data": Cross-validation
  • *Bootstrap
  • Comparing Data Mining Schemes
  • *Choosing a Loss Function
Study: W&E, Chapter 5.
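
Illustration (not part of the course materials): a minimal sketch of k-fold cross-validation, assuming a trivial majority-class predictor as the scheme being evaluated. The folds here are not stratified, and all names and data are hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(labels, k=10, seed=0):
    """Estimate the accuracy of a majority-class predictor by k-fold cross-validation."""
    folds = k_fold_indices(len(labels), k, seed)
    correct = 0
    for test_fold in folds:
        held_out = set(test_fold)
        # Train on everything outside the current fold.
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        majority = Counter(train).most_common(1)[0][0]
        # Test on the held-out fold only.
        correct += sum(labels[i] == majority for i in test_fold)
    return correct / len(labels)

labels = ["yes"] * 14 + ["no"] * 6  # hypothetical class labels
accuracy = cross_validate(labels, k=5)
print(f"estimated accuracy = {accuracy:.2f}")
```

Each instance is used for testing exactly once, so the estimate uses all of the data without ever testing on an instance that was used for training in the same round.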

M11: Evaluation - Lift and Costs

  • Lift and Gains charts
  • *ROC
  • Cost-sensitive learning
  • Evaluating numeric predictions
  • MDL principle and Occam's razor
Study: W&E, Chapter 5.
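
Illustration (not part of the course materials): a minimal sketch of computing lift for the top-scored fraction of a list, assuming binary outcomes (1 = responder) with at least one responder. The function name, scores, and outcomes are hypothetical.

```python
def lift_at(scores, actuals, fraction):
    """Lift = response rate in the top-scored fraction / overall response rate."""
    # Rank instances from highest to lowest model score.
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    base_rate = sum(actuals) / len(actuals)  # assumes at least one responder
    return top_rate / base_rate

# Hypothetical model scores and actual outcomes for 10 customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actuals = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
lift = lift_at(scores, actuals, 0.2)
print(f"lift at top 20% = {lift:.2f}")
```

A lift of 1 means the model does no better than random selection; lift over the whole list is always 1.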

M12: Data Preparation for Knowledge Discovery

  • Data understanding
  • Data cleaning
  • Data transformation
  • Discretization
  • False "predictors" (information leakers)
  • Feature reduction, leaker detection
  • Randomization
  • Learning with unbalanced data

Study: Course notes

M13: Clustering

  • Introduction
  • K-means
  • Hierarchical

Study: W&E, Course notes

M14: Associations

  • Transactions
  • Frequent itemsets
  • Association rules
  • Applications

Study: Course notes

M15: Visualization

  • Graphical excellence and lie factor
  • Representing data in 1, 2, and 3-D
  • Representing data in 4+ dimensions
    • Parallel coordinates
    • Scatterplots
    • Stick figures
    • ...

Study: Course notes

M16: Summarization and Deviation Detection

  • Summarization
  • KEFIR: Key Findings Reporter
  • WSARE: What is Strange About Recent Events
Study: KEFIR book chapter and demo,
Rule-based Anomaly Pattern Detection for Detecting Disease Outbreaks, by Weng-Keen Wong et al. (about the WSARE system).

M17: Applications: Targeted Marketing and Customer Modeling

  • Direct Marketing Review
  • Evaluation: Lift, Gains
  • KDD Cup 1997
  • Lift and Benefit estimation
  • KDD Cup 1998
Study: KDD Cup 1997 report, KDD Cup 1998 report,
G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, Proc. KDD-99, ACM.

M18: Applications: Genomic Microarray Data Analysis

Study: SIGKDD Explorations Special Issue on Microarray Data Mining,
Capturing Best Practice for Microarray Gene Expression Data Analysis, G. Piatetsky-Shapiro, T. Khabaza, S. Ramaswamy, in Proceedings of KDD-2003.

M19: Data Mining and Society; Future Directions

  • Data Mining and Society: Ethics, Privacy, and Security issues
  • Future Directions for Data Mining
    web mining, text mining, multi-media data
  • Course Summary
Study: Knowledge Discovery in Databases vs. Personal Privacy Symposium, editor Gregory Piatetsky-Shapiro, IEEE Expert, April 1995.

Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003.