KDnuggets : News : 2009 : n16 : item30

Publications

From: Dan Graettinger
Date: Thu, 13 Aug 2009
Subject: Data Mining Misconceptions #1: The 50/50 Problem

This article by Tim Graettinger, president of Discovery Corps, helps to explain, in language the average businessperson can understand, some of the misconceptions which potential clients have about data mining.

It was published in the most recent issue of The Data Administration Newsletter [TDAN].

Discovery Corps, Inc. is a data visualization, data mining, and predictive analytics consultancy headquartered outside of Pittsburgh, PA.

Data Mining Misconceptions #1: The 50/50 Problem

By Tim Graettinger

This fall will mark my twentieth year as a data mining professional. Thank you. During that time, I have worked at five different companies - mostly startups - and consulted for many, many clients. The changes in the data mining field over that period are startling in terms of the computational horsepower available, the size of the databases being generated, and the software tools developed to model and analyze them. At the same time, scant progress has been made in educating the public, in general, and clients, in particular, about data mining. There are many untruths, half-truths, and downright false statistics floating around about how data mining works and how it is used. In this and future articles, I intend to clear up a few of the most pervasive of these misconceptions.

Some misconceptions arise from simple errors in logic. Often, they stem from a lack of familiarity or experience. None are particularly technical problems. All are easily remedied with simple examples and simple explanations. In this article, I will focus on one misconception that I call the "50/50 problem."

An Example of the 50/50 Problem

Recently, I was working with a very bright, energetic client in the biotech industry. Her firm builds imaging equipment and provides services to pharmaceutical companies. The imaging equipment (calling it a complex, microscope-like camera is far too wordy) generated data that she wanted to use to classify chemical compounds as promising or unpromising candidates for drugs. It turns out that in the vast world of chemical compounds, there are more unpromising drug candidates than promising ones - a lot more. My job was to use data mining techniques to create a classifier (a mathematical formula or a set of rules) that would successfully distinguish promising drug candidates from unpromising ones - using data produced by the imaging equipment.

After some initial work, I presented a classifier to my client. I happily reported that the classifier correctly labeled promising compounds as promising 10% of the time. My client was completely underwhelmed. Her knee-jerk response was, "But you can do 50% just by flipping a coin!"

Actually, a very simple classifier can do much better than 50%. I mentioned earlier that there are many more unpromising compounds than promising ones. In this project, 999 out of every 1000 compounds were unpromising, or 99.9%. A classifier that labels every compound as unpromising is therefore correct 99.9% of the time. Despite its apparently high accuracy, such a classifier is worthless to a pharmaceutical company. Why? Such a classifier would recommend that no compound ever be developed further as a potential drug. If the company strictly abided by the classifier, life-saving research would come to an abrupt halt.
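
The arithmetic behind that baseline can be sketched in a few lines of Python. The 999-in-1000 ratio comes from the project described above; the batch size of 100,000 compounds and the function name are assumptions made purely for illustration.

```python
from fractions import Fraction


def always_unpromising_metrics(n_compounds, promising_rate):
    """Score a classifier that labels every compound as unpromising.

    Returns (accuracy, share_of_promising_found) as exact fractions.
    Hypothetical helper for illustration; not part of the original project.
    """
    n_promising = int(n_compounds * promising_rate)
    n_unpromising = n_compounds - n_promising

    # Every unpromising compound is labeled correctly,
    # and every promising compound is missed.
    accuracy = Fraction(n_unpromising, n_compounds)
    share_found = Fraction(0, n_promising) if n_promising else None
    return accuracy, share_found


# Assumed batch size: 100,000 compounds, 1 in 1000 promising.
accuracy, share_found = always_unpromising_metrics(100_000, Fraction(1, 1000))
print(f"accuracy: {float(accuracy):.1%}")
print(f"promising compounds found: {float(share_found):.1%}")
```

Under these assumptions the classifier is right on 99,900 of 100,000 compounds, yet it finds none of the 100 promising ones, which is why overall accuracy alone says little here.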

The 50/50 Problem in a Nutshell

Is a misconception becoming evident? My client, like many intelligent people, made a simple error in thinking. She assumed that, because there were two possible outcomes (promising and unpromising), each outcome was 50% likely. This is the "50/50 problem."
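
As a rough illustration of why two outcomes need not be equally likely, the sketch below works out what a fair coin would be expected to do on this problem. It assumes the 1-in-1000 rate quoted earlier and a hypothetical batch of 1,000,000 compounds; the function name is invented for this example, and the figures are expected values, not results from the project.

```python
from fractions import Fraction


def coin_flip_expectation(n_compounds, promising_rate):
    """Expected outcome of labeling compounds by a fair coin flip.

    Returns (flagged, flagged_and_promising, share_of_flagged_promising).
    Hypothetical helper for illustration only.
    """
    half = Fraction(1, 2)
    n_promising = n_compounds * promising_rate

    # The coin ignores the data, so it flags half of everything:
    # half of all compounds, and half of the promising ones.
    flagged = n_compounds * half
    flagged_and_promising = n_promising * half

    share = flagged_and_promising / flagged
    return flagged, flagged_and_promising, share


# Assumed batch size: 1,000,000 compounds, 1 in 1000 promising.
flagged, hits, share = coin_flip_expectation(1_000_000, Fraction(1, 1000))
print(f"compounds the coin calls promising: {int(flagged)}")
print(f"of those, actually promising: {int(hits)}")
print(f"share of flagged compounds that are promising: {float(share):.1%}")
```

Under these assumptions the coin calls 500,000 compounds promising, and only about 500 of them (0.1%) actually are. The coin's labels split 50/50, but the outcomes themselves do not.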

Read the rest of this article at the Discovery Corps website



Copyright © 2009 KDnuggets.