# Nuts and Bolts of Data Mining: The Histogram

Tim Graettinger reviews a real workhorse for data mining and analysis - the histogram, including the most frequent types: "money", "count", and "outlier".

By Tim Graettinger, June 2012.

Practitioners, business users, developers, and academics love new data mining tools and methods. And yet, successful data mining requires much more than powerful tools. For all the strides that data mining tools have made during my 20-year career, using them well and interpreting their results still require hard work and serious, critical thought. Remember, "A fool with a tool is still a fool[f1]."

That's why I've been writing a series of articles on the nuts and bolts of data mining. In this series, we're reviewing what it takes to be successful with data mining, what the common pitfalls are, how to avoid or remedy problems, and how to interpret results.

This article addresses a real workhorse for data mining and analysis, the histogram. Histograms are bar charts that display the frequency distribution of a numeric quantity, like home value or income. The most famous frequency distribution is the classic bell-shaped curve, also known as the "normal" distribution[f2]. Although the bell-shaped histogram is well known and well understood mathematically, it does not occur that often in real-world practice. The histograms encountered most frequently in practice are the "money", "count", and "outlier" histograms. We will look at each of them in turn.
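To make the idea of a frequency distribution concrete, here is a minimal sketch of how a histogram can be computed: split the range of the data into equal-width bins and count the values in each one. The `histogram` helper, the choice of 12 bins, and the simulated bell-shaped sample (mean 100, standard deviation 15) are all assumptions made for illustration; they are not taken from any particular tool.

```python
import random

def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 if every value is identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

random.seed(0)
# Simulated bell-shaped data (illustrative parameters only).
sample = [random.gauss(100, 15) for _ in range(2000)]
edges, counts = histogram(sample, 12)

# Text rendering: one '#' per 10 observations in the bin.
for left, count in zip(edges, counts):
    print(f"{left:7.1f} | {'#' * (count // 10)}")
```

Each printed row is one bar of the histogram, labeled by the left edge of its bin. With bell-shaped data like this sample, the tallest bars should sit near the middle of the range.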

## The Money Histogram

"Money" histograms arise in practice when financial data are plotted. The data are usually transaction amounts (home values, salaries, prices paid for products, gift amounts donated to a charity) that are always positive. Figure 1 displays a sample of home values. There is a left-hand "wall" at 0, and the data push out to the right toward higher and higher positive values. Notice that the very high values are so few that they don't even show up on the chart.
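The shape described here can be reproduced with simulated data. The sketch below is not the data behind Figure 1; it assumes, purely for illustration, that home values follow a log-normal distribution with a median near $200,000, and it groups them into $100,000 bins. The distribution, its parameters, and the bin width are all hypothetical choices.

```python
import random
import statistics

random.seed(42)
# Hypothetical home values: log-normal, median near exp(12.2), about $200,000.
home_values = [random.lognormvariate(12.2, 0.6) for _ in range(5000)]

bin_width = 100_000
n_bins = int(max(home_values) // bin_width) + 1
counts = [0] * n_bins
for v in home_values:
    counts[int(v // bin_width)] += 1

for i, count in enumerate(counts):
    # One '#' per 50 homes, so bins with fewer than 50 homes draw no bar,
    # mirroring how sparse high values vanish from a chart.
    bar = "#" * (count // 50)
    print(f"${i * bin_width:>9,} | {bar} ({count})")

print("median:", round(statistics.median(home_values)))
print("mean:  ", round(statistics.mean(home_values)))
```

In a run of this sketch, every value is positive (the "wall" at 0), most of the mass sits in the first few bins, and the upper bins hold so few homes that no bar is drawn for them. The mean also lands above the median, which is typical of data with a long right tail.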