KDnuggets : News : 2004 : n23 : item4 < PREVIOUS | NEXT >

Features


Subject: CSI: Data Warehouse - Can Benford's Law tell fake data?

intelligententerprise.com, By Joe Celko, Dec 4, 2004

My wife and I love all the CSI police procedural dramas that are so popular now. The crime lab crew gets to the crime site knowing nothing about the situation and finds the bad guy in 60 minutes. How about a show called CSI: Data Warehouse on Tech TV?

Imagine that you walk into a client's office; he has a large amount of data and wants to know whether it is real or fake. You don't know anything about his data, or even his industry. It turns out that data qua data has some patterns that are fairly easy to find in a modern database. Let me give a quick overview, without much mathematics, of some of the easy ones.

The usual guess would be that all digits, one through nine, are equally likely to pop up at the start of a string of digits. Nope, not true; Benford's Law says that the probability of a given first digit d can be approximated by the formula P(d is first digit) = LOG10(1.0 + 1.0/d). The pattern is 30.1 percent for one, 17.6 percent for two, and down to 4.6 percent for nine. You can get some confirmation of this in "The First-Digit Phenomenon" by T. P. Hill (American Scientist, July-August 1998). The fit to Benford's Law gets better as the sample gets larger and more varied.
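As a quick check on those percentages, here is a minimal sketch in Python that evaluates the formula for each digit. The function name benford_first_digit_probs is made up for this illustration; it is not from the article.

```python
import math

def benford_first_digit_probs():
    """Return {d: P(d is first digit)} for d = 1..9 (hypothetical helper)."""
    # Direct translation of P(d) = LOG10(1.0 + 1.0/d)
    return {d: math.log10(1.0 + 1.0 / d) for d in range(1, 10)}

probs = benford_first_digit_probs()
for d, p in probs.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities sum to one, because the terms log10((d + 1)/d) telescope to log10(10).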

What makes Benford's Law useful to a data miner is that you don't have to understand the data. If the data drifts from the pattern, you know to look for a systematic bias or faked data. Like any statistic, it isn't a certainty, but it's a good place to start. In fact, there are fraud detection packages based on Benford's Law that look at patterns in expense reports and other financial data.
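One way to put a number on "drifts from the pattern" is a chi-square statistic comparing observed first-digit counts with the counts Benford's Law predicts. The sketch below is only an illustration under stated assumptions: the article does not name a test, the helper names first_digit and benford_deviation are invented here, the input is assumed to be finite numbers, and the 15.507 cutoff is the conventional 5 percent critical value for 8 degrees of freedom rather than anything the article prescribes.

```python
import math
from collections import Counter

def first_digit(value):
    """Leading nonzero digit of a finite number, or None for zero."""
    s = f"{abs(value):.15e}"  # scientific notation puts the leading digit first
    d = int(s[0])
    return d if d != 0 else None

def benford_deviation(values):
    """Return (observed first-digit shares, chi-square statistic vs. Benford)."""
    counts = Counter(d for d in map(first_digit, values) if d is not None)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no nonzero values to test")
    observed, chi2 = {}, 0.0
    for d in range(1, 10):
        expected = n * math.log10(1.0 + 1.0 / d)  # Benford's expected count
        observed[d] = counts[d] / n
        chi2 += (counts[d] - expected) ** 2 / expected
    return observed, chi2

# Assumed cutoff: chi-square critical value, 8 degrees of freedom, 5 percent level
CRITICAL_5PCT = 15.507

# Powers of two are a standard example of a sequence that follows the pattern
_, chi2_powers = benford_deviation([2 ** k for k in range(1, 1001)])

# Every first digit equally likely: the "usual guess", which Benford's Law rejects
_, chi2_uniform = benford_deviation(range(100, 1000))

print(chi2_powers < CRITICAL_5PCT, chi2_uniform > CRITICAL_5PCT)
```

A large statistic is a reason to look closer, not proof of fraud; data confined to a narrow range (heights, ZIP codes, capped expense amounts) can miss the pattern for innocent reasons.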

The rest of the story is in the full article at intelligententerprise.com.


Copyright © 2004 KDnuggets.