Big Data = Big Trouble and Noise vs. Signal

Nassim Taleb (of Black Swan fame) argues that more data is more trouble, while Google scientists talk about "The Unreasonable Effectiveness of Data." Who is right?

WhatstheBigData.com, Gil Press, May 30, 2012

An excerpt from Nassim Taleb's forthcoming book, Antifragile, was posted yesterday on the Farnam Street blog.

In "Noise and Signal," Taleb says that

"In business and economic decision-making, data causes severe side effects - data is now plentiful thanks to connectivity; and the share of spuriousness in the data increases as one gets more immersed into it. A not well discussed property of data: it is toxic in large quantities - even in moderate quantities.... the best way... to mitigate interventionism is to ration the supply of information, as naturalistically as possible. This is hard to accept in the age of the internet. It has been very hard for me to explain that the more data you get, the less you know what's going on, and the more iatrogenics you will cause."

In other words, big data equals big trouble. Taleb is right to warn of the dangers of blindly falling in love with data, and we are all familiar with the dangers of data-driven misdiagnosis and intervention, not just in healthcare but in policymaking, education, and business decisions.

But a sweeping statement that more data is always bad is also dangerous. Is intuition (i.e., no data) always better? Is a small amount of data always better than lots of data? Does noise rise proportionally to the increase in the volume of data?

Many data scientists today would answer with a resounding "no." More data is always better, they argue. Their major reference point is the 2009 paper by Alon Halevy, Peter Norvig, and Fernando Pereira, "The Unreasonable Effectiveness of Data." It describes how Google's "trillion-word corpus with frequency counts for all sequences up to five words long" can serve as "the basis of a complete model for certain tasks - if only we knew how to extract the model from the data." The paper demonstrates the usefulness of this corpus for language processing applications and argues for the superiority of this wealth of data over preconceived ontologies.
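
To make "frequency counts for all sequences up to five words long" concrete, here is a toy sketch in Python. It is only an illustration of the idea, not the method used to build Google's corpus: the function name ngram_counts and the sample sentence are invented for this example, and a real trillion-word corpus would need far more careful tokenization and storage.

```python
from collections import Counter


def ngram_counts(text, max_n=5):
    """Count every word sequence of length 1..max_n in `text`.

    Hypothetical helper for illustration: words are split on
    whitespace and lowercased, with no further cleanup.
    """
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Made-up sample sentence, not taken from any real corpus.
sample = "more data is more trouble and more data is more noise"
counts = ngram_counts(sample)
```

In this tiny sample, counts[("more", "data")] is 2 and counts[("more",)] is 4. The paper's point is that, at a scale of a trillion words, tables of this kind capture enough about how language is actually used to support certain tasks without a hand-built ontology.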
