Interview: Kaiser Fung, NYU on Why Ignoring Data Integrity is a Recipe for Disaster

We discuss the different levels of data integrity, logical fallacies in Analytics, measures to boost accountability, the role of human intelligence in Analytics, and the relevance of the OCCAM framework.

Kaiser Fung provides training and advisory services in business analytics and data visualization. He held leadership roles in building and managing data teams at Vimeo, Sirius XM Radio, [X+1], and American Express. He is the creator of the acclaimed Junk Charts blog, which pioneered the genre of critically examining graphics in the media; and author of Numbers Rule Your World and Numbersense, both published by McGraw-Hill. He has an MBA from Harvard Business School, and is an adjunct professor at New York University. His work has appeared in Harvard Business Review, American Scientist, Significance, Financial Times, Wired, FiveThirtyEight, and CNN.

Here is my interview with him:

Anmol Rajpurohit: Q1. The Big Data revolution is partly based on the immense rise in our capability to measure and collect a wide variety of data. You argue that a lot of this data is pushed into Analytics projects without being evaluated for correctness. So, are you suggesting that there is a gap between what we think this data is and what it actually is? Can you share some examples to illustrate this gap?

Kaiser Fung: I like to say “data integrity” rather than “correctness”. There are multiple levels of integrity. One level is value integrity, which most people recognize. Are there invalid values? Do values get dropped accidentally? Another level is label integrity. For example, there is code to track clicks on the button on the left side of a webpage. If the designer moves the button to the right side, the developer copies and pastes the previous tracking code but once in a while, forgets to edit the tag so the analyst continues to interpret the data as left-button clicks.
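The copy-and-paste tag bug described above can be sketched in a few lines of Python. The function name, tag strings, and event structure here are hypothetical, purely for illustration:

```python
# Hypothetical sketch of a label-integrity bug: the button moved to the
# right side of the page, but the copied tracking code still fires the old
# "left_button" tag, so the analyst silently counts right-side clicks as
# left-side ones.
def track_click(tag: str, events: list) -> None:
    """Record a click event under the given tracking tag."""
    events.append({"tag": tag})

events = []
track_click("left_button", events)   # original left-side button
track_click("left_button", events)   # moved to the right; tag never edited

# The data looks fine -- no invalid values -- yet one label is wrong.
left_clicks = sum(1 for e in events if e["tag"] == "left_button")
print(left_clicks)  # 2, though only one click came from the left button
```

Nothing in the data itself flags the error, which is what makes label integrity harder to audit than value integrity.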
And then there is analytical integrity. Say, the traffic to your home page plunged last Monday because the tracking tag was inadvertently removed. The traffic existed, and just wasn’t measured, so the analyst extrapolated the missing value. However, no Web analytics software I know of has a solution to fix such mishaps permanently, and so anybody who ever looks at traffic data for any period that includes that Monday must make the adjustment. Needless to say, most analysts won’t even know about the anomaly.
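The manual adjustment described above might look like the following sketch, with hypothetical traffic numbers. Since the analytics tool stores the unmeasured Monday permanently, every analyst touching this period must repeat a fix like this:

```python
# A minimal sketch (hypothetical numbers): the tool recorded zero visits
# for the Monday the tracking tag was missing, so the analyst fills in the
# gap from the neighboring days.
visits = [10400, 9800, 0, 10100, 10250]  # daily home-page visits
gap_days = {2}  # index of the Monday with the missing tag

def adjust(series, gaps):
    """Replace known tracking gaps with the mean of their neighbors."""
    fixed = list(series)
    for i in gaps:
        fixed[i] = (series[i - 1] + series[i + 1]) / 2
    return fixed

print(adjust(visits, gap_days))  # the zero becomes (9800 + 10100) / 2 = 9950.0
```

The fragile part is not the arithmetic but the `gap_days` knowledge: it lives outside the data, and any analyst who lacks it will report the raw zero.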

AR: Q2. What are some of the most common logical fallacies in Analytics? How can we avoid them?

KF: One fallacy I’d like to see less of is “story time.” This term describes a popular structure found in many data analyses: first, the author ropes readers in with details of the statistics and the data-driven models, and then comes a moment when the narrative becomes more elaborate and drifts away from the data.

The Deflategate controversy during the recent Super Bowl provides a nice example. A data analyst made noise over a statistic showing that the New England Patriots fumbled far less often than other teams. Story time comes when the analyst slides from the data to the conclusion that the Patriots must have deflated footballs to achieve that statistic, a story on which the data shed no light at all. In the case of Warren Sharp, whose analysis of the fumble statistics went viral, I give him credit for avoiding story time: Sharp told readers he was speculating.

AR: Q3. As a business manager, how can one bolster accountability in Analytics processes? What factors would be indispensable for a checklist to ensure credibility of Analytics results?

KF: As analytics managers, we ought to taste our own dog food. We should quantify the impact of analytical projects. It is crucial to tie the outcomes to corporate metrics so business managers see the value of our work. For instance, if I run an A/B test on a specific page, and the results show a 10 percent increase in sales for daytime visitors, I express that gain in terms of total sales of all visitors, which is what the CEO cares about.
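The translation from a segment-level lift to the company-wide metric is simple arithmetic, sketched below with hypothetical figures (the sales numbers and 10 percent lift are illustrative, not from the interview):

```python
# Hypothetical illustration: converting a segment-level A/B test result
# into the total-sales metric a CEO cares about.
daytime_sales = 200_000.0   # assumed: sales attributed to daytime visitors
total_sales = 1_000_000.0   # assumed: sales from all visitors
segment_lift = 0.10         # 10% lift observed for daytime visitors

# Incremental sales if the winning variant is rolled out to daytime traffic.
incremental = daytime_sales * segment_lift

# The same gain, restated as a lift on total sales.
overall_lift = incremental / total_sales
print(f"Segment lift: {segment_lift:.0%}, overall lift: {overall_lift:.1%}")
# A 10% segment lift here amounts to a 2% lift on total sales.
```

Reporting the 2 percent figure rather than the 10 percent one keeps the analytics team accountable to the metric the business actually tracks.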

AR: Q4. As the world we live in becomes more and more data-driven, do you see any role for gut instinct? What is the most important role of human intelligence in Analytics?

KF: When I talk about “numbersense,” my point is that intuition or gut feeling plays an indispensable role in data analysis. The conventional wisdom is that data and intuition are polar opposites. That is a myth. The best data analysts are the ones who excel at harnessing their intuition to guide the analytical work. In my book, I argue that data analysis is inherently subjective.

AR: Q5. What is the OCCAM framework? How is it relevant?

KF: OCCAM is the acronym for a set of characteristics that are becoming more and more prevalent in new “Big Data” datasets and make the analyses of such datasets challenging. In short, many new datasets lack any kind of design, do not include explicit controls, are seemingly complete, are adapted from other purposes, and result from a series of imperfect merges. For example, I believe an Achilles heel of using Google search data to predict flu trends is that engineers are constantly tweaking the underlying search engine without concern for those adapting the data to predict flu trends or fulfill any number of other purposes. Please see my blog for more details on OCCAM.

Second part of the interview