Why Is an Understanding of Truth Important in Data Science?

Data Science can be used to discover correlations (What phenomena occurred) but cannot be used to establish causality (Why the phenomena occurred).

By Michael L. Brodie, CSAIL, MIT

The informal, social notion of absolute truth that often lies deep in our consciousness has no place in data science. Such a deeply embedded belief may inhibit our ability to discover systematically observable properties of the phenomena that we are exploring, i.e., plausible models. Belief in a single truth violates the scientific method by anticipating a specific outcome, i.e., by creating a bias towards that outcome.

Data Science
In data science, as in science, we are trying to discover plausible hypotheses that might be proven under specific conditions to systematically recur, e.g., a cancer that occurs in a specific context due to specific factors and might be minimized or eliminated by means of a specific treatment.

Data science involves hypothesizing (theory-driven or top-down) or discovering (data-driven or bottom-up) systematically observable properties of a phenomenon, i.e., a model, under specific conditions.[2] Belief in an absolute truth may suggest that there is only one model, one set of properties. As Plato’s allegory of the cave taught us 2,400 years ago, we cannot observe the “real” thing; we can observe only an image of the thing (a model) from our perspective. In science, as in life, understanding of a phenomenon may be enriched by observing the phenomenon from multiple perspectives (models). A recent scientific trend, e.g., pursued in biology by Pardis Sabeti at Harvard, involves a shift from understanding a phenomenon with one theory (perspective) to using multiple theories or models in what is called ensemble modeling. Ensemble modeling is the process of running two or more related but different analytical models and then synthesizing the results into a single score or spread in order to improve the accuracy of predictive analytics and data mining applications.
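The idea of synthesizing several models into a single score and spread can be sketched in a few lines of Python. This is only an illustration: the three models, their coefficients, and the function name `ensemble_score` are hypothetical, and simple averaging is just one of many ways to combine models.

```python
from statistics import mean, pstdev


def ensemble_score(models, x):
    """Run every model on the same input and combine the outputs.

    Returns (score, spread): the mean prediction and the population
    standard deviation of the individual predictions.
    """
    predictions = [model(x) for model in models]
    return mean(predictions), pstdev(predictions)


# Three hypothetical models (perspectives) of the same phenomenon.
def linear_model(x):
    return 2.0 * x + 1.0


def quadratic_model(x):
    return 0.5 * x * x + 3.0


def constant_model(x):
    return 8.0


score, spread = ensemble_score(
    [linear_model, quadratic_model, constant_model], 3.0
)
print(f"score={score:.2f} spread={spread:.2f}")
```

A small spread suggests that the perspectives agree on this input; a large spread suggests that the models see the phenomenon differently and that no single one should be taken as the truth.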


My definition of Data Science is based on the definition of the scientific method as a process of acquiring new knowledge, and correcting and integrating previous knowledge, meaning that it is part of a continuous discovery process.


Data Science is a body of principles and techniques for applying data analytics to accelerate the investigation of phenomena by acquiring new data, correcting and combining it with previous data, with measures of correctness, completeness, and efficiency of the derived results (correlations) with respect to some pre-defined (theoretical, deductive, top-down) or emergent (inductive, bottom-up) specification (scope, question, hypothesis, requirement).[3]


Correlation versus Causation
What versus Why: Data Science can be used to discover correlations (What phenomena occurred) but cannot be used to establish causality (Why the phenomena occurred).

Data Science involves discovering What – significant facts or patterns concerning phenomena. These are called correlations amongst variables. Ideally, data science methods will help us identify highly probable (plausible) hypotheses (correlations) that will be proven causal by other means. Data Science involves accelerated methods of discovering THAT correlations occur under certain conditions and with certain probabilities; it cannot discover Why – whether the correlations between variables are causal, i.e., explain why the observed correlations occurred. Once data science has been used to establish one or more highly probable hypotheses (correlations), we put aside data science and turn to the conventional methods of the domain in question to establish causality, or Why the observed phenomena occurred.
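A small sketch may make the What versus Why distinction concrete. The data below are invented for illustration: a hypothetical third variable (temperature) drives two others, which therefore correlate strongly even though neither causes the other. The function `pearson` is a plain implementation of the Pearson correlation coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: temperature is a common driver of both series.
temperature = [10, 15, 20, 25, 30, 35]
noise_a = [1, -1, 2, -2, 1, -1]
noise_b = [-1, 2, -1, 1, -2, 1]
ice_cream_sales = [3 * t + a for t, a in zip(temperature, noise_a)]
sunburn_cases = [2 * t + b for t, b in zip(temperature, noise_b)]

# The coefficient is close to 1, yet ice cream does not cause sunburn.
r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation = {r:.3f}")
```

The computation reports THAT the two series move together. Whether one causes the other, or both depend on something else, has to be settled by the methods of the domain.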

Single Version of Truth: Banks must maintain a single version of truth for your bank account, not multiple versions, since you want the bank to make sure that every euro you put in is credited to you and every euro taken out is credited to the person you are paying. “One version of truth” applies to most businesses that want a persistent, reliable record of all business transactions. Databases were first developed for banking and business; hence they claim to support a single version of truth. While this is critical for some problems, e.g., business transactions, it is not true for most of the rest of the world. Hence, database products do not support multiple models, i.e., the reality of science and life in general. For over 40 years, researchers have tried but failed to develop databases that support multiple perspectives or multiple semantic models.

Most assertions are unprovable: 98% of what people say are opinions that are impossible to prove as “true”. The previous sentence is an opinion, hence unprovable. However, it suggests that almost all assertions are mere opinions and should be considered as opinions.

What is a bias? Understanding a phenomenon means that we have knowledge of the phenomenon. Following the above discussion of truth, our knowledge – ideally verifiable, systematic observations under specific conditions – is relative to the data we have and the models (perspectives) that we have used to establish the knowledge (informally, the truths of the phenomenon). Recently, it has been observed that algorithms used in many areas (mortgage and loan approvals, hiring and promotion, parole and sentencing) are biased. To be biased means to be prejudiced in favor of or against one thing, person, or group compared with another, usually in a way considered to be unfair. For example, automated parole systems have been shown by ProPublica to be biased.

Specifically, ProPublica showed that automated parole systems systematically made parole decisions that disadvantaged minorities, i.e., blacks and females, with all other factors being equal. In the terms used above, the automated parole system is based on a model that is inconsistent with a community model of fairness towards minorities. It could be that the firms that designed the systems believe, based on some evidence (i.e., knowledge), that minorities reoffend more than whites and males, even though the political disposition of the community is that minorities should be treated just like non-minorities, i.e., receive the same sentences.

When is knowledge biased? When the knowledge used to produce a model is in conflict with another model, the two models are biased with respect to each other. Assuming the knowledge on which the parole system model is based is verifiable under the conditions in which it is applied, it is biased with respect to a model based on fairness to minorities, which may be a political aspiration rather than a reality.

How do you prove the veracity of a model? In this case, what is the recidivism rate for the automated parole system versus a parole system based on a model that supports political fairness to minorities? If the original parole system model has a better recidivism rate than the fairness model, does society select better recidivism over fairness? This is a modelling question that is outside the realm of data science.

A deeper question is: how do you detect bias in algorithms? You need to evaluate and compare the model underlying the algorithm against some other model. Only models can be biased with respect to each other.
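Comparing one model against another can be sketched as follows, using loan approvals rather than parole for simplicity. Everything here is hypothetical: the applicants, the groups "A" and "B", the thresholds, and both models are invented for illustration, and the gap in approval rates between groups is only one of many possible measures.

```python
def approval_rate_by_group(model, cases):
    """Fraction of cases approved by `model`, computed per group."""
    totals, approved = {}, {}
    for case in cases:
        group = case["group"]
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if model(case) else 0)
    return {group: approved[group] / totals[group] for group in totals}


def rate_gap(rates):
    """Largest difference in approval rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical applicants: incomes match across groups, area scores do not.
cases = [
    {"group": "A", "income": 50, "area_score": 8},
    {"group": "A", "income": 40, "area_score": 7},
    {"group": "A", "income": 30, "area_score": 9},
    {"group": "B", "income": 50, "area_score": 3},
    {"group": "B", "income": 40, "area_score": 4},
    {"group": "B", "income": 30, "area_score": 2},
]


def model_one(case):
    # Uses income and an area score that happens to track the group.
    return case["income"] >= 40 and case["area_score"] >= 5


def model_two(case):
    # Uses income only.
    return case["income"] >= 40


gap_one = rate_gap(approval_rate_by_group(model_one, cases))
gap_two = rate_gap(approval_rate_by_group(model_two, cases))
print(f"model one gap = {gap_one:.2f}, model two gap = {gap_two:.2f}")
```

The two models disagree about the same applicants, so each is biased with respect to the other. Which of them is acceptable is a question for the community that uses the system, not one that the computation can answer.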


Marcus Aurelius, 121-180 AD.

"Everything we hear is an opinion, not a fact. Everything we see is a perspective, not the truth."

Shakespeare, 1564-1616: The Tragedy of Hamlet, Prince of Denmark, Act 2, scene 2

Hamlet: Why, then, 'tis none to you, for there is nothing either good or bad, but thinking makes it so. To me it is a prison. [A reference to Hamlet’s earlier “Denmark's a prison.”]

Modernized: Well, then it isn't one to you, since nothing is really good or bad in itself—it's all what a person thinks about it. And to me, Denmark is a prison.

[1] M. Braschler, T. Stadelmann, K. Stockinger (Eds.), “Applied Data Science - Lessons Learned for the Data-Driven Business”, Berlin, Heidelberg: Springer, expected 2018
[2] M.L. Brodie, “Necessity is the Mother of Invention: On Developing Data Science”, to appear in [1]
[3] M.L. Brodie, “What is Data Science?”, to appear in [1]

Note: This post resulted from a discussion with Dora (Θεοδωρα Μπουκουρα), an inquisitive Greek biology student interested in Data Science, following my keynote: Data: The World’s Most Valuable Resource, 2017 Onassis Lectures in Computer Science on Big Data and Applications, Heraklion, Crete, Greece, July 10, 2017.

Bio: Dr. Brodie has over 40 years of experience in research and industrial practice in databases, distributed systems, integration, AI, and multi-disciplinary problem solving. He is concerned with the Big Picture aspects of information ecosystems, including business, economic, social, application, and technical aspects. Dr. Brodie is a Research Scientist at CSAIL, MIT; advisor at Tamr.com; serves on Advisory Boards of national and international research organizations; and is an adjunct professor at the National University of Ireland, Galway and at the U. of Technology, Sydney.