Doubt and Verify: Data Science Power Tools
In the end, there is no truth, no ultimate ground truth, no lie-free utterances, as everything is contextual based on incomplete facts and knowledge. All world models are flawed, but Data Science has two power tools.
By Michael L. Brodie (CSAIL, MIT)
Doubt everything. Use evidence-based methods to verify things that matter.
Did Our Teachers Lie to Us? Do Doctors, Lawyers, and Other Professionals Lie?
I had wonderful, engaged and engaging, deeply knowledgeable, and opinionated teachers in my rural Canadian high school, especially in history (“Grannie” Smith and Mr. Snider), chemistry (Mr. Fish and Mr. Thompson), and English (“Wild Bill” Elliot). They were passionate about their topics, taught us amazing things, and made huge contributions to my “world model”.
To what extent were our teachers’ facts true and their conclusions or “theories” accurate reflections of the state of the world that they taught? In history, English, chemistry, and most domains there is substantial continuing research that changes our understanding of those aspects of the world. Was Waterloo fought at Waterloo? Were the Dark Ages truly dark in any significant way?
Similarly, in the 1980s Boston physicians voted Dr. Solomon, my doctor, the best doctor in Boston. He has been marvelous, providing me with 33 years of wonderful healthcare. As in the humanities and sciences, there has been massive research and consequent progress in healthcare, fitness, medicine, and pharmacology, leading to radically new understandings in those areas.
The job of a professional - doctor, lawyer, teacher - is to do the best that they can to ensure good health, the best legal protection, and the best education, based on the most current information and knowledge in their area - the most current medical practices, avoiding proven bad practices, the most current laws, the most current understanding of history, chemistry, and English. What is the chance that any professional can satisfy these requirements?
Secondly, to what extent can they do their job in a neutral or unbiased way? Having a bias is highly likely, so do professionals understand that they have one? Do they make clear to their “customers” that they have a bias, and its nature, so that customers can either accept the biased information or select information that better suits their needs? Doctors, lawyers, and teachers may follow a specific orientation (e.g., a bias for or against using certain drugs or procedures, or specific interpretations of the law or of history), but in principle they should present their material in an unbiased way that allows customers to formulate and select amongst alternative views.
In all fields new facts and knowledge are constantly being produced based on new data, discoveries, experience, and research - far more than a single individual can absorb let alone put into practice. So how do professionals - how does anyone:
- Understand that they have a bias, its nature and limitations?
- Weigh their bias and those of others in their thinking and actions?
- Re-evaluate their knowledge (world view) in light of new facts (“ground truth”) and conclusions?
Dealing with these issues is a massive undertaking, both conceptually (changing their world model) and practically (keeping up to date), in addition to and often separate from their day job of patient care, legal cases, and teaching. Since facts and conclusions (theories) change so fast, it is unlikely that most professionals are “up to date”, quite apart from the fact that there is no consensus as to what “up to date” means. The Higgs Boson was not discovered overnight in 2012 (proven within five sigma), a marvelous year for high energy particle physics, but over 40 years, and indeed to ten sigma only in 2014.
Fig. 1. Is there a Higgs Boson In There?
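To make "five sigma" concrete, here is a minimal sketch that converts a significance level in standard deviations into the chance of seeing a fluctuation at least that large by accident. It assumes the one-sided Gaussian tail convention that is common in particle physics; the convention and the function name `one_sided_p_value` are assumptions made for illustration, not something the discussion above specifies.

```python
import math


def one_sided_p_value(n_sigma: float) -> float:
    """Probability that a standard normal variable exceeds n_sigma.

    Assumption: one-sided Gaussian tail, P(Z > n_sigma).
    """
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))


# Print the tail probability for a few illustrative significance levels.
for n in (1, 2, 3, 5):
    p = one_sided_p_value(n)
    print(f"{n} sigma: p = {p:.3g} (about 1 in {1 / p:,.0f})")
```

Under this convention, five sigma corresponds to roughly a one in 3.5 million chance that pure noise would produce a signal this strong, which is why the threshold is treated as strong evidence rather than as proof.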
Since knowledge is the cumulative result of learning and forgetting - verifying constantly emerging facts and deducing (empirically or theoretically) new conclusions - our knowledge (world model) is necessarily incomplete, hence highly likely to be faulty and biased (oriented towards specific beliefs). It is highly unlikely that a professional operates in an unbiased way in full knowledge of empirically proven facts and theories. Hence, as history has often proven (e.g., models of “elementary particle” physics), much education and professional activity is based on “faulty learning”.
- All knowledge (e.g., professional world models) is necessarily incomplete, faulty, and biased; hence students, patients, and clients receive faulty knowledge and work products from their teachers, doctors, and lawyers. Errors in legal and some medical matters can be arbitrated in the courts. Errors in medical matters may result in harm to patients. Errors in education are more insidious (despite the Scopes Trial).
- Of course, this is also true of our personal world models. We should accept that our world models are best effort – based on a long history of knowledge and experience but faulty and biased – and worthy of reexamination based on meaningful new evidence.
- The best we can do is to recognize the limitations of our facts and knowledge and question or investigate those things that are most vital to the well-being of those that we care for and to our world.
The good news is that evidence-based means are emerging to support the Doubt and Verify approach for most domains.
Doubt Everything: Enter Evidence-Based Reasoning
How does one verify (i.e., prove the probabilistic likelihood of) a potentially questionable fact or conclusion (theory), e.g., information from teachers or professionals? The answer was empiricism, used to establish causality, i.e., WHY a phenomenon occurs. Following the erstwhile modern Scientific Method (a.k.a. the Third Paradigm of Scientific and Engineering Discovery) and appealing to community-accepted norms, you conduct an experiment or a Randomized Clinical Trial (RCT) to verify the hypothesis in question by establishing, in all probability, which circumstances produce the hypothesized phenomenon.
This too is changing, in two steps. First, in 2007 Jim Gray and others identified the emerging Fourth Paradigm [of Scientific and Engineering Discovery] (a.k.a. eScience) in which massive amounts of data and computational power are used to identify highly probable hypotheses or trends. This is data-driven or evidence-based scientific and engineering discovery that identifies significant evidence that some phenomenon, i.e., WHAT, has occurred.
Second, in the intervening years, there has been an evolution of Jim’s great insight. The emerging answer is a combination of evidence-based analysis (i.e., WHAT established by Data-Intensive Analysis a.k.a. Big Data Analytics) used to identify highly probable hypotheses followed by “old fashioned” empiricism to establish highly probable causality (i.e., WHY the phenomenon occurred). Just as the Scientific Method has been applied to domains outside of conventional science and engineering, e.g., the social sciences and the humanities, the emerging Fourth Paradigm is being applied to every human endeavor.
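The two-step pattern described above (data-driven WHAT, then empirical WHY) can be illustrated with a small simulation. The sketch below is built entirely on synthetic data: the hidden factor, the probabilities, and the effect sizes are hypothetical values chosen for illustration. In this invented world a "treatment" has no effect at all, yet observational data shows a strong association because a hidden factor drives both who gets treated and the outcome. Randomizing the treatment, as an experiment or RCT would, removes the association.

```python
import random
import statistics


def mean_difference(treated, outcome):
    """Mean outcome of the treated group minus that of the untreated group."""
    yes = [y for t, y in zip(treated, outcome) if t]
    no = [y for t, y in zip(treated, outcome) if not t]
    return statistics.mean(yes) - statistics.mean(no)


def simulate(n, randomized, rng):
    """Hypothetical world: a hidden factor raises the outcome; the treatment does nothing."""
    treated, outcome = [], []
    for _ in range(n):
        hidden = rng.random() < 0.5
        if randomized:
            # Experiment: a coin flip assigns treatment, independent of the hidden factor.
            t = rng.random() < 0.5
        else:
            # Observation: people with the hidden factor choose treatment far more often.
            t = rng.random() < (0.8 if hidden else 0.2)
        # The outcome depends only on the hidden factor plus noise, never on t.
        y = (2.0 if hidden else 0.0) + rng.gauss(0, 1)
        treated.append(t)
        outcome.append(y)
    return treated, outcome


rng = random.Random(0)  # fixed seed so the run is repeatable
observed = mean_difference(*simulate(20000, False, rng))
trial = mean_difference(*simulate(20000, True, rng))
print(f"WHAT (observational data): treated minus untreated = {observed:.2f}")
print(f"WHY  (randomized trial):   treated minus untreated = {trial:.2f}")
```

The observational difference is large even though the treatment is inert, while the randomized difference stays close to zero. This is only a toy, but it shows why a pattern found in the data is a hypothesis to be tested rather than a causal conclusion.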
Due to its likely pervasive impact, we had better get this right. We have yet to establish that the results of this emerging Fourth Paradigm - Data-Intensive Analysis - are (probabilistically) accurate within measures of correctness and completeness. This often overlooked verification and the associated correctness measures will take a decade or more. Recall that the Third Paradigm evolved over hundreds of years and still has significant issues (e.g., P-values, reproducibility).
With a focus on correctness and completeness I define:
Data Science is a body of principles and techniques for applying data-intensive analysis to investigate phenomena, acquire new knowledge, and correct and integrate previous knowledge with measures of correctness, completeness, and efficiency of the derived results.
Do Teachers and Professionals Have to Lie?
For some time there has been a movement to apply evidence-based analysis to education to evaluate outcomes of educational methods. Happily, evidence-based methods can go much further. Treat all knowledge - all facts and theories - as hypotheses requiring evidence-based evaluation relative to specific hypotheses or models. What: does the phenomenon really occur? Why: what factors lead to the phenomenon?
Just as science has been taught empirically using experiments to explore scientific phenomena, so too can all topics for which data is available be verified using the emerging Fourth Paradigm, in which hypotheses can be deduced from data and then investigated empirically where conditions (i.e., ground truth) allow. However, the scale and variety of Big Data and the complexity of analytical models may require new measures of significance, correctness, and completeness, i.e., a 21st-century statistics.
When a serious medical condition arises, it might be helpful for my doctor to find significant evidence for the prognosis and then significant evidence for successful treatment plans. When a serious scientific question arises, e.g., man’s impact on global warming, significant evidence for What and Why is critical. When serious historical or political issues arise, e.g., the social and economic impact of immigration laws such as those currently being discussed worldwide, significant evidence for hypothesized outcomes could lead to more informed public debate and action. These are examples of how Data-Intensive Analysis can transform education, medicine, and law, or more broadly our world.
In the end, there is no truth, no ultimate ground truth, no lie-free utterances, as everything is contextual based on incomplete facts and knowledge. All world models are flawed. The best we can do is to recognize their limitations and search where resources are warranted and available for sufficient evidence for those things most critical to the well-being of those we love and of our planet.
The only novelty in these ideas is in their application to Big Data and Data-Intensive Analysis. These ideas have been at the heart of epistemology for hundreds of years, and have been studied by scientists under many terms, e.g., confirmation bias, and by psychologists also under many terms, including Family of Origin. Yet the outcome, the Fourth Paradigm, is novel: a new paradigm that we do not yet fully understand. While we may believe that a tree is real, can we say that our reality is the true reality when we can sense less than one ten-trillionth of the electromagnetic spectrum (see figure below)? If we sense it, does that make it real or true? If we cannot sense it, does it exist? Ultimately, the message here is how the Fourth Paradigm, supported by Data-Intensive Analysis, may change our world and most human endeavors. To get Data Science right we need to establish its fundamentals, hence the above definition.
Fig. 2. Humans can sense ONLY 10^-13 of the Electromagnetic Spectrum
For the past few years I have had the luxury of being able to think about Data Science in my professional and personal lives. Initially I was disillusioned to realize that while my mother, my teachers, and I believed that we had the greatest integrity and best intentions, i.e., we were right, my world model - my beliefs - was based, in part, on faulty learning and imperfect knowledge.
In both my professional and personal lives, this disillusionment turned into the excitement of exploration and discovery as I diligently applied doubt followed by verification to those things that matter to me. Therein lies the power of the Fourth Paradigm, of Data-Intensive Analysis and Data Science. When it matters, be open to questioning “known” facts and theories in your world model, use emerging evidence-based methods to meaningfully verify them, and, if need be, seek new facts and theories with more significant supporting evidence.
Doubt and Verify: power tools for an amazing, emerging new paradigm for understanding our world.
- David Eagleman: Can we create new senses for humans? TED2015, March 2015
- Jim Gray on eScience: a transformed scientific method, in Tony Hey, Stewart Tansley, Kristin M. Tolle (Eds.): The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research 2009 ISBN 978-0982544204
- Kuhn, Thomas S. The Structure of Scientific Revolutions. 3rd ed. Chicago, IL: University of Chicago Press, 1996.
- M. Guzdial. 2015. Bringing evidence-based education to CS. Commun. ACM 58, 6 (May 2015). DOI:http://dx.doi.org/10.1145/2783419.2754947