On Political Economy and Data Science: When A Discipline Is Not Enough
Most non-trivial Data Science applications are interdisciplinary, requiring collaboration across disciplines. We are just beginning to understand the nature of interdisciplinarity in Data Science and the risks of misunderstanding it.
By Michael L. Brodie, CSAIL, MIT
When I was an undergraduate student at Trinity College, Toronto (much like Oxford and Cambridge), my Don was the genteel, extremely intelligent Professor Ashley, an Oxford-trained political economist and wise octogenarian who lamented the separation of politics from economics. At that time the separation seemed abstract. Now it seems critical.
During my recent travels in Russia and China I was immersed in the economics and politics from the present day back to Czarist Russia and, in China, back to the 21st century BC. A recurring, simple theme was that every major historical phenomenon (e.g., war, voyages of discovery, slavery, political movements) is political, almost always in support of an economic objective, e.g., opening or protecting trade routes, or obtaining resources, including people, land, and riches. Professor Ashley’s lament became crystal clear: politics and economics are entirely interdependent. Considering one without the other is short-sighted and risks misunderstanding. According to the web, “Political economy is the interplay between economics, law and politics, and how institutions develop in different social and economic systems, such as capitalism, socialism and communism.”
In the 18th century political economy became a discipline with leaders like Adam Smith at Glasgow University. While political economy remains a discipline at a small number of schools, including Glasgow, most schools separated politics and economics in the 19th century and treat political economy as a multi-disciplinary subject as opposed to a domain with multiple interdependent components, leading to learning more and more about less and less. Harvard’s Dept. of Economics has an interdisciplinary Political Economics group. Harvard’s graduate school has a PhD Program in Political Economy and Government but appears to generally treat politics and economics separately.
Data Science, my current passion, is multi-disciplinary in much the same sense. In Data Science you cannot perform any of the core activities independently without risk of error: data discovery; data curation; data management; data analytics (e.g., statistics, ML) and selecting the related algorithmic techniques (e.g., expectation-maximization, principal component analysis, clustering, gradient descent, and hundreds more); and domain interpretation (genetics, biology, cosmology, … every human endeavor).
While one might learn, teach, or research political and economic topics separately (as one studies the data science components separately), their application must be interrelated. This has always been the case in political economy due to its inherent complexity. For the past half-century conventional databases have been used for business data processing, which has relatively simple semantics compared with the far more complex (i.e., semantically richer) Data Science applications in real contexts such as particle physics, genetics, drug discovery, and medicine. One might argue that mistakes in business (inventories, billing, payroll) are non-critical, not life-threatening, and small in scale, yet they are extremely common, as I have seen in my 20+ years as chief scientist of a Fortune 20 company. Even those simpler semantics pose insoluble integration problems.
Solutions for realistic (non-trivial) Data Science applications are far more complex, and errors have potentially far greater impacts, possibly life-threatening. As in history, where politics, economics, and other aspects should be considered in an interdisciplinary fashion, so too in Data Science must data discovery, data curation, data management, analytics/algorithmic techniques, domain interpretation, and other aspects be considered in some interdisciplinary fashion that has yet to be understood. (Note that the first operations in the data science pipeline refer to “data” whereas the later analytical and interpretation operations typically deal with “evidence”. The essence of evidence can be destroyed when it is treated merely as data.)
As a data management person I see that the role of data in business data processing is remarkably simple in contrast to the role of data in the far more complex contexts of non-trivial Data Science applications.
Over the past two years I have seen the interdependence of Data Science sub-disciplines in over 30 Data Science use cases, all of which were at extreme scale (studied to see where things might break). A result that was non-intuitive for me came from use cases in which the data management team, in curating and analyzing massive data sets, negatively impacted the potential analytical results of the scientists for whom they prepared the data sets, e.g., producing spurious results and weakening the guarantees on confidence levels. Why? The answer concerns the emerging understanding of data reuse, which involves statistics above my pay grade. This is precisely an example of Interdisciplinarity. While I do not understand the details of data reuse, in preparing data for analytical use I must understand it well enough not to negatively impact the analyst’s task.
World leaders in statistics, data reuse, and collaboration (Michael Jordan; Cynthia Dwork; Laura Haas’ Accelerated Discovery team, including Peter Haas, Scott Spangler, and Vitaly Feldman; and others) are just beginning to understand this interplay and dependence. Only a handful of data analysts worldwide are beginning to understand data reuse. Prof. Jennie Duggan of Northwestern University and I were surprised by this, and delighted to be introduced to it, by Vitaly Feldman of IBM’s Accelerated Discovery Lab. Whew! It is (for me) mathematically non-intuitive, while being very important and informally obvious. (The conservation of information and Hawking radiation is an interesting, related 40-year conundrum in cosmology.)
My very informal intuition is that an unexplored data set has only so much information to give. As you explore the data set, there is less (novel) information to be gained. Vitaly’s gift to us was an insight into the importance of Interdisciplinarity of which this was merely one example.
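That intuition can be made concrete with a small simulation. The following is a minimal sketch, not the analysis by Dwork, Feldman, and their co-authors and not their reusable holdout mechanism: it generates pure noise, uses the data once to pick the features that look most predictive, and then evaluates the resulting classifier both on the same (reused) data and on fresh data. The function names, the sizes `n`, `d`, `k`, and the seed are illustrative assumptions.

```python
import random


def sign(v):
    """Return +1.0 for non-negative values and -1.0 otherwise."""
    return 1.0 if v >= 0 else -1.0


def make_noise(rng, n, d):
    """Draw n rows of d Gaussian features and n random +/-1 labels.

    Features and labels are independent, so no classifier can truly
    do better than 50% accuracy on this data.
    """
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [rng.choice([-1.0, 1.0]) for _ in range(n)]
    return X, y


def accuracy(X, y, features, weights):
    """Fraction of rows where the sign of the weighted feature sum matches the label."""
    hits = 0
    for row, label in zip(X, y):
        score = sum(w * row[j] for j, w in zip(features, weights))
        hits += sign(score) == label
    return hits / len(y)


def adaptive_reuse_demo(n=200, d=500, k=20, seed=0):
    rng = random.Random(seed)
    X, y = make_noise(rng, n, d)

    # First use of the data: estimate how each feature co-varies with the label
    # (mean of x*y) and keep the k features with the largest apparent association.
    assoc = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(d)]
    top = sorted(range(d), key=lambda j: -abs(assoc[j]))[:k]
    weights = [sign(assoc[j]) for j in top]

    # Second use of the SAME data: evaluate the classifier we just built from it.
    reused = accuracy(X, y, top, weights)

    # Honest check on data that played no part in the selection.
    X_new, y_new = make_noise(rng, n, d)
    fresh = accuracy(X_new, y_new, top, weights)
    return reused, fresh


reused_acc, fresh_acc = adaptive_reuse_demo()
print(f"accuracy on reused data: {reused_acc:.2f}")
print(f"accuracy on fresh data:  {fresh_acc:.2f}")
```

Under these assumptions the reused data should report an accuracy well above chance while the fresh data stays near 50%. The apparent signal comes entirely from asking the same data a second question after it has already shaped the first answer.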
More generally I suspect that most significant issues are multidisciplinary (i.e., cannot be adequately understood using a single discipline) and require special interdisciplinary methods that one learns by doing, as did Scott Spangler, the PI on the amazing Watson-Baylor results. It seems unlikely that one person might be expert in all relevant disciplines so as to be able to solve these rich multidisciplinary problems. Hence, we must learn collaborative, interdisciplinary skills. A new, significant challenge for the data management community is to learn to collaborate across the disciplines that it serves. My use case studies suggest that these techniques are initially use case- or domain-specific, not necessarily transferable to other use cases or domains until we learn a great deal more. Presumably the same is true of the other sub-disciplines of Data Science.
With thanks to Vitaly Feldman for his comments and corrections. He made the following observation: “I imagine that the need for having both domain specific expertise and statistics/ML expertise to get the best results (in terms of extracting the most knowledge) is fairly well recognized. What is perhaps less well recognized is that one needs to be interdisciplinary even to get valid results. And this is still before we add database curators to the loop.”
- C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A.L. Roth. “The reusable holdout: Preserving validity in adaptive data analysis,” Science, vol. 349, no. 6248, pp. 636–638, Aug. 2015.
- C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A.L. Roth. “Preserving Statistical Validity in Adaptive Data Analysis,” STOC ’15: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, ACM, 2015. DOI: http://dx.doi.org/10.1145/2746539.2746580
- M. Nagarajan et al. 2015. Predicting Future Scientific Discoveries Based on a Networked Analysis of the Past Literature. SIGKDD 2015, ACM, 2019–2028. DOI: http://dx.doi.org/10.1145/2783258.2788609
- S. Spangler et al. 2014. Automated hypothesis generation based on mining scientific literature. SIGKDD 2014, ACM Press, 1877–1886. DOI: http://dx.doi.org/10.1145/2623330.2623667