The Data Science Delusion

Gleanings from technical misunderstandings observed between business leaders and data scientists (and among data scientists themselves), misunderstandings so dramatic that one starts to wonder whether something is wrong with data science as it is practiced.

The Delusions

We can now locate the delusional circumstances in data science within the above landscape. Broadly, these delusions fall into two categories: (i) between-quadrant effects: delusions due to a mismatch between tasks and the people assigned to them, and (ii) within-quadrant effects: delusions due to confusion intrinsic to a quadrant.

The illustrations cited are drawn from cases I have encountered, but sharpened and simplified to make them clearer.

Between-Quadrant Effects: The Delusion Matrix

Labeling tasks and resources (people) by the quadrant they belong to (Q1–4 tasks and Q1–4 resources) brings us to a confusion matrix, or what could be called a “delusion matrix”.

Figure 3: The Data Science “Delusion Matrix” (rough representation; see text for details)

1. Lipstick on a Pig: Q3/Q4 tasks and Q1/Q2 resources

This effect manifests itself when a generalist, often inadvertently, steps out of his zone of competence.

Illustration: Consider the sentiment-tagging task again. A Q1 resource takes an off-the-shelf model trained on movie reviews and applies it to a new task (say, tweets about a customer-service organization). The business is so blinded by spectacular charts [14] and anecdotal correlations (“Look at that spiteful tweet from a celebrity … so that’s why the sentiment is negative!”) that questions about predictive accuracy are rarely asked until, a few months down the road, the model is obviously floundering. Even then, there is rarely anyone to challenge the assumptions, biases and confidence intervals (Does the language in the tweets match that of the movie reviews? Do we have enough training data? Does the importance of tweets change over time?).
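The first of those questions, whether the tweet language matches the review language, can at least be screened before anyone trusts the charts. A minimal sketch of such a check in pure Python, with hypothetical in-line snippets standing in for the two corpora:

```python
from collections import Counter

def vocab_overlap(train_texts, new_texts, top_k=5000):
    """Fraction of the new domain's frequent tokens that the training
    domain has seen at all -- a crude domain-shift alarm, not a test."""
    def top_tokens(texts):
        counts = Counter(tok for t in texts for tok in t.lower().split())
        return {tok for tok, _ in counts.most_common(top_k)}
    train_vocab = top_tokens(train_texts)
    new_vocab = top_tokens(new_texts)
    if not new_vocab:
        return 0.0
    return len(new_vocab & train_vocab) / len(new_vocab)

# Hypothetical stand-ins for the two corpora.
reviews = ["a gripping plot and superb acting",
           "the director delivers a dull sequel"]
tweets = ["@acme ur support line is a joke #fail",
          "waited 2 hrs on hold smh"]

overlap = vocab_overlap(reviews, tweets)
if overlap < 0.5:  # the threshold is a judgment call, not a standard
    print(f"Domain-shift warning: only {overlap:.0%} vocabulary overlap")
```

The threshold and the tiny corpora are placeholders; the point is only that a cheap, automatable alarm exists for this class of question, so the red flag need not wait for months of floundering.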

Overheard: “Survival analysis? Never heard of it … Wait … There is an R package for that!”

A variety of machine learning algorithms can be deployed nowadays at the click of a button, producing at least superficially appealing outputs. Robust testing practices can raise a red flag when a move from Q1/Q2 to Q3/Q4 has been made, at which stage a more specialized data scientist can be brought in. It may be too much to ask, but if the Q1/Q2 resource could say “I don’t know” when he or she is in uncertain territory, a more timely intervention may be possible. There should be no disgrace in saying so, but unfortunately there is, for a data scientist by definition knows nearly everything.

2. The Tyranny of Low-Hanging Fruit: Q1/Q2 tasks and Q3/Q4 resources

A company hires specialized data scientists (Q3/Q4 resources) who expect to do at least some science, but all the business really needs, or has the data and appetite for, are simple heuristics or manual processes [23]. The result is a bunch of frustrated data scientists.

Illustration: A marketing team wants to rank prospects for a certain digital content subscription offering (lead scoring). The initial dataset, with around 50,000 leads, has only two attributes: the amount of data used by the lead in a trial (in MB) and the phone/device used (e.g., iPhone 6, iPad Mini). An analyst from the marketing team has fiddled with the ranges of the two attributes to come up with a model in Excel, which the head of marketing is thrilled with. This bliss is unaffected by warnings from the data scientists that the model is both brittle and non-scalable.
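The brittleness warning need not stay abstract: a holdout check shows how a hand-tuned rule generalizes beyond the rows it was tuned on. A minimal sketch, with a hypothetical stand-in for the Excel model and synthetic leads (none of this is the marketing team's actual data or rule):

```python
import random

def excel_rule(mb_used, device):
    """Hypothetical stand-in for the analyst's hand-tuned Excel ranges."""
    return mb_used > 500 or device.startswith("iPhone")

def holdout_accuracy(rule, leads, labels, test_frac=0.3, seed=0):
    """Accuracy of a fixed rule on a randomly held-out slice of the data."""
    idx = list(range(len(leads)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    test_idx = idx[cut:]
    hits = sum(rule(*leads[i]) == labels[i] for i in test_idx)
    return hits / len(test_idx)

# Synthetic leads: (MB used in trial, device) and a converted-or-not label.
rng = random.Random(42)
leads = [(rng.randint(0, 2000), rng.choice(["iPhone 6", "iPad Mini"]))
         for _ in range(200)]
labels = [rng.random() < 0.3 for _ in leads]

print(f"holdout accuracy: {holdout_accuracy(excel_rule, leads, labels):.2f}")
```

A rule whose ranges were fiddled into place on the full spreadsheet will often look far worse on the held-out slice; that is the cheapest version of the argument the data scientists were trying to make.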

Overheard: “You want me to fit a model to these 12 data points?”

Of course there are times when a business only needs simple heuristics. And there are times when opportunity costs and time-to-market constraints mean that only simple heuristics can be afforded. Both the organization and data scientists need to be aware of such circumstances before the hiring is done. However, constraints or not, bad science remains bad science and has its own costs.

3. The Wonderful Wizards: Q4 tasks and Q3 resources

These situations generally occur when the company hires data scientists out of a fear of missing out rather than with a specific problem in mind [24]. A hands-off attitude — “here’s the data, now do some data science magic” — plagues these projects. When business managers fail to communicate crucial domain insights, the projects drag on in the exploratory stage for much longer than necessary, with the consequence that most of the initial enthusiasm for the project is lost. And if these managers do not participate in the exploration and definition of the problem, but are eager to play both Judge and Jury, one may be sure they are only waiting to don the Executioner’s robes.

Illustration: The asset management department of an organization wants to examine records of assets and inventory to identify anomalies — it is suspected that many records suffer from incorrect classification. For example, a desktop computer is occasionally classified as “office stationery”. The data science team is handed millions of records with the item name (something like “Dell Optiplex 2020”), description and classification. One obvious approach to the problem is to cluster the descriptions and look for classifications that seem out of place (another approach would be to search for item categories on the web, but this would work only when the item names are clean, and even then it could be quite complicated for smaller items). But the descriptions, as keyed in by the purchase assistant, have tremendous variety, with abbreviations and spelling mistakes in abundance (“desktop computer”, “dekstop”, “computer”, “pc”, “DT computer” etc.), and it is easy to see how these could be confused with other item categories (“computer paper”, “desktop supplies”, etc.). The domain experts in the department neglect to inform the data scientists that all company items are linked through their item code to their prices in another database, one the data science team can access.
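The clustering idea can be caricatured in a few lines. The sketch below substitutes a toy normalization map and token-majority voting for real clustering, with hypothetical records; a production version would need fuzzy matching and a far larger synonym table at the very least:

```python
from collections import Counter, defaultdict

# Hypothetical normalization of the messy free-text descriptions.
SYNONYMS = {"dekstop": "desktop", "pc": "desktop", "dt": "desktop",
            "computer": "desktop"}

def normalize(description):
    """Map abbreviations and misspellings onto canonical tokens."""
    return frozenset(SYNONYMS.get(t, t) for t in description.lower().split())

def flag_suspects(records):
    """records: list of (item_name, description, classification) tuples.
    Flags records whose classification disagrees with the majority
    classification among records sharing a normalized description token."""
    by_token = defaultdict(Counter)
    for _, desc, cls in records:
        for tok in normalize(desc):
            by_token[tok][cls] += 1
    suspects = []
    for name, desc, cls in records:
        for tok in normalize(desc):
            majority, count = by_token[tok].most_common(1)[0]
            if majority != cls and count > by_token[tok][cls]:
                suspects.append((name, cls, majority))
                break
    return suspects

records = [
    ("Dell Optiplex 2020", "desktop computer", "IT equipment"),
    ("HP EliteDesk", "dekstop", "IT equipment"),
    ("Lenovo M710", "DT computer", "office stationery"),  # misfiled
]
print(flag_suspects(records))
```

Even this toy version makes the dependence obvious: the whole approach stands or falls on how well the noisy descriptions can be normalized, which is exactly where the withheld price-database link would have helped.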

Overheard: “Just give them a dump of all the data. They will use data science to make the insights pop out.”

The excessive mathematization of finance [25] and projects such as Google Flu Trends [22] can be seen as high-profile illustrations of the myth of the wizards. These failures also show that the blame does not always attach to the domain experts; often it lies with data scientists who fail to question their own assumptions about a complex system (such as financial markets, the weather, or human behaviour).

The broader point is that exploratory projects often involve significant leaps of faith. Domain experts are best placed to judge whether these are reasonable.

Within-Quadrant Effects

4. Shibboleths: the Q3 Hodgepodge

The broad definition of data science has made teams much more heterogeneous, which is no doubt a good thing in many ways. The downside is that there are many misunderstandings due to the differing backgrounds. Firstly, disciplines differ in terminology: what is an observation model for one may be a sensor model for another; the term kernel may evoke a different concept for each; when a statistician mentions covariates a machine-learning scientist thinks of features, and when the former mentions hypothesis-testing the latter thinks of a quick escape. Secondly, disciplines differ in what they place emphasis on: system-building vs. prediction vs. inference, and so on [13]. These issues can affect not only day-to-day interactions but also hiring decisions.

Illustration: An economist applies to a data science team with a predominance of computer scientists. She has completed her postgraduate thesis on how certain trade policies affect the economy, followed by a couple of years of quantitative work in the industry. Her interview focuses almost entirely on how much coding effort her thesis involved, followed inevitably by a rejection.

Overheard: “I can code that entire thesis in two weeks.”

5. Technology Giveth And Technology Taketh Away: Shrinking Q1

The opportunities created by technology are at risk of being eroded by the next stage of its evolution. Most Q1 problems can be solved today by push-button software (once the data is in the right place and in the right format). And as awareness of machine-learning techniques grows among business analysts and managers, greater automation will help them take over the data scientists’ shrinking role [26].

To what extent will data science be automated? Though opinion is divided on this question [27, 28], the consensus seems to be that expert-level (say Q3) roles will remain relevant for the next few years at least, though practitioners of deep learning may disagree [21].

Data scientists, whose jobs have been reduced to chaperoning a tool, may find it useful to reskill themselves.

6. They Get the Data, You Do the Science: Q1/Q3 Fragmentation

This delusion applies more to the computing-skills dimension, which is not represented in the delusion matrix (these are the octants where strong computing skills are required but little domain or modeling expertise).

Once the realization sets in that hiring a unicorn data scientist is next to impossible, it is possible to go too far in the opposite direction: hire one person (or team) for data ingestion, one for data manipulation, one for data integration, one for data modeling, and finally one for statistical analysis and machine learning. While such compartmentalization may be essential for taking data science tools to production, it is downright harmful in the exploratory stage. If the hiring manager is looking for “Sqoop” and “Informatica” experts before there is a problem to be solved, he or she may be bringing back the very red tape which data and algorithm democratization were supposed to cut through [29]. The data scientist will spend half his time waiting for the right data to be ingested, and then waste most of the remaining time reprocessing the data to fit the requirements of the algorithm.

The reason is that the tools for data science often need to be matched carefully to the task at hand. A wide range of tools and libraries (e.g., on top of Hadoop/Spark) may have to be explored before choosing the best one, and investing in the full cycle of production activities for each would be very inefficient. For instance, when processing time-series (TS) data, whether we expect to keep one TS in memory at a time or slices of many TS together would drive the choice of library.

A Personal Takeaway

It is possible to derive a set of dos and don’ts based on the above delusions (e.g., do not hire data scientists unless you have a problem to solve, do guard against too much homogeneity in your data science group, etc.), though I would hesitate to frame these issues in such stark terms outside of any context. Hopefully awareness of these delusions can help reduce some of the teething trouble that data science as a discipline faces in the industry.

As someone still struggling to navigate the data science journey, I think three points are worth stressing:

(i) To a researcher, data science brings wonderful opportunities to do interdisciplinary work at a much faster rate than usual. However, the skills demanded of a data scientist can only be honed over a long period of time [7]. While technological advances make it easy to be lulled into a false sense of expertise, the truth is that each domain, each subfield and each tool demands a period of internalization before a data scientist can handle them with confidence — vita brevis, ars longa.

(ii) There is plenty of work in the data science industry that demands Data Jugglery, Jugaad and Jujitsu [30], and not much else. These skills are extremely valuable, but unless they are accompanied by some core scientific work, a researcher should look at such jobs with suspicion.

(iii) Great data science work is being done in various places by people who go by other names (analyst, software engineer, product head, or just plain old scientist). It is not necessary to be a card-carrying data scientist to do good data science work. Blasphemy it may be to say so, but only time will tell whether the label itself has value, or is only helping create a delusion.

Thanks to various colleagues, past and present, with whom I have shared enlightening conversations on this topic.

[1] DJ Patil and Hilary Mason. Data Driven: Creating a Data Culture. O’Reilly, 2015.
[2] Thomas H Davenport. Competing on analytics. Harvard Business Review, 84(1):98, 2006.
[3] Drew Conway. The data science venn diagram., 2010.
[4] Karl Broman. I am a data scientist., 2016.
[5] DJ Patil/Chau Tu. 10 questions for the nation’s first chief data scientist., 2016.
[6] Sophie Chou, William Li, and Ramesh Sridharan. Democratizing data science. KDD, 2014.
[7] Vincent Granville. Fake data science., 2013.
[8] Robin Bloor. A data science rant., 2013.
[9] Gil Press. Data science: What’s the half-life of a buzzword., 2013.
[10] Sophie Chou. What can be achieved by data science., 2014.
[11] Karl Broman. Data science is statistics., 2013.
[12] Andrew Gelman. Statistics is the least important part of data science., 2013.
[13] Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statist. Sci., 16(3):199–231, 2001. doi: 10.1214/ss/1009213726.
[14] Larry Wasserman. Data science: The end of statistics., 2013.
[15] Peter Naur. The science of datalogy. Commun. ACM, 9(7):485, July 1966. ISSN 0001-0782. doi: 10.1145/365719.366510.
[16] Peter Naur (Wikipedia)., 2016.
[17] Cathy O’Neil. Statisticians aren’t the problem for data science. The real problem is too many posers., 2012.
[18] Michael Mout. What is wrong with definition of data science. /2013/12/what-is-wrong-with-definition-data-science.html, 2013.
[19] Michael Hochster. What is data science (quora)., 2014.
[20] Michael Li. Two types of data scientists: Which is right for your needs?, 2015.
[21] George Leopold. Machine learning tools to automate data science., 2015.
[22] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. The parable of google flu: Traps in big data analysis. Science, 343(6176): 1203–1205, 2014.
[23] Greta Roberts. Stop hiring data scientists if you’re not ready for data science., 2015.
[24] Yanir Seroussi. You don’t need a data scientist yet., 2015.
[25] Nicolas Bouleau. On excessive mathematization, symptoms, diagnosis and philosophical bases for real world knowledge. Real World Economics, 57:90–105, 2011.
[26] Barb Darrow. Data science is still white hot, but nothing lasts forever., 2015.
[27] Gregory Piatetsky. Data scientists automated and unemployed by 2025? /2015/05/data-scientists-automated-2025.html, 2015.
[28] Sandro Saitta and Nestle Nespresso. Data science automation: Debunking misconceptions. /2016/08/data-science-automation-debunking-misconceptions.html, 2016.
[29] Sunile Manjee. Pluralism and secularity in a big data ecosystem., 2015.
[30] DJ Patil. Data Jujitsu: The Art of Turning Data into Product. O’Reilly Media, 2012.

Bio: Anand Ramanathan is a computer scientist specializing in natural language processing and machine learning. He has worked on data science and analytics products for a variety of domains, including finance, oil & gas, and language solutions.

Original. Reposted with permission.