Data Science is Not Becoming Extinct in 10 Years, Your Skills Might

4 reasons why data science is here to stay and what you need to do to ensure that your skillset stays in demand.



By Ahmar Shah, PhD, Scientist, Academic (Data Science in Healthcare)




 

As someone who has worked in data science for over a decade, I find it frustrating to see people prophesying that the field will become extinct in 10 years. The typical reason given is that emerging AutoML tools will eliminate the need for practitioners to develop their own algorithms.

I find such opinions especially frustrating because they dissuade beginners from taking data science seriously enough to excel in it. Frankly, it is a disservice to the data science community to see such prophecies about a field where demand is only going to increase!


Why would any sane person invest their finite time and energy in learning something that will become extinct soon?


Let me tell you something. If there is one field you have the best chance of truly retiring in, it is data science. Period. I will give you four key reasons why data science is not becoming extinct anytime soon. I will then give you my advice on how to ensure that you stay on the right side of data science in 10 years.

Data science will not become extinct, but if you don't keep pace with it, your skillset might. Let's dive in.

 

1. Data Science Has Been Around for Centuries

 
Let’s start with science. I don’t have to convince you that science has been around for centuries. The essence of science is learning from data. We observe things in the world (collect data) and then we create a model (traditionally called a theory) that can summarise and explain those observations. We create these models to help us solve problems.

The essence of data science is precisely the same: collecting data, learning from it by creating models, and then using those models to solve problems. Over the years, various disciplines have developed and refined tools that do this. Depending on the focus of the field, different names have been used to describe this set of tools and procedures. The term that has currently gained the most traction is data science.

However, the difference between previous times and now is the volume of data and the computational power available to us. When we had a few data points and only a few dimensions, it was possible to put them down on paper and fit a straight line (a regression) or spot patterns by hand. Now, we can cheaply collect vast amounts of data, with many features, from multiple sources. It is simply not feasible to fit a line (or find clusters) by hand when you have a very large number of data points and dimensions.
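
To see the difference in scale, here is a minimal sketch (with synthetic data, and assuming NumPy and Scikit-learn are available) of fitting a regression to far more points and dimensions than anyone could handle on paper:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100,000 points with 50 features each -- far beyond
# anything we could plot and eyeball on paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))
true_coefs = rng.normal(size=50)
y = X @ true_coefs + rng.normal(size=100_000)  # noisy linear signal

# Fitting the "line" is still a one-liner for the machine.
model = LinearRegression().fit(X, y)
print(model.coef_[:5])  # close to true_coefs[:5]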


If the practice of collecting data and developing models to explain it has been around for centuries, why do you think it will become extinct in the next 10 years?


If anything, we will collect even more diverse kinds of data, and we will need new ways of combining them creatively to solve problems.

 

2. Developing Models is a Very Small Part of a Real-World Project

 
Several tools under the umbrella of "Automated Machine Learning" are gaining traction, and some of them will likely help democratize data science. However, most of these tools accelerate the testing and implementation of different algorithms on already cleaned data.

But the ability to get clean data into a model is not trivial at all.

In fact, several data science surveys have pointed out the disproportionate amount of time data scientists spend collecting and cleaning data. For example, the annual survey by Anaconda (one of the leading Python distributions used by data scientists) found that data scientists spend 66% of their time on data loading, cleaning, and visualization, and only 23% on model training, selection, and scoring. My personal experience of working in the field for over a decade is similar.
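
As a flavour of where that 66% goes, here is a minimal sketch of a typical pipeline; the file name and column names are hypothetical, but the shape of the work is representative:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Loading, cleaning, and reshaping: the bulk of the work.
df = pd.read_csv("patients.csv")                        # hypothetical dataset
df = df.drop_duplicates()
df["age"] = pd.to_numeric(df["age"], errors="coerce")   # fix typed-in ages
df["sex"] = df["sex"].str.strip().str.lower().map({"m": 0, "f": 1})
df = df.dropna(subset=["age", "sex", "outcome"])        # drop unusable rows
X = df[["age", "sex"]]
y = df["outcome"]

# Model training and scoring: a couple of lines at the end.
model = LogisticRegression().fit(X, y)
print(model.score(X, y))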

Learning how algorithms work under the hood and understanding their nuances is not trivial, and many online courses rightly spend time explaining these. However, such a focus on algorithms creates the illusion that data science is all about models. Many experienced practitioners are beginning to push back on the over-emphasis on models at the expense of data cleaning. Andrew Ng (a leading expert in the field) has been encouraging the data science community to move to a data-centric approach, as opposed to the model-centric approach that most of us currently take in data science projects. In his deeplearning.ai newsletter, he states:


It’s a common joke that 80 percent of machine learning is actually data cleaning, as though that were a lesser task. My view is that if 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.


This situation is further exacerbated by websites like Kaggle, where participants are given clean data and the task is confined to developing models that maximize pre-identified performance metrics. (Kaggle is awesome for what it is!)

A real-world project rarely starts from carefully cleaned data or a well-defined problem. In most projects, we wouldn't know a priori which features are going to be relevant, how frequently to collect data, or what the right question to answer even is. Welcome to the real world!

The emergence of new automated tools will continue to make the implementation of different models easy and accessible. However, they won't be able to resolve the issues that are genuinely challenging in real-world projects. Many of these issues are context-dependent and not ripe for automation.

 

3. Real-World Data Science Projects Need Iterative Development

 
Perhaps driven by the hype around data science, I have been in situations where people approached me saying that they have data and want me to apply "data science" to solve their problem (which may not be clearly defined either). I suspect many people who are not data scientists view it as a kind of magic: a tool into which you feed data on one side and get answers out the other.

Far from it. Real projects have trade-offs that need to be balanced. This requires an iterative approach, where an initial model is deployed first, and its performance is then monitored as more data is collected for further refinement.

Any deployed model is only useful if it is used as intended, and this isn't guaranteed. There needs to be a skilled human in the loop who can monitor and diagnose the use of the deployed model and come up with appropriate refinements. The monitoring part will not necessarily be automated or even quantitative: very unexpected and weird things can happen that you may not predict.
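
As an illustration of what that human-in-the-loop monitoring might look like, here is a minimal sketch; the baseline figure, tolerance, and idea of weekly batches are assumptions for illustration, not a prescription:

from collections import deque

BASELINE_ACCURACY = 0.90         # accuracy measured at deployment (assumed)
TOLERANCE = 0.05                 # how far we let it slip before escalating
recent_scores = deque(maxlen=8)  # e.g. the last eight weekly batches

def record_batch(y_true, y_pred):
    """Score a freshly labelled batch and flag drift for human review."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    recent_scores.append(correct / len(y_true))
    rolling = sum(recent_scores) / len(recent_scores)
    if rolling < BASELINE_ACCURACY - TOLERANCE:
        print(f"Rolling accuracy {rolling:.2f} is well below baseline -- "
              "escalate to a human for diagnosis.")
    return rolling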

 

The London Metropolitan Facial Recognition System

 
Not long ago, London's Metropolitan Police tested a real-time facial recognition system. Cameras would scan people in shopping malls and public squares, extract facial features, and compare them with suspects on a watch list. The system would then display any matches for officers to review and decide whether a suspect should be stopped (and, in some cases, arrested). An independent report on the operation of the system raised significant concerns and highlighted several limitations: of the 42 suspects the system identified over 6 trials, only 8 (a meager 19%) turned out to be correct matches.
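
That headline number is simply the system's precision: of everyone it flagged, how many were genuine matches. With the figures from the report:

# Precision from the report's figures: 8 correct matches out of 42 flagged.
true_positives = 8
flagged = 42
print(f"Precision: {true_positives / flagged:.0%}")  # 19%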

There are numerous documented examples of data science algorithms being biased in ways that render them inadequate and in need of further development. As things stand, we are not even at a stage where models are widely deployed and used. We therefore do not have enough examples of models drifting or going awry to automate the handling of such failures. The best we can do so far is identify issues once models are deployed (e.g. in banking, healthcare, and policing).

This is the state of the art. We develop and deploy models, but they turn out to be inadequate and not fit for purpose. We are only now seeing the early consequences of using inappropriate models. Is there any automated solution to deal with this yet? None!

Even manually, we are being challenged!

 

4. Data Science is SCIENCE for a Reason

 
This is my favorite point. Mundane, repetitive, cognitively undemanding tasks have been at risk of automation for a while now. However, such disruption has only led to more jobs that require human creativity and problem-solving. Our memories suck, but we humans are remarkably good at identifying patterns to solve problems.


“Your mind is for having ideas, not holding them.” David Allen


Data science is science for a reason. It's about solving problems: problems that we face and that require creative, ingenious solutions. We shine at precisely that, and it is a deeply desirable skill. The use cases of data science are only going to increase, simply because we are collecting more data and have more computational power to run complex mathematical operations on small chips.

Let me show you how ridiculously trivial it is to implement the most well-known machine learning algorithms these days.

Imagine you already have a carefully cleaned input variable (X) and output variable (Y), ready to go into a model. Using Scikit-learn (a well-known, open-source machine learning library in Python), we can fit a decision tree with the following two lines of code:

from sklearn import tree
clf = tree.DecisionTreeClassifier().fit(X, Y)


We can implement Support Vector Machines with the following two lines of code:

from sklearn import svm
clf = svm.SVC().fit(X, Y)


Do you see the pattern? All we need to do is change the class name, and there you have your model. Real data scientists aren't sitting around re-implementing these algorithms from scratch; in industry, they use a mature library such as Scikit-learn.
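
To make the point concrete, here is a minimal sketch on a toy dataset (the iris data that ships with Scikit-learn): the same two-line recipe works for any estimator, and swapping models really is just swapping a name.

from sklearn import datasets, ensemble, svm, tree

# A toy, pre-cleaned dataset -- exactly the situation described above.
X, Y = datasets.load_iris(return_X_y=True)

# Swap the model by swapping the class name; nothing else changes.
for Model in (tree.DecisionTreeClassifier, svm.SVC,
              ensemble.RandomForestClassifier):
    clf = Model().fit(X, Y)
    print(Model.__name__, clf.score(X, Y))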


But do you really think that most data scientists are doing just this, and getting hired for this skill? Changing one word in the model, hitting run, and reporting the results? NO!


However, if this is all you focus on as a data scientist, it won't be long before demand for this skillset alone becomes extinct.

Implementing a model is something most people could do if they know the tools, and it is easy to get people trained. The hard parts are:

  • knowing when to use a certain tool
  • understanding why a certain tool doesn’t perform well
  • knowing what steps may help improve performance
  • judging which trade-offs matter in a given problem (see the sketch after this list)
  • having the insight to link all of the above with the overall objective
  • having the communication skills to engage with domain experts
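
As a small, hedged sketch of the first four points (on a toy dataset that ships with Scikit-learn), cross-validating two candidate models against more than one metric shows why "the best model" depends on which trade-off your problem actually cares about:

from sklearn import datasets, svm, tree
from sklearn.model_selection import cross_val_score

# A toy binary-classification dataset for illustration.
X, Y = datasets.load_breast_cancer(return_X_y=True)

# The "right" model depends on which metric matters for the problem.
for clf in (tree.DecisionTreeClassifier(random_state=0), svm.SVC()):
    acc = cross_val_score(clf, X, Y, cv=5, scoring="accuracy").mean()
    rec = cross_val_score(clf, X, Y, cv=5, scoring="recall").mean()
    print(type(clf).__name__, f"accuracy={acc:.2f}", f"recall={rec:.2f}")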

The aforementioned skills are acquired by working on real-world, challenging projects. They take time, and the learning journey is cognitively demanding. However, such skills will only become more important as we collect even more data and face unique, industry-specific challenges, with more competition on the horizon (not less!).

The skills I listed above pertain to the timeless domain of problem-solving and creativity. These skills will continue to be highly sought-after because they can’t be automated.

 

Final Thoughts

 
You should by all means have a go-to tool that you learn, become proficient at, and understand inside and out as you gain experience. However, make sure you seize opportunities to work on challenging projects where you can exercise your creative and problem-solving skills.

Quit worrying about data science becoming extinct anytime soon. Such worries will only distract you from enjoying your journey, and you will approach the field with half-hearted conviction. If you fall for such doomsday prophecies, you will fail to seize promising opportunities, leaving your skillset to stagnate. And then demand for your skills really will become extinct!


“Whether you think you can, or you think you can’t, you are right.” Henry Ford


However, if you continue to work on challenging data science projects (from data collection to model deployment), you will be on the right side of the field in 10 years, and demand for your skills will only increase!

The choice is yours.

 
Bio: Ahmar Shah, PhD is a scientist and academic. He leads an academic group based at the Usher Institute, Edinburgh Medical School, University of Edinburgh, focusing on data-driven innovation in medicine.

Original. Reposted with permission.
