Data Engineer vs Data Scientist: the evolution of aggressive species
This article looks at how the two "species" - data scientists and data engineers - harmonise and coexist.
By Sébastien Corniglion, Data ScienceTech Institute
Before you read the following dialogues, I can guarantee that, although anonymised, every one of these small “scenes” has happened in the last three years. They have in fact happened many times, on a daily basis, with most of the brilliant people I come across as applicants, students and alumni!
Me: “Why are you applying to a Data Science programme?”
Random Applicant (Maths Graduate): “I am fascinated by artificial intelligence and its potential.”
Me: “How are you with programming?”
Random Applicant (Maths Graduate): “Mmm, IT is not really my thing.”
Me: “Have you got a decent knowledge of probabilities and statistics?”
Random Applicant (IT graduate): “Sure, I’ve been learning Python to do Machine Learning”
Me: “Excuse me for not being quite clear, I’m talking about the mathematics here”
Random Applicant (IT graduate): “Oh sure, sorry. Yes, I’m also learning Matlab and R”
Me: “Assume you have set up a very good predictive model: mathematically sound, with a tried-and-tested implementation. Your model is production-ready and will be rolled out across the whole organisation. But it’s computationally heavy and requires distributed computing. Assume that at step S, your model needs to compute a metric over the whole dataset. Which distributed system would you use?”
Random Applicant (holds a “Data something” position in an IT company): “Hadoop & Spark with MLlib”
Me: “Are you sure?”
Random Applicant (holds a “Data something” position in an IT company): “Yes, that’s what they are made for and they are open-source, no licence cost.”
Me: “Why are you barely turning up for your Amazon AWS classes?”
A Data Science student (engineering background): “I hate IT, it’s not for me.”
Me: “Leaving aside the market value of being certified, I promise that you’ll need the skills.”
The same student, now a recent graduate, 6 months later: “I’m setting up a pilot AWS infrastructure for the company, our IT is just not ready for data science. I’m looking for a proper Data Science position.”
Four years after founding Data ScienceTech Institute, having run two prototype cohorts and then four cohorts with a comparable MSc programme in Data Science, and despite all our efforts, I concede defeat: mathletes hate IT and geeks will never love maths.
Although we have strangely managed a productive harmony between the two “species” amongst our Faculty, the students, despite their mature age (35 years old on average), consistently turn into “maths monsters” who growl at anything which remotely suggests software engineering beyond the realms of R, Python and a bit of SAS and Matlab. Scala, provided the Spark cluster has been set up for them, is just about acceptable. SQL is stomached not because they have to, but because the relational model rests on solid mathematical theory.
But I’m starting to wonder whether I should equip the classrooms with protective fences when the Professors teach Amazon AWS or Software Engineering methodologies. We certainly keep the rabies vaccines nearby when the time comes to learn how to build a Hadoop / Spark cluster.
The strangest thing is to see people who were used to, or even liked, IT in their previous jobs, and engineers of all types, quickly mutate into the “beasts” described above. The mutation may even happen to a former software engineer!
And yet, the facts are there. As not all organisations have modern IT infrastructures like the web giants do, the typical first task of a Data Scientist is Data Engineering. But it is hated. This has been written and said by many people in many publications and conferences. But as Data Science is coming out of labs in your “regular” organisations, it’s now time to think about “industrialising and urbanising” the processes, as well as setting up clear human resources policies on which skills are actually needed.
The ridiculous, two-page-long skill requirements in a job description for one person must give way to the right people bootstrapping the right shoes.
And looking at Indeed.com is mesmerizing. In India, searching for
[ "data" engineer big -analyst -scientist ]
jobs finds about 4.4 times more jobs than for “Data Scientist”.
In the United States, it’s about 3.5 times and in the United Kingdom about 2 times.
Editor: I disagree with the above analysis. Searching indeed.com jobs in the USA with the string "data engineer" (job title in quotes) finds about 2,500 jobs (as of May 14, 2018), while a search for "data scientist" (job title in quotes) finds about 4,300, or about 70% more jobs. Searching for [data engineer] without quotes, as the author of this blog did, finds many more jobs, but I think most of these jobs are for engineers or for other data-related positions.
However, the ideas in this blog post do not depend on the exact proportion of Data Scientist vs Data Engineer jobs.
It’s not really a surprise, though: “Data Engineers” are basically the “Computer Scientists” of the 21st century. They need to be good software engineers, master infrastructures (on premises and cloud), combine the lot in DevOps practices, and be decent in mathematics in order to bridge the gap with Data Scientists. And that last requirement is not new: good “computer scientists” were always meant to be at least decent in maths!
And as these two “species” will need to work in harmony on a daily basis, we decided to open a “Data Engineering” programme, where we will be having both “packs” collaborating on projects.
This time, I have faith that we will bring peace to the (random) forest!
At this point, the pronunciation of the word “proper” merges either with a growl or a sigh.
If the reader comes either from academia or a large corporation with strong R&D labs, she will be surprised to know that our maths and IT teachers actually like each other, work together in harmony and so far, no member of one tribe has bitten one of the other!
One of the best terms ever: “any test or metric that relies on random sampling with replacement” in statistics and at least five different meanings in computing / software engineering, according to Wikipedia.
Test performed on the Indeed.com platforms for the cited countries, on May 2nd, 2018.
This rather strange expression was crafted to avoid mixing in job offers that include “data science” and to focus on the IT part of data engineering. The term “big” is added to make sure that we are not picking up out-of-scope job offers, as without it, the results in India were somewhat “strange”.
Bio: Sébastien Corniglion is one of the founders of Data ScienceTech Institute (DSTI), a private, postgraduate-only Higher Education school in Nice Sophia-Antipolis and Paris, France. He is the Dean of DSTI and a graduate of Université Nice Sophia-Antipolis and The University of Edinburgh.