Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department
An exploration of data science team building, with insight into why engineers should not write ETL, and other not-so-subtle pieces of advice.
By Jeff Magnusson, Stitch Fix.
“What is the relationship like between your team and the data scientists?” This is, without a doubt, the question I’m most frequently asked when conducting interviews for data platform engineers. It’s a fine question – one that, given the state of engineering jobs in the data space, is essential to ask as part of doing due diligence in evaluating new opportunities. I’m always happy to answer. But I wish I didn’t have to, because this is a question that is motivated by skepticism and fear.
Why is that? If you read the recruiting propaganda of data science and algorithm development departments in the valley, you might be convinced that the relationship between data scientists and engineers is highly collaborative, organic, and creative. Just like peas and carrots.
However, it’s not a well-kept secret that this is seldom the case. Most shops foster a relationship between engineers and scientists that lies somewhere on the spectrum between non-existent and highly dysfunctional.
A Typical Data Science Department
Most companies structure their data science departments into 3 groups:
- Data scientists: the folks who are “better engineers than statisticians and better statisticians than engineers”. Aka, “the thinkers”.
- Data engineers: these are the folks who build pipelines that feed data scientists with data and take the ideas from the data scientists and implement them. Aka, “the doers”.
- Infrastructure engineers: these are the folks who maintain the Hadoop cluster / big data infrastructure. Aka, “the plumbers”.
Data scientists are often frustrated that engineers are slow to put their ideas into production and that work cycles, road maps, and motivations are not aligned. By the time version 1 of their idea is put into an A/B test, they already have versions 2 and 3 queued up. Their frustration is completely justified.
Data engineers are often frustrated that data scientists produce inefficient and poorly written code, have little consideration for the maintenance cost of productionizing ideas, demand unrealistic features that skew implementation effort for little gain… The list goes on, but you get the point.
Infrastructure engineers get frustrated with everyone for overloading the clusters and filling up disk space. They are kept at arm’s length from the scientists and engineers, which means they never gain a solid context into how the infrastructure is being used, or the business and technical problems that it needs to be used to solve. This makes them feel powerless to improve the situation. Instead, they react by making the infrastructure more restrictive. In turn, everyone becomes frustrated with them.
It’s a vicious cycle.
What Went Wrong?
We all know that the standard is substandard, and that the recruiting hype is just that. So, why don’t we fix it? Why does every data science and algorithms development team seem to slide into the same dysfunctional model?
I blame two things, offered here in the form of a couple of observations:
You Probably Don’t Have Big Data
Data processing tools and technologies have evolved massively over the last five years. Unless you need to process many petabytes of data, or you’re ingesting hundreds of billions of events a day, most technologies have evolved to a point where they can trivially scale to your needs.
Unless you need to push the boundaries of what these technologies are capable of, you probably don’t need a highly specialized team of dedicated engineers to build solutions on top of them. If you manage to hire them, they will be bored. If they are bored, they will leave you for Google, Facebook, LinkedIn, Twitter, … – places where their expertise is actually needed. If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously overcomplicated, awful-to-work-with messes they call “solutions”. Messes tend to necessitate specialization.
Everybody Wants to be the “Thinker”
Because it sounds like such a cool role! You get to sit around all day, think up better ways to do things, and then hand off your ideas to people who eagerly rush to put them into production. Go ahead, pitch it to somebody on the street. I bet they jump at the opportunity! Data scientists, especially those who are newer to the industry and don’t know any better, are especially vocal about desiring such a role.
That’s because we have trained them to desire it. We have bigger, more established companies to thank for that. Companies that had business intelligence departments before the Big Data craze…
A traditional business intelligence department consists of three roles: ETL engineers, report developers, and DBAs. ETL engineers move the data into the data warehouse. They are obsessed with Kimball and his guide to dimensional modeling. Report Developers, on the other hand, are folks who have made a career around designing reports in a specific tool (e.g. MicroStrategy, et al). They are specialists. DBAs (and a team of other tool administrators) do their best to just keep things running.
Here’s the thing. ETL engineers, Report Developers, and DBAs are all “Doers”. So, 10 years ago or so, when Big Data and data science started to become buzzwords, there were well-established BI departments who had plenty of Doers and not enough Thinkers. So, they made “Thinker” a role. We integrated data scientists with established BI departments by promising them the ability to fiddle around with data and change the course of the business. In reality, this didn’t happen. These data scientists occasionally manage to create some pretty cool and effective solutions, but by and large they focus on performing slightly higher level Report Developer-ing back to the business (which largely ignores their advice).
But the role sounds really nice, and it’s easy to recruit for. Thus was born the traditional, modern day data science department: data scientists (Report developers aka “thinkers”), data engineers (ETL engineers aka “doers”), and infrastructure engineers (DBAs aka “plumbers”).
Whoops. It would seem that the business intelligence department never really changed; we just added a Hadoop cluster and started calling it by a new name.