Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department
An exploration of data science team building, with insight into why engineers should not write ETL, and other not-so-subtle pieces of advice.
Is it Really that Bad?
In truth, it depends on what you’re looking to achieve. If you buy into the argument above, then you have to accept that it has allowed companies to get by for many, many years… since the advent of BI. But I believe it’s a terribly inefficient model if you want your data science team to truly do more than build PowerPoint decks and dashboards.
The fundamental flaw that prevents the Thinker and Doer model from living up to its recruiting hype is the assumption that there exists an army of soulless non-mediocre Doer engineers who eagerly implement the ideas and vision of data scientists. Does that sound like the profile of any talented engineers that you know?
In this model the Doers are solely accountable for implementation, failure, and support of other people’s ideas, while the Thinker is rewarded for their success. This is at the heart of the contention and misalignment between the teams. It creates an IT group rather than an engineering team.
In order to attract talented engineers into a role like that, you need some really big scaling problems to serve as a distraction to the soulless, subservient role you have hired them into. You need the type of problems created by the existence of Big Data. And, I’m sorry, but you don’t have Big Data.
Instead, you will hire mediocre engineers. They will create tremendously overcomplicated messes. This will exacerbate the contention. Welcome to the Vicious Cycle. The end result is a team of data scientists who are empowered to be little more than report developers because they lack the support of a solid, innovative data platform. And if your recruiting hype had pitched them on the Report Developer role, they would have run the other way. After all, they’re Thinkers, not Doers!
A Different Kind of Data Science Department
Rather than try to emulate the structure of well-known companies (who made the transition from BI to DS), we need to innovate and evolve the model! No more trying to design faster horses...
A couple of years ago, I moved to Stitch Fix for that very reason. At Stitch Fix, we strive to be Best in the World at the algorithms and analytics we produce. We strive to lead the business with our output rather than to inform it. Unless you’re willing to boldly challenge and rethink the things you know to be substandard, that is a damn hard proposition to fulfill.
After seeing the department grow and develop over the last two years, I feel confident sharing what we are up to.
Given that the goal is to lead rather than to inform, I would like to propose to you what I believe is A Better Way to structure a data science department. A way that allows for autonomy in roles, true ownership all the way into production, and accountability for output. A way that is well suited for a company with a quickly evolving business (and data) model.
What follows is a blueprint for building a data science team that can pivot and react quickly, so as to lead and innovate through the production of thought-leadership, APIs, and code, rather than react to changes and throw together some PowerPoint presentations in a desperate attempt to redirect gut feelings and intuitions.
Enable Everyone to be Best in the World
Let’s forget the traditional roles, and instead think about the intrinsic motivations that get folks excited to come to work in the morning.
Regardless of role, a fundamental differentiator between adequate and great people lies in their desire and talent for being creative. Great people are able to identify and creatively solve problems that would absolutely baffle the mediocre. They excel in and crave an environment of autonomy, ownership, and focus.
The assembly line handoff from scientist to engineer creates the polar opposite environment. (Truth is, even the Thinker resents having to rely on the Doer.) The trick is to create an environment that allows for autonomy, ownership, and focus for everyone involved.
However, it is important to recognize that engineers and data scientists are impassioned by very different tasks:
Data Scientists:
Data scientists love working on problems that are vertically aligned with the business and that make a big impact on the success of projects or the organization through their efforts. They set out to optimize a certain thing or process or create something from scratch. These are point-oriented problems and their solutions tend to be as well. They usually involve a heavy mix of business logic, reimagining of how things are done, and a healthy dose of creativity. Thus, they require a deep understanding of how specific portions of the business operate and a high degree of partnership with business verticals.
Engineers:
Engineers excel in a world of abstraction, generalization, and finding efficient solutions in the places where they are needed. These problems are usually horizontally oriented in nature. They can be most impactful when applied broadly. They require a good overall understanding of how the business operates, but the abstracted nature of solutions means they are light on business logic and do not require a heavy partnership with or deep understanding of verticals within the business.
Hybrid Thinker-Doers
A common fear of engineers in the data space is that, regardless of the job description or recruiting hype you produce, you are secretly searching for an ETL engineer.
In case you did not realize it, nobody enjoys writing and maintaining data pipelines or ETL. It’s the industry’s ultimate hot potato. It really shouldn’t come as a surprise then that ETL engineering roles are the archetypal breeding ground of mediocrity.
Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul-sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.
Instead, give people end-to-end ownership of the work they produce (autonomy). In the case of data scientists, that means ownership of the ETL. It also means ownership of the analysis of the data and the outcome of the data science. The best-case outcome of many efforts of data scientists is an artifact meant for a machine consumer, not a human one. Rather than a report, dashboard, or PowerPoint presentation, it is some sort of algorithm or API that is integrated into the engineering stack – something that fundamentally changes the operation of the business. Autonomy means the data scientists own that code as well. All the way into production. They should be able to develop and deploy it without asking the permission of engineers, be accountable for support, be held to performance, latency, and SLA requirements, etc.
This puts vertical responsibility and focus squarely into the hands of data scientists. But data scientists are not typically classically trained or highly skilled software engineers. You could say they are adequate, at best. So you would expect that they would create a Big Mess.
This is one reason why ETL and API / production algorithm development is typically handed off to an engineer in assembly line style. But, all of those tasks are inherently vertically (point) focused. Talented engineers in the data space are almost always best focused on horizontal applications.
So, then, what is the role of an engineer in this new, horizontal world? To sum it up, engineers must deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy (such as a tool, framework, or service used to build, schedule, and execute ETL). I like to think of it in terms of Lego blocks. Engineers design new Lego blocks that data scientists assemble in creative ways to create new data science. This is definitely easier said than done, but:
- Engineers’ work can be completely horizontal in nature. This allows them to focus on building technology that is broadly applicable across multiple data science problems. This maximizes leverage of engineering output. Which is great, since you probably have far more data scientists than you do engineers in your data science department.
- Engineers get to focus on what they do best: abstracting, generalizing, and creating efficient, scalable solutions where they are needed.
- Engineers get to operate with autonomy. The production from an engineering team deployed in this manner should look like “magic”. Things should just “fall into place” for the data scientists because their needs are anticipated and the scaling and resiliency are taken care of by the platform, services, and frameworks they are using.
- In order for this to work well, most of the time the engineers need to anticipate the needs of the data scientists. They should be developing multiple steps ahead.
- For highly talented and creative engineers and data scientists, it’s a hell of a lot more fun.
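To make the Lego-block idea a little more concrete, here is a minimal sketch in Python of how the split might look. Everything in it is hypothetical and invented for illustration: the `Pipeline` class, its `task` decorator, the `max_attempts` parameter, and the toy rows are not Stitch Fix code or any real library's API. The engineer owns the horizontal block (ordering, retries, passing results between steps), and the data scientist writes only the vertical logic and wires the steps together.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, Tuple


class Pipeline:
    """Horizontal block (engineer-owned): ordering, retries, result passing.

    Hypothetical illustration only, not a real framework.
    """

    def __init__(self, max_attempts: int = 2) -> None:
        self.max_attempts = max_attempts
        self._tasks: Dict[str, Tuple[Callable[..., Any], Tuple[str, ...]]] = {}

    def task(self, *depends_on: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        """Register a function as a step that runs after the named steps."""
        def register(fn: Callable[..., Any]) -> Callable[..., Any]:
            self._tasks[fn.__name__] = (fn, depends_on)
            return fn
        return register

    def run(self) -> Dict[str, Any]:
        """Run every step in dependency order; return results keyed by step name."""
        for name, (_, deps) in self._tasks.items():
            for dep in deps:
                if dep not in self._tasks:
                    raise ValueError(f"{name} depends on unknown step {dep!r}")
        graph = {name: set(deps) for name, (_, deps) in self._tasks.items()}
        results: Dict[str, Any] = {}
        for name in TopologicalSorter(graph).static_order():
            fn, deps = self._tasks[name]
            args = [results[dep] for dep in deps]
            for attempt in range(1, self.max_attempts + 1):
                try:
                    results[name] = fn(*args)
                    break
                except Exception:
                    # Re-raise only once the retry budget is used up.
                    if attempt == self.max_attempts:
                        raise
        return results


# Vertical logic (data-scientist-owned): only the business-specific steps.
pipeline = Pipeline()


@pipeline.task()
def extract():
    # Stand-in for a warehouse query; the rows are made up.
    return [
        {"client": "a", "kept": 3, "sent": 5},
        {"client": "b", "kept": 1, "sent": 5},
    ]


@pipeline.task("extract")
def transform(rows):
    # Keep rate per client.
    return {row["client"]: row["kept"] / row["sent"] for row in rows}


@pipeline.task("transform")
def load(rates):
    # Stand-in for a write to a serving table: clients ranked by keep rate.
    return sorted(rates, key=rates.get, reverse=True)


results = pipeline.run()
```

The point of the sketch is the division of labor rather than the particular design. The data scientist never touches scheduling or retry code, and an improvement to `Pipeline.run` (better retries, parallelism, alerting) would benefit every pipeline built on it at once.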