Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department

An exploration of data science team building, with insight into why engineers should not write ETL, and other not-so-subtle pieces of advice.



So, the Scientists Do All The Work?

No, not at all. If anything, engineers have a much more challenging and demanding role than they do in the standard model. The data scientists probably do as well. We are not optimizing the organization for efficiency; we are optimizing for autonomy. What is offered is clear ownership of ideas and accountability for their delivery.

These are roles that are very attractive to folks who embrace an entrepreneurial mindset. It allows for quick movement, eliminates the need for building unnecessary consensus, and opens the door to disruptive innovation. But it does come at the cost of specialization, and thus efficiency.

The expectation, however, is not that data scientists are going to suddenly become talented engineers. Nor is it that the engineers will be ignorant of all business logic and vertical initiatives. In fact, partnership is inherent to the success of this model. Engineers should see themselves as being “Tony Stark’s tailor”, building the armor that prevents data scientists from falling into pitfalls that yield unscalable or unreliable solutions.

In the absence of abstractions and frameworks for rolling out solutions, engineers partner with scientists to create solutions. But, not in the form of a hand off. Rather, the engineering challenge becomes one of building self-service components such that the data scientists can iterate autonomously on the business logic and algorithms that deliver their ideas to the business. After the initial roll out of a solution, it is clear who owns what. The engineers own the infrastructure that they build, and the data scientists own the business logic and algorithm implementations that they provide. No form of tight coupling is required to iterate.
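As a rough illustration of that ownership split, here is a minimal Python sketch. It is not Stitch Fix's actual platform: `run_job`, `score_client`, and the row fields are hypothetical names invented for this example. The engineer-owned function handles reading, writing, and retries, while the scientist-owned function holds only business logic and can change without touching the platform code.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


# --- Owned by platform engineers (hypothetical self-service component) ---
def run_job(
    read: Callable[[], Iterable[Row]],
    transform: Callable[[Row], Row],
    write: Callable[[List[Row]], None],
    max_retries: int = 3,
) -> int:
    """Read rows, apply the supplied transform, and write the results.

    The whole job is retried on failure, so the business logic does not
    need to know about resilience. Returns the number of rows written.
    """
    last_error = None
    for _ in range(max_retries):
        try:
            results = [transform(row) for row in read()]
            write(results)
            return len(results)
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
    raise RuntimeError(f"job failed after {max_retries} attempts") from last_error


# --- Owned by data scientists (hypothetical business logic) ---
def score_client(row: Row) -> Row:
    """Toy scoring rule; it can be changed without touching run_job."""
    return {**row, "score": row["purchases"] / max(row["shipments"], 1)}


# Usage: the scientist wires their own logic into the engineers' component.
sink: List[Row] = []
rows_written = run_job(
    read=lambda: [{"client": "a", "purchases": 3, "shipments": 5}],
    transform=score_client,
    write=sink.extend,
)
```

The design choice in this sketch is that the only coupling between the two owners is the `transform` signature: a scientist iterating on `score_client` never needs a hand off, and an engineer improving the retry behavior never needs to read the scoring rule.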


A Challenging Road

At this point, you may be skeptical that it’s possible to pull something like this off. However, I think the payoff is well worth the risk. Here are a couple of things to watch out for that can hamper or revert progress:

People are averse to change. People tend to want to recreate environments that they are used to working in. This creates pressure to revert to the Thinker-Doer model. New hires need to get on board with the new structure quickly. Vigilance is especially warranted when a project encounters problems – e.g. an API breaks or an algorithm serves bad results.

People behave in a very reactive way in those circumstances. They will insist that engineers should take over. But they are addressing a symptom rather than the problem. Engineers should instead build in better platform support, visibility, abstractions, and resilience. And everyone should realize that engineers break things too… no one is immune from making a mistake and breaking production.

It is absolutely essential for platform engineers to stay ahead of the data science teams. You need very sharp platform engineers who can make intuitive decisions about what services, frameworks, and capabilities need to be in place before they are desperately needed. The lack of hand off from scientist to engineer means that the engineers do not get the luxury of reacting to requirements delivered by the scientists.

Remember, the engineers are creating lego blocks, and the data science teams are assembling them. If the data science teams don’t have the right blocks to assemble, they will forge ahead nonetheless to create a solution. They’ll solve their problem either by assembling the wrong blocks (square peg in a round hole), or by creating their own. Usually, they will create a Big Mess. One that is hard to undo once it has been created.

Don’t Fear Inefficiency

A consequence of empowering data scientists to take on such a breadth of the stack is that they will be unlikely to produce code and solutions that are as technically efficient as an engineer’s. We are sacrificing technical efficiency for velocity and autonomy. It is important to recognize this as a deliberate trade off.

There is, however, a set of less obvious efficiencies that are gained with end-to-end ownership. The data scientists are experts in the domain of the implementations they are producing. Thus, they are well equipped to make trade offs between technical and support costs vs. requirements. For example, they can decide to sample data in certain places, use approximate methods where they make sense, and make decisions to nix or punt features that may produce only very marginal business impacts but come with extremely high development or support costs. These things seldom happen (and when they do, usually require numerous negotiations) in the assembly line model of hand off between scientists and engineers.

In aggregate, the hope is that the benefits of autonomy, and the innovation it makes possible, will outweigh the technical inefficiencies that come from letting data scientists own their full stack without technical specialization.

The Future

I’ll make no claim that we have discovered the best way to structure a data science department, or that this is the best structure for your organization. But, it is definitely not an attempt to build a faster horse and I feel strongly it is a better solution for Stitch Fix.

It’s my sincere hope that sharing what we have done will encourage others with a non-traditionally structured department to do the same, inspire leaders of data science departments that are in a formative stage to think outside the box and find the courage to challenge tradition, and inform engineers and data scientists who are frustrated by traditional roles that there are different types of environments available to operate in.

1 Ask an engineer interviewing you where the data scientists sit (or vice versa). If they don’t know, don’t walk… Run.

Bio: At Stitch Fix, Jeff Magnusson leads the team responsible for creating robust and scalable platforms and services used by data scientists to blend art and science in a way that creates a deeply personalized experience for the company’s clients. Prior to Stitch Fix, he led the Data Platform Architecture team at Netflix, where he helped design much of their batch compute platform running on AWS.

Original. Reposted with permission.
