How should I organize a larger data science team?

A VP of Data Science is asking for opinions on how he should organize a larger data science team.

By Dan Friedman, Expedia.

I manage a data science team at a Fortune 500 company that has grown from a few people to over 40, with an aim to grow even larger in the next 12 to 18 months. We are primarily machine learning practitioners: PhD-level people writing Python / Spark and using best-practice tools like GBM and DNN. When we were small, we got by "scrappily," making do without dedicated admins and/or an infrastructure team, doing everything ourselves or borrowing resources from engineering. We do have a few product managers who help manage requirements and interface with other teams.

Now that we are a bigger team, it doesn't make as much sense to do it this way. I was hoping that some people on this board could help me benchmark against what other teams in large companies are doing.

BTW, here are some pain points for us:

  • Maintaining data pipelines
  • Getting training algorithms to scale
  • Optimizing our AWS bill
  • Training new data scientists
  • Training engineers on related teams
  • Managing requirements from business users and other engineering teams
  • Making sure that the data scientists are innovators and not just order takers

Please comment below or reply to

Bio: Dan Friedman is VP of Data Science at Expedia, in the Seattle, WA area.


Editor: Disqus had some problems, so this excellent comment was moved from Disqus to the body of the post.

Terran Melconian

Here are my opinions, having been in some related situations:

Getting people who can scale ML algorithms beyond the current state of the art is fundamentally hard. This requires a deep understanding of computer science, machine learning, and math, and very high general intelligence. Demand exceeds supply, and large tech companies bid aggressively for these people, sometimes with seven-figure compensation. Consider seriously what the ROI is for your business: does the incremental benefit that you will get, relative to using a simpler algorithm, a smaller training set, or accepting a slower cycle time, have sufficient dollar returns?

Fortunately, getting people to build and maintain pipelines is not as hard. There are plenty of people who got a BS in CS and then got a master's or continuing-education certificate in ML and are now finding that a PhD is expected for the jobs they actually want, so you should be able to hire some of them to do pipeline work instead. I would try to get a manager and/or tech lead with some actual data engineering experience, then build out the rest of the team with generalists willing to learn.

One thing that struck me is that you have a team of 40 people and you're overweight on PhDs doing advanced, high-accuracy, expensive ML models. I think this may be related to your innovation issues. The skill set you've selected for is tuned for one stage in the project life cycle: when you've already determined that there is a business opportunity and that you have the data to predict something useful, and you want to maximize the accuracy of those predictions. This is not the stage of the project life cycle where new opportunities are discovered. I think I would increase my weighting on people with strong creativity, business and domain knowledge, and a statistics/EDA background, which is the skill set that's going to let you discover and get credit for new opportunities instead of optimizing opportunities that somebody else brings you.

Regarding managing requirements, one technique that has worked for me is to require an estimate, or at least a bound, on the dollar value that could be realized if the model performs according to favorable-case expectations, and to prioritize projects based on that. It encourages "optimistic" estimates, but I found that even as an upper bound they're useful, and it has the advantage of being a fair and reasonably transparent process.
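
The prioritization described above could be sketched roughly as follows. This is only an illustration, not a description of any actual tool: the ProjectRequest class, its field names, the prioritize function, and the example requests and dollar figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProjectRequest:
    """A hypothetical intake record for a modeling request."""
    name: str
    # Requester's favorable-case (upper-bound) estimate of the dollar value
    # realized if the model performs as hoped. Treated as a bound, not a forecast.
    upper_bound_value_usd: float


def prioritize(requests):
    """Return requests ordered from highest to lowest upper-bound value."""
    return sorted(requests, key=lambda r: r.upper_bound_value_usd, reverse=True)


# Made-up example requests; the names and amounts are placeholders.
requests = [
    ProjectRequest("request A", 250_000),
    ProjectRequest("request B", 2_000_000),
    ProjectRequest("request C", 600_000),
]

for rank, request in enumerate(prioritize(requests), start=1):
    print(f"{rank}. {request.name}: up to ${request.upper_bound_value_usd:,.0f}")
```

Because every request is ranked on the same stated number, requesters can see why their project landed where it did, which is where the transparency comes from.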

Training depends on whether you want people to know ABOUT what your team does, or whether you want them to be able to DO things themselves. For people who just need to know ABOUT the field, there's a good book from Provost and Fawcett ("Data Science for Business") which I recommend to managers who are going to be interacting with a data science team; it should work fine for engineers as well. You might organize a "book group" where somebody from your team meets with them for half an hour a week to discuss each chapter (and gently make sure they actually read it).

To impart actual usable skills, I have even more to say, as I've recently specialized in that, but I fear this comment is already getting longer than your original post. Feel free to get in touch if you'd like to hear more about that.