How to Grow Your Own Data Scientists

How Zynga is “home growing” its own data science talent from the inside, by retraining some of our top analysts and engineers to become data scientists.



By Amy Gershkoff, Chief Data Officer, Zynga.

By 2018, globally, demand for data scientists is projected to exceed supply by more than 50%.  To close the data science talent gap, many organizations recruit aggressively, offering above-market compensation and lucrative perks.  Nevertheless, recruiting and retaining data science talent remains a significant challenge for most organizations.

In response to this challenge, at my company, I’ve led the adoption of a new strategy: in addition to trying to buy data science talent from outside our organization, we’re “home growing” our own data science talent from the inside, by retraining some of our top analysts and engineers to become data scientists.

The advantages of the program include:

  • Retaining proven talent by offering career growth
  • Creating data scientists in a market where they are scarce and difficult to hire
  • Building on existing analysts’ and engineers’ familiarity with the organization’s data, infrastructure, and business problems

Here’s how we structured the program:

The Application Process

To be accepted into the program, employees need a minimum of twelve months in their current role, approval from their manager, and an above-average rating for the two previous performance review cycles.  Additionally, candidates need a minimum of two previous semesters of coursework in statistics, economics, computer science, or a similar field.  Candidates then write an essay describing their interest.

Phase I: Foundational Statistical Theory

In Phase I, participants learn the basics of probability theory and statistical analysis in an academic environment.  Key modules for probability theory include sampling theory, hypothesis testing, and statistical distributions.  For statistical analysis, topics include correlation, standard deviations, and basic regression analysis, among others.  Usually, one to two semesters of an online statistics course (such as Princeton University’s online course) cover this material.
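
As a small illustration of the statistical analysis topics named above (standard deviation, correlation, and basic regression), the following Python sketch computes each from first principles using only the standard library. The data and the names used (sessions, minutes) are invented for illustration and are not taken from the program.

```python
import math
import statistics

# Hypothetical data: weekly sessions and minutes played for five players.
sessions = [1, 2, 3, 4, 5]
minutes = [12, 19, 31, 42, 48]


def pearson_correlation(xs, ys):
    """Pearson's r: covariance scaled by the spread of each variable."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


spread = statistics.stdev(minutes)  # sample standard deviation
r = pearson_correlation(sessions, minutes)
slope, intercept = simple_linear_regression(sessions, minutes)

print(f"stdev of minutes: {spread:.2f}")
print(f"correlation: {r:.3f}")
print(f"fit: minutes = {slope:.2f} * sessions + {intercept:.2f}")
```

In practice a participant would reach for a statistics library rather than hand-written formulas; writing them out once simply shows what those library calls compute.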

Phase II: Foundational Programming Skills

Knowledge of scripting languages such as Python and R is key to being an effective data scientist. In this module, each program participant engages in self-study of Python and R, using books (e.g., Mark Lutz’s Python book) or online courses (e.g., Johns Hopkins R Programming course).

The participant then has the opportunity to apply these skills via a series of five assignments, under the supervision of a Senior Data Scientist.  These assignments are real business problems facing the organization, but for which the deadline is flexible, allowing time for the participant to practice these skills in a live business setting.

Phase III: Machine Learning

In Phase III, participants learn both supervised and unsupervised learning techniques.  Supervised learning techniques include decision trees, random forests, logistic regression, neural networks, and support vector machines (SVMs).  Unsupervised learning techniques include clustering, principal components analysis, and factor analysis.

To complete the module, participants first need to complete at least one machine learning course (e.g., Stanford University’s online course) covering all of the topics above.  Then program participants leverage their skills in two data science projects under the supervision of a Senior Data Scientist.  Again, these projects are actual business challenges, allowing the participant to practice these skills in a live setting.
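
To make the supervised/unsupervised distinction concrete, here is a minimal Python sketch; it is not taken from the program's coursework. It fits a decision stump (a decision tree with a single split) to labeled data, then runs a one-dimensional k-means clustering on the same values with the labels withheld. The data, the function names, and the simplifications (one feature, one split direction, fixed starting centroids) are all assumptions made for illustration.

```python
# Hypothetical data: days active in the first two weeks, and whether the
# player was still active a month later (1) or not (0).
days_active = [1, 2, 3, 6, 7, 8]
retained = [0, 0, 0, 1, 1, 1]


def fit_decision_stump(values, labels):
    """Supervised: use the labels to pick the threshold with fewest errors.

    Simplification: only the rule "predict 1 when value > threshold" is tried.
    """
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best_threshold, best_errors = None, len(values) + 1
    for threshold in candidates:
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def kmeans_1d(values, start_centroids, iterations=10):
    """Unsupervised: group values around centroids without using any labels."""
    centroids = list(start_centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


threshold = fit_decision_stump(days_active, retained)
predicted = [int(v > threshold) for v in days_active]

centroids, clusters = kmeans_1d(
    days_active, start_centroids=[min(days_active), max(days_active)]
)

print("stump threshold:", threshold)
print("stump predictions:", predicted)
print("k-means centroids:", centroids)
print("k-means clusters:", clusters)
```

On this toy data both approaches find the same two groups, but only the stump needed the labels to do so. That difference is the one the course material in this phase builds on.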

Phase IV: Big Data Toolbox

It is important for data scientists to not only learn the necessary algorithms, but also to learn how those algorithms need to be adapted for large datasets.  For this reason, basic knowledge of tools such as Hadoop, Spark, and Revolution R constitutes a dedicated module.

Participants gain knowledge about these key infrastructural aspects of Big Data through books (such as Agile Data Science (with Hadoop) or Advanced Analytics with Spark) and a series of self-learning modules we developed internally.

Participants then have the opportunity to practice these skills by building three data science models in a Hadoop environment and productionalizing these models in Revolution R or a similar system.
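
One way to picture how an algorithm is adapted for large datasets is the map-and-reduce pattern that frameworks such as Hadoop and Spark are built around: each partition of the data is summarized independently, and the small summaries are then merged. The plain-Python sketch below imitates that pattern in a single process to compute a mean. It does not use Hadoop or Spark, and the partitions and function names are hypothetical.

```python
from functools import reduce

# Hypothetical data, already split into partitions as it might be
# across several machines.
partitions = [[12, 19], [31, 42], [48]]


def summarize_partition(values):
    """Map step: shrink one partition to a (count, total) summary."""
    return (len(values), sum(values))


def merge_summaries(left, right):
    """Reduce step: combine two summaries; the order of merging does not matter."""
    return (left[0] + right[0], left[1] + right[1])


count, total = reduce(merge_summaries, map(summarize_partition, partitions))
overall_mean = total / count

print(f"rows: {count}, mean: {overall_mean:.1f}")
```

The point of the design is that no step ever needs the full dataset in one place: only the small (count, total) pairs travel between workers. Statistics that cannot be merged this way, such as an exact median, are the ones that need a different algorithm at scale.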

Program Completion

Upon completing all four phases (typically twelve to eighteen months), the participant is officially considered to be an Associate Data Scientist and becomes engaged full-time on data science tasks.

Naturally, taking online courses and engaging in self-learning should not be viewed as equivalent to enrolling in a data science degree program offered by a traditional university.  But for organizations looking to expand their talent pool of data scientists, and for analysts or engineers seeking career growth, data science apprenticeship programs of this nature offer a “win-win” for both employers and employees.

Bio: Amy Gershkoff (@amygershkoff) is the Chief Data Officer at Zynga and an Adjunct Professor at the University of California, Berkeley, where she teaches data science.