Doing Data Science at Twitter
Data scientist career exciting, fulfilling and multidimensional career path. Understand through the journey of a data scientist of twitter about data scientists roles, responsibilities and skills required to perform them.
DS at early stage start-ups, growing start-ups, and those who achieved scale
One of the most common decisions to make while looking for tech jobs is the decision between joining a large v.s. small company. While there are a lot of good general discussions on this topic, there isn’t much information specifically for DS — namely, how the role of DS would change depending on the stage and size the company.
Companies at different stages produce data in different velocity, variety, and volume (the infamous 3Vs). A start-up trying to find its product market fit probably don’t need Hadoop because there isn’t much data. A growing start-up will be more data intensive but might do just fine using PostgreSQL or Vertica. But a company like Twitter cannot efficiently process all its data without using Hadoop and the Map-Reduce framework.
One important lesson I learned at Twitter is that a Data Scientist’s capability to extract value from data is largely coupled with the maturity of the data platform of its company. Understand what kind of DS work you want to get involved, and do your research to evaluate if the company’s infrastructure can support your goal is not only smart, but paramount to ensure the right mutual fit.
- At early stage start-ups: the primary analytic focus is to implement logging, to build ETL processes, to model data and design schemas so data can be tracked and stored. The goal here is focused on building the analytics foundation rather than analysis itself
- At mid-stage growing start-ups: Since the company is growing, the data is probably growing too. The data platform needs to adapt, but with the foundation laid out already, there will be a natural shift to insight generation. Unless the company leverages Data Science for its strategic differentiation to start with, many analytics work are around defining KPI, attributing growth, and finding the next opportunities to grow
- Companies who achieved scale: When the company scales up, data also scales up. It needs to leverage data to create or maintain competitive edge. e.g. Search results need to be better, recommendations need to be more relevant, logistics or operations need to be more efficient — this is the time where specialist like ML engineers, Optimization experts, Experimentation designers can play a huge role in stepping up the game.
By the time I joined Twitter, it already has a very mature data platform and stable infrastructure in place. The warehouse is clean and reliable, and ETL processes are processing hundreds of Map-Reduce jobs easily on a daily basis. Most importantly, we have talented DS working on data platform, product insights, Growth, experimentations, and Search/Relevance, along with way other focus areas.
I was the first dedicated Data Scientist on Growth, and the reality is, it took us a good few months before Product, Engineering, and DS converged on how DS can play a critical role in the process. Based on my experience working closely with the product team, I categorize my responsibilities into four general areas:
- Product Insights
- Data Pipeline
- Experimentation (A/B Testing)
Let me describe my experience and learning in each of these topics.
One of the unique aspects of working for a consumer technology company is that we can leverage data to understand and infer the voice and preference of our users. Whenever a user interacts with the product, we record useful data and metadata and store them for future analyses.
This process is known as logging or instrumentation, and is constantly evolving. Frequently, DS might find a particular analysis difficult to perform because the data is either malformed, inappropriate, or missing. Establishing a good relationship with the engineers is very useful here because DS can help engineers to identify bugs or unintended behaviors in the system. In return, engineers can help DS to close “Data Gaps” and to make data richer, more relevant, and more accurate.
Here are a few examples of product related analyses I performed at Twitter:
- Push Notification Analysis — How many users are eligible for push notifications? across user segment? across clients? What are the tap rates of different push notification types?
- SMS Delivery Rates — How do we calculate Twitter’s SMS delivery rates across different carriers? Are our delivery rates in emerging countries poorer? How can we make them better?
- Multiple Accounts — Why do certain countries have a higher ratio of multiple accounts? What drive people to create multiple account?
Analyses come in different forms — sometimes you are asked to provide straightforward answers to simple data pulls (push analysis), other times you might need to invent and come up with new ways to calculate a new but important operational metrics (SMS delivery rates), and finally you might be tasked to understand deeper about user behaviors (multiple accounts).
Generating insights through product analysis is an iterative process. It requires challenging the questions being asked, understanding the business context, and figuring out the right dataset to answer the questions. Over time, you will become an expert in where the data lives and what they mean. You will get better at estimating how much time it will take to carry out an analysis. More importantly, you will slowly move from a reactive state to proactive state and start suggesting interesting analyses that product leaders might not think of, because they don’t know the data exists or that disparately different data sources can be complementary and combined in a particular way.
Skill used here:
- Logging and Instrumentation. Identifying Data Gaps. Establish good relationships with engineers
- Ability to navigate and identify relevant datasets and how to use them
- Understand different types of analyses and get better at time estimates on how long or difficult they will take
- Know your query language. Typical data munging skills using R or Python
Even though type A data scientists might not produce codes that are directly user facing, surprisingly often, we still commit codes into the codebase for the purpose of data pipeline processing.
If you heard of the operation | (pipe) from Unix that facilities the execution of a series of commands, a data pipeline is nothing but a series of operations, when streamed together, helped us to automatically capture, munged, aggregated data on a recurring basis.
Before Twitter, most of my analysis are ad-hoc in nature. They are mostly run and executed once or few times on my local machine. The codes were rarely code reviewed, and they are most likely not version controlled. When a data pipeline is created, a new set of concerns start to surface such as dependency management, scheduling, resource allocation, monitoring, error reporting, and alerting.
Here is a typical process of creating a data pipeline:
- You realized that the world would be a better place if a dataset can be produced on a recurring basis
- Upon confirming the need, you started off by designing the final product first, such as designing the data schema of the output dataset.
- Write your code, either in Pig, Scalding, or SQL, depending on where your data lives.
- Submit for code reviews, and be prepared to get feedbacks and make additional changes because either your business logic is incorrect or your code was not optimized for speed and efficiency #shipit.
- There might be a step of testing and dry-running your job to make sure everything works as intended.
- Merge your code into master. Deploy the code and schedule your job!
- Set up monitoring, error reporting, and alerts in case things go awry
Obviously, a pipeline is more complex than an ad-hoc analysis, but the advantage is that this job can now be run automatically, the data it produces can be used to power dashboards so more users can consume your data/results. More importantly but a subtle point, this is a great learning process to pick up engineering best practices, and it provides the foundation in case you ever need to build specialized pipelines such as a Machine Learning Model (I will talk about this more in the last section) or a A/B testing platform.
Skills Used Here:
- Version control, the most popular by far is Git
- Learn how to do code reviews and provide feedbacks effectively
- Know how to test, dry-run, and debug when job fails
- Dependency management, Scheduling, Resource Allocation, Monitoring, Error Reporting, Alerting