Doing Data Science at Twitter


A data science career is an exciting, fulfilling, and multidimensional career path. This post walks through the journey of a data scientist at Twitter to explain data scientists' roles, responsibilities, and the skills required to perform them.



Experimentation (A/B Testing)


Right at this moment, it’s very possible that the Twitter app you are using is slightly different from mine, and it’s entirely possible that you actually have a feature that I do not see. Under the hood, because Twitter has so many users, it can direct a small percentage of its traffic to experience a new feature that is not yet public, and then observe how those users react to it compared to those who don’t see it (the control group) — this is known as A/B testing, where we get to test which variant, A or B, is better.
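
To make the idea concrete, here is a minimal sketch of how deterministic traffic splitting might work. This is only an illustration under my own assumptions (the function, the hashing scheme, and the 5% cutoff are made up), not Twitter's actual experimentation framework.

```python
# A minimal sketch of deterministic bucket assignment (illustrative only;
# not Twitter's actual experimentation framework).
import hashlib

def assign_bucket(user_id: str, experiment_name: str, treatment_pct: float = 0.05) -> str:
    """Hash the (experiment, user) pair so the same user always lands in the
    same bucket, and only a small % of traffic sees the new feature."""
    digest = hashlib.md5(f"{experiment_name}:{user_id}".encode()).hexdigest()
    # Map the hash to a number in [0, 1); users below the cutoff get the treatment.
    position = int(digest, 16) / 16 ** len(digest)
    return "treatment" if position < treatment_pct else "control"

print(assign_bucket("user_42", "new_timeline_feature"))  # e.g. 'control'
```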

I personally think A/B testing is one of the unique perks of working for a large consumer tech company. As a Data Scientist, you get to establish causality (something really hard to do with observational data) by running actual randomized, controlled experiments. At Twitter, “It’s rare for a day to go by without running at least one experiment” — Alex Roetter, VP of Engineering. A/B testing is ingrained in our DNA and our product development cycle.

Here is the typical process of running an A/B test: Gather Samples -> Assign Buckets -> Apply Treatments -> Measure Outcomes -> Make Comparisons. Surely this sounds pretty easy, no? On the contrary, I think A/B testing is one of the most under-appreciated and trickiest pieces of analytics work, and it’s a skill that’s rarely taught in school. To demonstrate my point, let’s revisit the 5 steps above and look at some of the practical problems you might run into (a quick sketch of two of these steps follows the list):

  • Gather Samples — How many samples do we need? How many users should go into each bucket? How can we ensure that the experiment will have sufficient power?
  • Assign Buckets — Who is eligible to be in the experiment? Where in the code should we start assigning buckets and showing treatments? Would the placement introduce data dilution (i.e. some users are assigned to treatment but never see it)?
  • Apply Treatment — Are there any other teams in the organization running experiments that are competing for the same real estate in the app? How do we deal with experiment collision and ensure our data is not contaminated?
  • Measure Outcome — What is the hypothesis of the experiment? What are the success and failure metrics of this experiment? Can we track them, and how? What additional logging do we need to add?
  • Make Comparisons — Suppose we see that the number of users who logged in increased dramatically: is it just noise? How do we know if the results are statistically significant? Even if they are, are they practically significant?
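
As a quick illustration of the Gather Samples and Make Comparisons steps, here is a rough sketch in Python using statsmodels. The baseline rate, the lift we want to detect, and the counts are made-up numbers for illustration, not real Twitter data.

```python
# A sketch of two of the steps above, using statsmodels
# (the 5% baseline, 1-point lift, and user counts are made up).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest

# Gather Samples: how many users per bucket do we need to detect a lift
# from a 5.0% to a 6.0% login rate with 80% power at alpha = 0.05?
effect = proportion_effectsize(0.06, 0.05)
n_per_bucket = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"~{n_per_bucket:.0f} users per bucket")

# Make Comparisons: once the experiment has run, test whether the observed
# difference in logged-in users is statistically significant.
successes = [620, 540]        # logged-in users in treatment vs. control
observations = [10000, 10000]  # users assigned to each bucket
z_stat, p_value = proportions_ztest(successes, observations)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```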

Addressing each and every question above requires a good command of statistics. Even if you are as rigorous as possible when designing an experiment, other people might fall short. A PM might be incentivized to peek at the data early, or to cherry-pick the results they want (because it’s human nature). An engineer might forget to log the specific information we need to calculate the success metric, or the experiment code might be written the wrong way and introduce unintended bias.

As a Data Scientist, it’s important to play devil’s advocate and help the team stay rigorous, because time wasted on running an ill-designed experiment is not recoverable. Far worse, an ill-informed decision based on bad data is more damaging than anything else.

Skills Used Here:

  • Hypothesis Testing: Statistical tests, p-values, statistical significance, power, effect size, multiple testing (see the short sketch after this list)
  • Pitfalls of Experimentation: Carryover effect, metrics cherry-picking, data dilution, bucket anomaly
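
To make the multiple-testing item concrete: if an experiment tracks many success metrics, some will look “significant” purely by chance, so the p-values need to be adjusted. A small sketch with made-up p-values, using a Benjamini–Hochberg correction as one possible approach:

```python
# Illustrating the multiple-testing pitfall with made-up p-values,
# one per success metric tracked in the experiment.
from statsmodels.stats.multitest import multipletests

p_values = [0.004, 0.03, 0.04, 0.20, 0.45]
rejected, adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(adjusted.round(3), rejected)))  # adjusted p-values and pass/fail
```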

Predictive Modeling & Machine Learning

My first big project at Twitter was to augment a set of fatigue rules in our existing email notification product to reduce spam to our users. While this was a noble gesture, we also knew that email notification is one of the biggest retention levers (we know this causally because we ran experiments on it), so finding the right balance was the key.

With this key observation, I quickly decided to focus on trigger-based emails, the email types that tend to arrive in bursts in users’ inboxes when interactions happen. Being an ambitious new DS trying to prove his value, I decided to build a fancy ML model to predict email CTR at the individual level. I marched on, aggregated a bunch of user-level features in Pig, and built a random forest model to predict email clicks. The idea was that if a user has a consistently long history of low CTR, we can safely hold back that email from that user.
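
For illustration, here is roughly what that kind of model looks like, written as a Python/scikit-learn sketch (the original was built in R); the file name and feature names are hypothetical, not the actual features we used.

```python
# A rough sketch of a random forest CTR model on user-level features.
# The input file and feature names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical user-level features aggregated upstream (e.g. in Pig).
df = pd.read_csv("user_email_features.csv")  # one row per user
features = ["historical_ctr", "emails_received_30d", "days_since_last_click"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["clicked"], test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# If a user's predicted click probability is consistently low,
# hold back the email for that user.
click_prob = model.predict_proba(X_test)[:, 1]
print("Mean predicted CTR on held-out users:", round(float(click_prob.mean()), 3))
```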

There was only one problem — all of my work was done on my local machine in R. People appreciated my efforts, but they didn’t know how to consume my model because it was not “productionized” and the infrastructure could not talk to my local model. Hard lesson learned!

A year later, I found a great opportunity to build a churn prediction model with two other DS from Growth. This time around, because I had accumulated enough experience building data pipelines, I learned that building a ML pipeline is in fact quite similar — there is the training phase, which can be done offline with periodic model updates in Python, and there is the prediction phase, where we aggregate user features daily and let the prediction function do its magic (mostly just a dot product) to produce a churn probability score for each user.
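
Here is a minimal sketch of that two-phase structure, assuming a logistic-regression-style model so that scoring really is a dot product plus a sigmoid; the feature names, label, and paths are hypothetical stand-ins.

```python
# A sketch of the two-phase pipeline: an offline training job that
# periodically refreshes the model, and a daily scoring job that is
# "mostly just a dot product". Names and paths are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_offline(training_path: str) -> tuple[np.ndarray, float]:
    """Periodic offline job: fit the model and hand back its weights."""
    df = pd.read_parquet(training_path)
    features = ["days_active_30d", "tweets_sent_30d", "follows_30d"]
    model = LogisticRegression().fit(df[features], df["churned"])
    return model.coef_.ravel(), float(model.intercept_[0])

def score_daily(user_features: np.ndarray, weights: np.ndarray, intercept: float) -> np.ndarray:
    """Daily prediction job: aggregated features per user go through a dot
    product and a sigmoid to produce a churn probability score per user."""
    logits = user_features @ weights + intercept
    return 1.0 / (1.0 + np.exp(-logits))

# weights, intercept = train_offline("hdfs://path/to/churn_training.parquet")
# scores = score_daily(todays_feature_matrix, weights, intercept)
```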

We built the pipeline in a few weeks, confirmed that it had good predictive power, and rolled it out by writing the scores to Vertica, HDFS, and our internal key-value store at Twitter called Manhattan. Making the scores easily queryable by analysts, DS, and engineering services helped us evangelize and drive use cases for our model. And that was the biggest lesson I learned about building models in production.

So far I have deliberately ignored the steps one needs to take to build a ML model — framing the problem, defining labels, collecting training data, engineering features, building prototypes, and validating and testing the model objectively. These are obviously all important, but I feel they are fairly well taught, and a lot of good advice has already been given on this very subject.

I think most of the brilliant DS, especially Type A DS, have the opposite problem: they know how to do it right but are not sure how to push these models into the ecosystem. My recommendation is to talk to the Type B Data Scientists who have a lot of experience on this topic, find the set of skills that are needed, find the intersections, and hone your skills so you can pivot to these projects when the time is right. Let me close this section with the following quote:

“Machine learning is not equivalent to R scripts. ML is founded in math, expressed in code, and assembled into software. You need to be an engineer and learn to write readable, reusable code: your code will be reread more times by other people than by you, so learn to write it so that others can read it.” — Ian Wong, from his guest lecture at Columbia Data Science class

Well said.

Skills Used Here:

  • Pattern Recognition: Identifying problems that can be solved using modeling techniques
  • All the basics of modeling and ML: Exploratory data analysis, building features, feature selection, model selection, training/validation/testing, model evaluation
  • Productionization: All the things mentioned above about data pipelines. Set up your output so it can be queried by different people and services

Some Afterthoughts…

Being a Data Scientist is truly exciting, and the thrill of finding a particular insight can be as exhilarating as an adrenaline rush. Building a data pipeline or ML model from the ground up can be deeply satisfying, and there is a lot of fun in ‘playing God’ when running A/B tests. That said, this road ain’t easy or pretty; there will be a lot of struggles along the way, but I think a motivated and smart individual will pick these things up quickly.

Here is some additional information I found very useful along the way, and I hope you find it useful too:

Data Science and Software Engineering:

A/B Testing:

Recruiting:

It’s a long journey, and we are all still learning. Good luck, and most importantly, have fun!

