Interview: Joseph Babcock, Netflix on Discovery and Personalization from Big Data

We discuss the steps involved in the Discovery process at Netflix, the impact of the multitude of devices, system-generated logs, and surprising insights.

Twitter Handle: @hey_anmol

Joseph Babcock is currently a Senior Data Scientist working on Discovery & Personalization algorithms and data processing at Netflix.

Before Netflix, he studied computational biology at The Johns Hopkins University School of Medicine, where his PhD research in the Department of Neuroscience employed machine learning models to predict adverse side-effects of drugs.

He also previously worked at Chicago-based Accretive Health as a Data Scientist, focusing on data related to patient billing and referrals.

Here is the first part of my interview with him:

Anmol Rajpurohit: Q1. What are the typical steps involved in the Discovery process at Netflix, i.e. in helping users find the right content for their taste?

Joseph Babcock: Optimizing user discovery of content on Netflix, like many machine learning problems, has two major elements: trying to understand the patterns in the customer’s historical activities, and generalizing those patterns to future behavior.

The first part is where ‘Big Data’ processing, aggregation, and analysis are involved. By logging what content customers play, browse, and search for, we construct a profile of their interests, and perform exploratory analyses to examine which signals are more predictive than others for engagement (e.g., choosing to stream a program). We use promising signals from this analysis to prototype models offline and compare their performance against our current systems in an effort to identify potential hypotheses for improvement.

The second step involves experimentation: taking these hypotheses and testing if they successfully generalize in a head-to-head comparison with our existing algorithms by randomly assigning a subset of users to receive these alternative recommendations. Once we’ve evaluated whether a statistically significant difference exists between the performance of the old and new models (after controlling for potential confounding variables and looking for multiple points of evidence to support our conclusions), the cycle begins anew in a continuous process of innovation.
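To make the idea of a "statistically significant difference" concrete, here is a minimal sketch of one common way to compare two randomly assigned groups: a two-proportion z-test. It is an illustration only, not a description of Netflix's actual methodology. The function name, the choice of test, and the engagement counts are all hypothetical, and the sketch does not control for the confounding variables mentioned above.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(engaged_a, n_a, engaged_b, n_b):
    """Two-sided z-test for a difference between two engagement rates.

    engaged_a / n_a: engaged users and total users in the control group.
    engaged_b / n_b: the same for the group that saw the new model.
    Returns (z, p_value).
    """
    rate_a = engaged_a / n_a
    rate_b = engaged_b / n_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (engaged_a + engaged_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, invented purely for illustration.
z, p_value = two_proportion_z_test(1000, 10000, 1100, 10000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value only says the observed gap would be unlikely if the two models performed identically; as the answer above notes, a real evaluation would also look for multiple points of supporting evidence.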

AR: Q2. Netflix is being watched today over a wide multitude of devices. How does this impact your task of Discovery and Personalization?

JB: While the diversity of platforms affords our customers great flexibility in how they enjoy our product, it presents special challenges for our personalization efforts. For example, the website features an ‘instant search’ feature that generates results in real-time as users type queries, while Smart TV and mobile devices require you to explicitly input a search term; our algorithms need to accommodate both scenarios.

Similarly, our predictive models and the UIs that utilize them need to work on a mobile phone that can display only a few titles in a screen, as well as the large-screen TV in your living room. On the back end, this involves a lot of careful work to design flexible data schema in our warehouse that can accommodate inputs from multiple client platforms without obscuring important differences between them in user behavior.

AR: Q3. A great amount of Analytics at Netflix is based on system-generated logs. What are the main benefits and challenges of dealing with log data?

JB: A major benefit of our system logging is that our investment in cloud infrastructure makes it possible to scalably incorporate new devices, platforms and geographic regions as we continue our global expansion. Because this data is usually supplied in the form of JavaScript Object Notation (JSON), it is flexible for recording many kinds of data, and for changing what we are logging in an agile way.

This flexibility can also be a challenge, as the first step in most of our algorithmic data preparation is taking this unstructured data and transforming it into a more conventional row-column dataframe used for predictive modeling, which with nested data structures can require a lot of custom logic during Extract, Transform, and Load (ETL) jobs.
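As a rough illustration of the transformation described above, the following sketch flattens nested JSON log records into a row-column table using only the Python standard library. The field names (`user_id`, `event`, `device`, `query`) and the sample records are hypothetical, not Netflix's actual log schema, and a production ETL job would need considerably more custom logic.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`.

    Lists are kept as single values here; handling them is the kind of
    custom logic a real job would have to add.
    """
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical log lines; note that the two records do not share all fields.
raw_logs = [
    '{"user_id": 1, "event": "play", "device": {"type": "tv", "os": "example_os"}}',
    '{"user_id": 2, "event": "search", "query": "documentary", "device": {"type": "phone"}}',
]

rows = [flatten(json.loads(line)) for line in raw_logs]

# Build one consistent set of columns; fields a record lacks become None.
columns = sorted({column for row in rows for column in row})
table = [[row.get(column) for column in columns] for row in rows]

print(columns)
for row in table:
    print(row)
```

Because each record can carry different fields, the column set has to be derived from all records together, which is one reason schema-flexible logs add work at the data preparation stage.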

AR: Q4. What have been some of the most unexpected, surprising insights that you have obtained through analyzing log data at Netflix?

JB: I think we’re frequently surprised by the differences between data derived from user feedback (such as ratings or taste preference selections) and what interests are revealed by their actual behavior on the service. Many of our users claim to enjoy foreign films and documentaries; practical experience tells us that, come movie night, they might watch Family Guy instead.

The second part of the interview will be published soon.