Big Data Analytics at Netflix: Interview

In this interview, Kalantzis and Brown discuss the lessons learned from deploying Cassandra in a production EC2 environment at Netflix, the results of their experiments with MongoDB, and more.

ODBMS Blog, Roberto Zicari, Feb 2013.

Netflix, Inc. (NASDAQ: NFLX) is an online DVD and Blu-ray movie retailer offering streaming movies through video game consoles, Apple TV, TiVo and more.

Last year, Netflix had a total of 29.4 million subscribers worldwide for its streaming service.

I have interviewed Christos Kalantzis, Engineering Manager - Cloud Persistence Engineering, and Jason Brown, Senior Software Engineer, both at Netflix. They were involved in deploying Cassandra in a production EC2 environment at Netflix.


Q3. Why did you choose Apache Cassandra (C*)?

Kalantzis, Brown: There are several reasons we selected Cassandra. First, as Netflix is growing internationally, a solid multi-datacenter story is important to us. Configurable replication and consistency, as well as resiliency in the face of failure, are absolute requirements, and we have tested those capabilities more than once in production! Other compelling qualities include being an open source Apache project and having an active and vibrant user community.
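
To make "configurable replication and consistency" concrete, here is a minimal sketch of the quorum arithmetic behind tunable consistency in a replicated store such as Cassandra. It is generic and is not Netflix's configuration: the replication factor of 3 and the function names `quorum` and `reads_see_latest_write` are assumptions made up for this illustration.

```python
# Illustrative only: generic quorum arithmetic, not Netflix's actual settings.


def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


RF = 3  # assumed replication factor for this example
q = quorum(RF)

# Quorum writes plus quorum reads always overlap on at least one replica.
print("quorum size:", q)
print("quorum/quorum overlaps:", reads_see_latest_write(RF, q, q))

# Writing to one replica and reading from one replica trades that
# guarantee for lower latency and higher availability.
print("one/one overlaps:", reads_see_latest_write(RF, 1, 1))
```

With three replicas, a quorum is two nodes, so a quorum read is guaranteed to touch at least one replica that acknowledged the latest quorum write. Choosing smaller read and write sets gives up that guarantee in exchange for speed and availability, which is the kind of per-request trade-off the answer above refers to.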

Q6. What are the typical data insights you obtained by analyzing all of these data? Please give some examples. How do you technically analyze the data? And by the way, how large are your data sets?

Kalantzis, Brown: All the data Netflix gathers goes towards improving the customer experience. We analyze our data to understand viewing preferences, give great recommendations and make appropriate choices when buying new content.
Our BI team has done a great job with the Hadoop platform and has been able to extract the information we need from the terabytes of data we capture and store.

Q9. What other methods did you consider for continuous availability?

Kalantzis, Brown: We considered and experimented with MongoDB, yet the operational overhead and complexity made it unmanageable, so we quickly backed away from it. One team even built a sharded RDBMS cluster, with every node in the cluster being replicated twice. That solution is also very complex to manage. We are currently working to migrate that application to C*.
