Interview: Debora Donato, StumbleUpon on the Secret Sauce of Impressive Content Curation
We discuss the role of data science at StumbleUpon, the shift from search to discovery, metrics for user engagement, the art of collaborative filtering, how native ads improve user experience, major trends, advice and more.
Debora obtained a Ph.D. in Computer Engineering in 2005 from the University of Rome "La Sapienza". She has published more than 50 scientific papers and she has been serving on the program committee of top tier conferences in the area of Data Mining and Information Retrieval.
Here is my interview with her:
Anmol Rajpurohit: Q1. What does StumbleUpon do? What role does Data Science play in it?
Debora Donato: StumbleUpon is the easiest way to discover new and interesting things across the web. Over 30 million people turn to StumbleUpon across desktop and mobile to be informed, entertained and surprised by content that is recommended based on declared interests and activity.
Partners use StumbleUpon to distribute their content to influential and socially-engaged audiences by maintaining active accounts, sharing links, employing StumbleUpon badges, and creating StumbleUpon Lists.
The Data Science team works behind-the-scenes to understand the millions of people who use StumbleUpon. Gained insights are then leveraged to further our business objectives and help other teams across Advertising, Personalization, Site and Community understand how our users are interacting on StumbleUpon.
AR: Q2. During your presentations, you often asserted that "there is an ongoing shift from search to discovery". What do you mean by it?
DD: In the last decade, traditional Information Retrieval has been challenged by the democratization of publishing as a result of the ever-evolving Internet. Curated and uncurated content floods users in the most diverse formats (articles, tweets, blog posts), so searching for relevant information via traditional search engines is no longer sufficient. We need to propel a shift from information search to information flooding: a scenario in which all users’ activity is “the query” that allows relevant information to be routed directly to them.
StumbleUpon’s personalized recommendation engine is an attempt to solve this problem. Relevant sources are selected and shown to users based on their declared interests, explicit ratings and other implicit signals. Users join StumbleUpon to be entertained, to learn but also to discover useful resources.
AR: Q3. How do you define User Engagement? What approach do you follow at StumbleUpon to quantify, measure and monitor user engagement?
DD: To study and quantify User Engagement, I took inspiration from recent work in the field. The WWW 2013 tutorial on “Measuring User Engagement” brought to my attention the work of Attfield et al. (2011), where user engagement was defined as a “quality of the user experience that emphasizes […] the fact of being captivated by the technology.”
I often see engagement measured as return rate or retention, but these metrics focus on the effect of the engagement instead of looking at the engagement itself. To quantify engagement at StumbleUpon, we relate each user experience to retention. We separately monitor time spent engaging with the core stumbling experience and engagement with other features like Lists, Ratings, Sharing, etc.
AR: Q4. I consider StumbleUpon one of the best examples of collaborative filtering or curation. Can you share a high-level overview of how human opinions are merged with machine learning algorithms to generate such a great user experience?
DD: Users can express their opinions by submitting new content and rating recommended content.
We deeply rely on explicit ratings and other forms of implicit and explicit feedback to determine content quality. Although recent studies have related beauty, novelty and interestingness to statistical properties around content, we are not interested in judging quality without considering context. Therefore, we assess relevance for each segment of our user base.
We can also identify “Experts,” i.e., users who are trustworthy and likely to provide high quality feedback, and exploit their contributions at a level that may be higher than average.
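As an illustration only, and not a description of StumbleUpon's actual system, the idea of scoring content per user segment while giving trusted “Experts” more influence can be sketched as a weighted average. The function name `segment_scores`, the `expert_weight` value, the binary like/dislike rating scale, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def segment_scores(ratings, experts, expert_weight=2.0):
    """Return a weighted average rating for each (segment, url) pair.

    ratings: iterable of (user, segment, url, rating), rating is 1 (like) or 0 (dislike).
    experts: set of users whose feedback counts expert_weight times (hypothetical value).
    """
    totals = defaultdict(float)   # sum of weight * rating per (segment, url)
    weights = defaultdict(float)  # sum of weights per (segment, url)
    for user, segment, url, rating in ratings:
        w = expert_weight if user in experts else 1.0
        totals[(segment, url)] += w * rating
        weights[(segment, url)] += w
    return {key: totals[key] / weights[key] for key in totals}


# Hypothetical sample data: the same page is rated within two interest segments.
ratings = [
    ("ana", "photography", "example.com/a", 1),
    ("bo", "photography", "example.com/a", 0),
    ("cy", "cooking", "example.com/a", 0),
]
scores = segment_scores(ratings, experts={"ana"})
```

In this toy data the same page scores differently in each segment, which mirrors the point that relevance is assessed in context rather than as a single global quality value.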
AR: Q5. What do you mean by "native advertising"? Why is it a better experience for the user?
DD: With the oversaturation of traditional advertising in the form of banner ads, homepage takeovers, etc., we have seen the emergence of native advertising, or sponsored content, as a means of breaking through the clutter. Native content does this by delivering brand and publisher messaging to consumers in a more digestible way. It reads more editorially and helps keep users engaged because the content is not a traditional advertisement.
In an ideal scenario, users should not be able to discern between organic and sponsored content, and it is a better experience since the ad does not disrupt the flow of the user’s activity. In the case of StumbleUpon, sponsored content is just “another Stumble” that can be rated or shared.
AR: Q6. What are the major trends that you observe in the current research activities in the domain of Data Mining and Information Retrieval? How do you expect the research focus to change in the next 2-5 years?
DD: I have seen increasing interest in multimedia recommendation systems. For example, the first workshop on Recommender Systems for Television and online Video (RecSysTV) will be held in conjunction with RecSys. Although video recommendation faces unique challenges, the growing number of users who consume videos justifies the increasing demand for it.
Another research opportunity could come from the “Internet of Things” concept. We’re faced with challenges like the need to integrate and query a vast range of diverse content formats (from text to sensor data). Although we have witnessed some research efforts in the areas of data management and data integration, I have not seen progress capable of making the “Internet of Things” a reality.
AR: Q7. What is the best advice you have got in your career?
DD: Back when I was a young applied researcher at Yahoo!, one of the main frustrations for scientists was the difficulty of “productionizing” research results. Paradoxically, publishing at international conferences was “easier” than pushing research findings to production. When I expressed my frustration to the head of research, he suggested that I not complain and instead push harder. He told me that I was the only one who knew how good my results were and that if I was not able to find production support, it was because I was not communicating my results properly. Since then I have kept pushing.
AR: Q8. What key qualities do you look for when interviewing for Data Science related positions on your team?
DD: I look for people with experience in Machine Learning and data modeling, a tangible passion for managing big data, curiosity, and the capacity to mine the opportunities beyond the data.
AR: Q9. What was the last book that you read and liked? What do you like to do when you are not working?
DD: The last book I read was a classic: “The Old Man and the Sea.” I am currently reading “El amor en los tiempos del cólera.” Before the birth of my son, Luca, I enjoyed dancing tango, but now I mostly spend my free time (which is not much) with him.