Interview: Thomas Levi, POF on How Online Dating is Improving Matching through Big Data

We discuss Big Data use cases at Plenty of Fish, insights from text mining of user profiles, using topic modeling for developing user archetypes, challenges and more.

Thomas Levi started out with a doctorate in Theoretical Physics and String Theory from the University of Pennsylvania in 2006. His postdoctoral studies in cosmology and string theory, where he wrote 19 papers garnering 650+ citations, then took him to NYU and finally UBC. In 2012, he decided to move into industry, and took on the role of Senior Data Scientist at POF. Thomas has been involved in diverse projects such as behavior analysis, social network analysis, scam detection, bot detection, matching algorithms, topic modeling and semantic analysis.

Here is my interview with him:

Anmol Rajpurohit: Q1. What does PlentyOfFish(POF) do? What are the most important use cases of Big Data at PlentyOfFish?

Thomas Levi: PlentyOfFish is the world’s largest online dating site with over 80 million registered users.

Every day, 3.6 million unique users log into the site and send between 20K and 30K messages per minute. But what we’re most proud of is all of the relationships that are created as a result of the site.

We use data for a lot of things at PlentyOfFish, both internally and externally. We collect data on successful couples that have used the site and use that information to train and update our matching algorithms via neural networks among other things. Internally, we use data for diverse projects like scam detection, predicting user on-boarding/churn, user behavior as well as more advanced social network analysis to understand how users message and cluster together on the site.

AR: Q2. What is your approach towards text mining the user profiles and messages? What kind of insights does it provide?

TL: One of the most interesting things about working for PlentyOfFish is getting to work with a huge amount of data on real people. People behave very differently when they know they are answering survey questions than when they are acting naturally or describing themselves. In the case of the interests users put on their profiles, which is what I focused on for this project, there’s no system to game and no reason not to simply write what you’re actually interested in. This stands in sharp contrast to BuzzFeed-like questionnaires, where you might tailor your answers to get certain results. In our case, we can see how users and interests cluster together by things like location, gender or age. You can start to ask questions like: what do people do for fun in Texas vs. California? What group of people is most romantic? Nerdy? Hipster? Sometimes those results can surprise you.

AR: Q3. How do you use Topic Modeling and LDA to develop user archetypes? How does this help improve the matching algorithms?

TL: Topic modeling via LDA was originally developed to cluster and self-categorize text documents, such as articles, in a way that a human being would. That idea applies just as well to users on a dating or social media site.

We’re using it as a form of feature reduction. For example, one user might list “skiing” as an interest, while others might enter things like “snowboarding” or “ski touring”. A human looking at that list would likely conclude that all of those people are into mountain-based winter sports or outdoor sports. LDA allows us to take all of the things a user can write for their interests and group them naturally into a much smaller number of categories, which can be used to search and match on.
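To make the feature-reduction idea concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is not PlentyOfFish's implementation: the function name `fit_lda`, the hyperparameters `alpha` and `beta`, the number of topics, and the toy profiles are all assumptions chosen for illustration, and each listed interest is treated as a single token.

```python
import random


def fit_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables: tokens per (doc, topic), per (topic, word), per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z = [rng.randrange(n_topics) for _ in doc]
        assignments.append(z)
        for w, k in zip(doc, z):
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token, then resample its topic given all others.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates: per-user topic mixture and per-topic word distribution.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return vocab, theta, phi


# Hypothetical profiles: each inner list is one user's stated interests.
profiles = [
    ["skiing", "snowboarding", "hiking"],
    ["ski touring", "skiing", "hiking"],
    ["snowboarding", "ski touring", "hiking"],
    ["netflix", "movies", "video games"],
    ["movies", "video games", "books"],
    ["skiing", "netflix", "books"],
]
vocab, user_topics, topic_words = fit_lda(profiles, n_topics=2)
for interests, mix in zip(profiles, user_topics):
    print(interests, [round(p, 2) for p in mix])
```

Each user ends up described by `n_topics` numbers instead of a free-text list, which is the reduction described above. On a corpus this small the topics are not guaranteed to be clean, so the sketch shows the mechanics rather than a realistic result.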

There are a few properties of LDA that make it a particularly good choice for finding user archetypes. The first is that the topics, and which words have a high lift in them, are determined organically; that is, the actual users on the site determine them. This eliminates any bias that might be introduced. If every user who listed “snowboarding” also listed “puppies” then those would be very likely to occur high in a topic together (spoiler alert: they don’t). The second property is that LDA is a mixture model. What that means is that each user does not end up as just one thing, but can be a mix of various topics with different weights. If you think about how people actually are, that’s a good description. For example, I enjoy outdoor sports, but also nerdy TV, books and video games. Simply listing that I’m into one of those things doesn't capture me as a whole. This model correctly labels me as a mix of all of these things.
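As a small illustration of "lift", the sketch below assumes the usual definition: a word's probability within a topic divided by its probability across the whole corpus. The function name `word_lift` and all counts are made up for the example and do not come from PlentyOfFish data.

```python
def word_lift(topic_counts, corpus_counts):
    """Lift of each word: P(word | topic) / P(word in the whole corpus)."""
    topic_total = sum(topic_counts.values())
    corpus_total = sum(corpus_counts.values())
    return {
        word: (count / topic_total) / (corpus_counts[word] / corpus_total)
        for word, count in topic_counts.items()
    }


# Hypothetical token counts for one "outdoors" topic and for the full corpus.
outdoors_counts = {"skiing": 30, "snowboarding": 25, "hiking": 35, "netflix": 10}
corpus_counts = {"skiing": 40, "snowboarding": 30, "hiking": 80, "netflix": 250}

lift = word_lift(outdoors_counts, corpus_counts)
ranked = sorted(lift, key=lift.get, reverse=True)
for word in ranked:
    print(f"{word}: {lift[word]:.2f}")
```

In this toy data "hiking" is the most frequent word in the topic, but "snowboarding" and "skiing" rank higher by lift because they are rarer elsewhere, while "netflix" has a lift below 1 because it is common everywhere.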

We’re not using this model live yet, as we’re still discussing implementations for it. It can be used as the basis for a matching algorithm on its own, as a factor in some of our other matching algorithms, or as a way of showing users similar to the one someone is viewing. I believe its strongest potential is in search, as we can allow our members to enter whatever they want their potential match to be interested in and show them thematic matches, e.g. typing in “skiing and Netflix” will get you matches interested in outdoor sports and TV/movies. If your readers want to see it, they should let us (or me) know.
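One way such a thematic search could work is sketched below. Since the interview says the model is not live, this is purely a guess at the mechanics: the query words are mapped to a topic mixture, and users are ranked by cosine similarity between that mixture and their own. The topic order (outdoors, screen, food), the probabilities, the user names, and the helper names `query_mixture`, `cosine` and `thematic_search` are all hypothetical.

```python
import math


def query_mixture(words, word_given_topic, n_topics):
    """Average P(topic | word) over known query words, assuming a uniform topic prior."""
    known = [w for w in words if w in word_given_topic]
    if not known:
        return [1.0 / n_topics] * n_topics  # no signal: fall back to uniform
    mix = [0.0] * n_topics
    for w in known:
        probs = word_given_topic[w]
        total = sum(probs)
        for k, p in enumerate(probs):
            mix[k] += p / total / len(known)
    return mix


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def thematic_search(words, word_given_topic, user_topics):
    """Return user names ordered from best to worst thematic match."""
    n_topics = len(next(iter(user_topics.values())))
    query = query_mixture(words, word_given_topic, n_topics)
    return sorted(user_topics, key=lambda u: cosine(query, user_topics[u]), reverse=True)


# Hypothetical P(word | topic) for topics ordered as [outdoors, screen, food].
word_given_topic = {
    "skiing": [0.40, 0.01, 0.01],
    "hiking": [0.40, 0.01, 0.02],
    "netflix": [0.01, 0.50, 0.02],
    "movies": [0.02, 0.40, 0.02],
    "cooking": [0.01, 0.02, 0.60],
}
# Hypothetical per-user topic mixtures, as an LDA fit might produce.
user_topics = {
    "ana": [0.50, 0.45, 0.05],
    "ben": [0.90, 0.05, 0.05],
    "cal": [0.05, 0.05, 0.90],
}

results = thematic_search(["skiing", "netflix"], word_given_topic, user_topics)
print(results)
```

Because the match happens in topic space, a user who never typed "skiing" can still surface for that query if their profile leans toward the same outdoors topic.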

AR: Q4. What are the biggest challenges in harnessing the tremendous power of Big Data available in the form of user content and activities on PlentyOfFish?

TL: Our challenges fall into three rough categories. The first is the technical challenge of data gathering and what we now call data engineering. Building out a system to measure every action users can take on a site at our scale is an extremely complex task. It’s made even more so because users on a dating site interact with each other, so, much like on Facebook, network and social graph effects add another layer of complexity. Once that system is built, we have to store the data, and store it in a way that a Data Scientist like myself can query and work with. I work closely with the team building out this system and I've done a lot of work making custom packages in things like R to handle some of the data.

The second is a set of business-goal challenges. With all of that data, and all of the things we can do, how do I make the most impact? What’s highest value? I’m currently the only Data Scientist on the research team (though our research developers do a lot of data science too), so choosing projects is not always an easy task.

The third challenge is educational or cultural.

Coming from an academic science background, I have a lot of experience with, and faith in, the scientific method, rigorous statistics and a culture of testing and experimentation. That’s not necessarily the norm in the business world. A large component of my job is moving the culture as much as I can in that direction so we’re always making educated, informed and data-driven decisions.

The second and last part of the interview.