Interview: Thomas Levi, POF on How Online Dating is Improving Matching through Big Data
We discuss Big Data use cases at Plenty of Fish, insights from text mining of user profiles, using topic modeling for developing user archetypes, challenges and more.
Here is my interview with him:
Anmol Rajpurohit: Q1. What does PlentyOfFish(POF) do? What are the most important use cases of Big Data at PlentyOfFish?
TL: Every day, 3.6 million unique users log into the site and send between 20K and 30K messages per minute. But what we’re most proud of is all of the relationships that are created as a result of the site.
We use data for a lot of things at PlentyOfFish, both internally and externally. We collect data on successful couples that have used the site and use that information to train and update our matching algorithms via neural networks among other things. Internally, we use data for diverse projects like scam detection, predicting user on-boarding/churn, user behavior as well as more advanced social network analysis to understand how users message and cluster together on the site.
AR: Q2. What is your approach towards text mining the user profiles and messages? What kind of insights does it provide?
TL: One of the most interesting things about working for PlentyOfFish is getting to work with a huge amount of data on real people.
AR: Q3. How do you use Topic Modeling and LDA to develop user archetypes? How does this help improve the matching algorithms?
TL: Topic modeling via LDA was originally developed to cluster and self-categorize text documents for things like articles in a way that a human being would. That idea applies just as well to users on a dating or social media site. We’re using it as a form of feature reduction. For example, a user might list “skiing” as an interest, while others might enter things like “snowboarding” or “ski touring”. A human looking at that list would likely conclude that all of those people are into mountain-based winter sports or outdoor sports. LDA allows us to take all of the things a user can write for their interests and group them naturally into a much smaller number of categories which can be used to search and match on.
There are a few properties of LDA that make it a particularly good choice for finding user archetypes. The first is that the topics, and which words have a high lift in them, are determined organically; that is, the actual users on the site determine them. This eliminates any bias that might be introduced. If every user who listed “snowboarding” also listed “puppies”, then those would be very likely to occur high in a topic together (spoiler alert: they don’t). The second property is that LDA is a mixture model. What that means is that each user does not end up as just one thing, but can be a mix of various topics with different weights. If you think about how people actually are, that’s a good description. For example, I enjoy outdoor sports, but also nerdy TV, books and video games. Simply listing that I’m into one of those things doesn’t capture me as a whole. This model correctly labels me as a mix of all of these things.
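To make the mixture-model idea concrete, here is a minimal sketch of LDA fitted with a small collapsed Gibbs sampler in plain Python. It is an illustration only, not PlentyOfFish's code: the user names, interest lists, the function `fit_lda`, and the hyperparameters `alpha` and `beta` are all assumptions invented for the example.

```python
import random

def fit_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                            # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove this token from the counts, then resample its topic.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Each document (user) ends up as a weighted mix of topics.
    mixtures = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, topic_word, mixtures

# Hypothetical users and interests, invented for illustration.
users = {
    "ann": ["skiing", "snowboarding", "hiking"],
    "bob": ["snowboarding", "ski touring", "hiking"],
    "cat": ["netflix", "video games", "books"],
    "dan": ["books", "netflix", "video games"],
    "eve": ["skiing", "hiking", "netflix", "books"],
}

vocab, topic_word, mixtures = fit_lda(list(users.values()), n_topics=2)
for name, mix in zip(users, mixtures):
    print(name, [round(p, 2) for p in mix])
```

Each user's row is a set of topic weights that sums to 1, which is the "mix of various topics with different weights" described above. On real data one would more likely reach for an existing library implementation than a hand-rolled sampler like this.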
We’re not using this model live yet, as we’re in the process of discussing implementations for it. It can be used as the basis for a matching algorithm on its own, as a factor in some of our other matching algorithms, or as a way of showing users similar to the one someone is viewing. I believe its strongest potential is in search, as we can allow our members to enter whatever they want their potential match to be interested in and show them thematic matches, e.g. typing in “skiing and Netflix” will get you matches interested in outdoor sports and TV/movies. If your readers want to see it, they should let us (or me) know.
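One way such a thematic search could work is sketched below: turn the query words into a topic vector, then rank users by how closely their topic mixture matches it. This is a guess at the general shape, not a description of any PlentyOfFish system. The topic labels, the per-word topic weights in `WORD_TOPICS`, the user mixtures, and the choice of cosine similarity are all assumptions made up for the example.

```python
import math

# Hypothetical topic labels; a fitted LDA model would only give numbered topics.
TOPICS = ["outdoor sports", "tv and movies", "books and games"]

# Hypothetical per-word topic weights, hand-written for illustration.
WORD_TOPICS = {
    "skiing": [0.90, 0.05, 0.05],
    "snowboarding": [0.90, 0.05, 0.05],
    "netflix": [0.05, 0.90, 0.05],
    "books": [0.05, 0.05, 0.90],
}

# Hypothetical per-user topic mixtures (each row sums to 1).
USER_MIXTURES = {
    "ann": [0.80, 0.10, 0.10],
    "bob": [0.45, 0.45, 0.10],
    "cat": [0.10, 0.10, 0.80],
    "dan": [0.10, 0.80, 0.10],
}

def query_vector(query):
    """Average the topic weights of the query words we recognize."""
    words = [w for w in query.lower().split() if w in WORD_TOPICS]
    if not words:
        return None
    return [
        sum(WORD_TOPICS[w][t] for w in words) / len(words)
        for t in range(len(TOPICS))
    ]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query, top_n=3):
    """Return the top_n users whose topic mix best matches the query."""
    q = query_vector(query)
    if q is None:
        return []
    ranked = sorted(
        USER_MIXTURES,
        key=lambda user: cosine(q, USER_MIXTURES[user]),
        reverse=True,
    )
    return ranked[:top_n]

results = search("skiing and Netflix")
print(results)
```

In this toy data the user who mixes outdoor sports with TV ranks first, ahead of users who are strong in only one of the two themes, and the query never has to match anyone's listed interests word for word.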
AR: Q4. What are the biggest challenges in harnessing the tremendous power of Big Data available in the form of user content and activities on PlentyOfFish?
TL: Our challenges fall into three rough categories. The first is the technical challenge of data gathering and what we now call data engineering. Building out a system to measure every action
The second is a set of business-goal challenges. With all of that data, and all of the things we can do, how do I make the most impact? What’s highest value? I’m currently the only Data Scientist on the research team (though our research developers do a lot of data science too), so choosing projects is not always an easy task.
The third challenge is educational or cultural.
Coming from an academic science background, I have a lot of experience with, and faith in, the scientific method, rigorous statistics and a culture of testing and experimentation. That’s not necessarily the norm in the business world. A large component of my job is moving the culture as much as I can in that direction so we’re always making educated, informed and data-driven decisions.