Interview: Prateek Jain, Director of Engineering, eHarmony on Fast Search and Sharding
We discuss Big Data architecture, fast multi-attribute searches, database sharding and scaling challenges at eHarmony.
Prateek Jain is Director of Engineering at Santa Monica based eHarmony (leading online dating website) where he is responsible for running the engineering team that builds systems responsible for all of eHarmony's matchmaking. Prior to this he spent numerous years building cloud based image processing systems and Network Management Systems in the Telecom domain. His areas of interest include Distributed Systems and High Scalability.
Prateek recently delivered a talk at Big Data Innovation Summit 2014 held in Santa Clara on “Stores Behind the Doors”.
Here is my interview with him:
Anmol Rajpurohit: Q1. What factors have had the most influence on the architectural changes for Big Data at eHarmony?
Prateek Jain: Our ultimate goal here at eHarmony is to provide each and every user a unique experience that is tailored to their personal preferences as they navigate through this very emotional process in their life. The more efficiently we are able to process our data assets the closer we get to our goal. Most of the architectural decisions are driven by this core philosophy.
A lot of data driven companies in internet space have to derive information about their users indirectly, whereas at eHarmony we have a unique opportunity in the sense that our users willingly share a lot of structured information with us, hence our big data infrastructure is geared more towards efficiently handling and processing large volumes of structured data, unlike other companies where systems are geared more towards data collection, handling and normalization. That said we also handle a lot of unstructured data.
AR: Q2. In your talk, you mentioned that the eHarmony user data has over 250 attributes. What are the key design factors to enable fast multi-attribute searches?
PJ: Here are the key things to consider when trying to build a system that can handle fast multi-attribute searches
- Understand the nature of your problem and pick the right technology that fits your needs. In our case the multi-attribute searches were heavily influenced by Business rules at each phase and hence instead of using a traditional search engine we used MongoDB.
- Having a good indexing strategy is pretty important. When performing large, variable, multi-attribute searches, have a decent number of indexes, cover the major types of queries and the worst performing outliers. Before finalizing the indexes ask yourself:
- Which attributes are present in every query?
- What are the best performing attributes when present?
- What should my index look like when no high-performing attributes are present?
- Omit ranges in your queries unless they are absolutely critical; ask yourself:
- Can I replace this with $in clause?
- Can this be prioritized in its own index?
- Should there be a version of this index with or without this particular attribute?
- With MongoDB ordering in indexes is very important. the order should be:
- First, fields on which you will query for exact values
- Second, fields on which you will sort
- Finally, fields on which you will query for a range of values
AR: Q3. Why is it important to have built-in sharding? Why is it a good practice to isolate queries to a shard?
PJ: For most modern distributed datastores performance is the key. This often requires indexes or data to fit entirely in memory, as your data grows it doesn't remain true and hence the need to split the data into multiple shards. If you have a rapidly growing dataset and performance continues to remain the key then using a datastore that supports built-in sharding becomes critical to continued success of your system as it
- Prevents you from spending a lot of time trying to managing the cluster
- Prevents your applications from having to worry about internals of how data is stored
As for why is it a good practice to isolate queries to a shard, I'll use the example of MongoDB where "mongos" a client side proxy that provides a unified view of the cluster to the client, determines which shards have the required data based on the cluster metadata and directs the query to the required shards. Once the results are returned from all the shards "mongos" merges the sorted results and returns the complete result to the client.
Now in this scenarios "mongos" has to wait for results to be returned from all shards before it can start returning results to client, which slows everything down. If all the queries can be isolated to a shard then it can avoid this excessive wait and return the results faster. Hence it is a good idea to examine possible set of queries before hand and use that information to come up with a effective shard key.
This phenomenon will apply pretty much to any sharded data-store in my opinion. For the stores that do not support built-in sharding, it'll be your application that'll have to do the job of "mongos".
AR: Q4. How did you select the 3 specific types of data stores (Document/Key Value/Graph) to resolve the scaling challenges at eHarmony?
PJ: The decision of choosing a specific technology is always driven by the needs of the application. Each of these different types of data-stores have their own advantages and limitations. Staying prudent to these facts we've made our choices. For example:
- One of our applications required a data-store of Schema free nature, with not much cluster management overhead, yet it needed to support fast multi-attribute queries. Looking at the available choices MongoDB seemed to be the best fit.
- For some of our services that we were building to move towards SOA, we were in need for a simple data-store that could provide a quick key value lookup and had a good Java API, we ended up going with project Voldemort.
- As for using the Graph databases we had this application which really had the need to persist and traverse relationships effectively yet at the same time have a flexible enough schema to add new relationships on the fly. Graph databases perfectly fit the bill.
And in some cases where your choice of the data-store is lagging in performance for some functionality but doing an excellent job for the other, you should be open to Hybrid solutions.
AR: Q5. Which of the current trends in Big Data arena are of great interest to you?
PJ: These days I'm particularly interested in whats happening in the Online Machine learning space and the innovation that's happening around commoditizing Big Data Analysis.
AR: Q6. What key qualities do you look for when interviewing for Data Science related positions on your team?
PJ: For me it would be:
- Software Engineering skills
- Excellent knowledge of Statistics and math
- Commitment and creativity
- Domain knowledge of the business, it allows them to know what are the things to focus on when looking at the data.
AR: Q7. What are your favorite books on Data Science?
PJ: Actually none, I try to do my learning via Academic papers, Blogs, conferences and talks.