Interview: Sastry Malladi, StubHub on Designing Big Data Architecture for the Unknown Future

We discuss the Big Data architecture at StubHub, important factors in architecture design, hybrid approach of using Big Data along with traditional data warehouses, challenges, importance of meta-data and more.

By Anmol Rajpurohit on July 28, 2014 in Architecture, Challenges, Design, Hadoop, Interview, Metadata, Personalization, Recommendation, Sastry Malladi, StubHub

Sastry Malladi is currently the Chief Architect at StubHub, an eBay company, responsible for the overall technology architecture, strategy and direction, including Big Data Platform development.

Sastry is a veteran technologist with two and a half decades of experience developing, leading and architecting highly scalable, distributed systems. Before transitioning to StubHub, he led the architecture transformation of eBay from its monolithic architecture to the distributed, scalable, service-oriented architecture that it is today. Prior to joining eBay, Sastry was co-founder and CTO of OpenGridSolutions, founding member and architect at SpikeSource, and an architect at Oracle.

Here is my interview with him:

Anmol Rajpurohit: Q1. What are the main goals of the Big Data architecture at StubHub?

Sastry Malladi: Our vision is to make StubHub a worldwide destination for the end-to-end fan experience, which includes discovering, purchasing and sharing post-event experiences. In order to achieve that lofty goal, we need to understand our customers and their interaction patterns better. We need to be able to personalize their experience and recommend events based on their preferences and interests. We also need to understand how well the experience is working out for our customers, so that a continuous feedback loop can improve it. Of course, we need to keep an eye on fraudsters in the process.

In order to successfully do all of the above, we need to analyze data coming from many sources. The big data platform is a great place to do that analysis (via what are called Map Reduce jobs) and to feed the results to the appropriate systems. So we began our journey on the Big Data Architecture last year and have made good progress so far.

AR: Q2. What are the most important factors involved in the design of Big Data architecture?

SM: One thing we know for sure about Big Data systems is that they are still evolving (for the better), and whatever components we develop or put in place today are bound to change a year or more from now. So it is important to put an architecture in place that gives us the flexibility to adapt, while still protecting our investment.

The six key aspects of the architecture that we think give us this leverage are the following (in no particular order):

Manageability – Ability to easily manage and operationalize the system
Open Source Compliance – Compliance with open source components and standards is key to be able to leverage many evolving open source tools
Scalability – Ability to horizontally scale
Adaptability (Data Import) – Ability to import various kinds of data sources and formats
Flexibility (Data Export) – Ability to export the data result sets to wherever needed
Integration with Visualization / reporting tools – Integration with either existing or new visualization tools

AR: Q3. Though the Hadoop ecosystem has matured immensely over a short span of time, many organizations including StubHub are currently taking a hybrid approach i.e. using Big Data technology as well as traditional transactional databases. Is this just an intermediate phase towards having a Big Data oriented architecture? Or are there any specific benefits in sticking to the hybrid approach? How do you see this change in the near future?

SM: Great question.

While the Hadoop ecosystem has evolved immensely over the past few years, and has demonstrated the power of its analysis framework (aka Map Reduce framework), organizations have not moved at the same pace. As a result, a hybrid approach of using existing Data Warehouses and big data platforms is commonly employed. This is true for organizations that have existing data warehouse infrastructure and tools in place.

While it is now possible to collect, store, manage and process the data at higher volumes and velocity in Hadoop systems, access to the resulting data by business people is still not completely streamlined. There has been great progress in terms of the tools, but not at the level of maturity that is expected.
There is still a big learning curve for people in understanding how data is represented and what kind of queries are optimal to execute on Hadoop systems etc.
Talent – Organizations will take time to find and hire the right talent in this space, just like any other emerging technology space.

I think that over the next 3 to 5 years, organizations will mature in their use of Hadoop technology and will slowly deprecate data warehouses for active data processing. However, the results of the data analysis may still continue to be stored in a data warehouse for easier access by existing reporting tools. This hypothesis applies to organizations that have already begun their big data journey.

AR: Q4. What are the major challenges in processing and analyzing Big Data, particularly for an e-commerce marketplace firm?

SM: The challenges can be categorized into two buckets, namely technical and operational. The technical challenges are relatively easy to deal with, while the operational challenges take time to address.

Technical challenges

Bringing (ingesting) the required data sources into Hadoop. More often than not, these data sources are scattered and may not even be internally available. Their data formats (e.g., name-value pairs, JSON, relational) vary widely, not to mention the velocity of change.
Data quality – How good and complete is the collected data?
Tooling to access data from Big Data systems is still evolving rapidly.
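
To make the ingestion challenge above concrete, here is a minimal Python sketch of normalizing two of the formats mentioned (name-value pairs and JSON) into one common record shape before loading. It is only an illustration, not StubHub's pipeline: the field names, the `&`/`=` separators and the function names are all assumptions.

```python
import json

def parse_name_value(line, pair_sep="&", kv_sep="="):
    """Parse a line such as 'event_id=42&city=Boston' into a dict."""
    record = {}
    for pair in line.strip().split(pair_sep):
        if not pair:
            continue  # skip empty fragments such as a trailing separator
        key, _, value = pair.partition(kv_sep)
        record[key.strip()] = value.strip()
    return record

def parse_json(line):
    """Parse one JSON object per line."""
    return json.loads(line)

# Hypothetical registry mapping a format tag to its parser.
PARSERS = {"nv": parse_name_value, "json": parse_json}

def normalize(line, fmt):
    """Convert one raw line into a flat dict with string keys and values."""
    raw = PARSERS[fmt](line)
    return {str(k): str(v) for k, v in raw.items()}

records = [
    normalize("event_id=42&city=Boston", "nv"),
    normalize('{"event_id": 43, "city": "Austin"}', "json"),
]
print(records)
```

Keeping one parser per source format behind a common `normalize` step means that a new or changed format only touches its own parser, which is one way to cope with the velocity of change described above.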

Operational challenges

Working with Big Data systems requires different skillsets than working with traditional data warehouses.
Educating and training people on how to write effective Map Reduce jobs to get the data they want
Maintaining multiple systems and keeping track of where the source of truth is etc.

AR: Q5. How mature are the current personalization and recommendation systems? What key trends do you observe in this area?

SM: In my opinion, recommendation systems are fairly mature, while personalization systems (which involve more than just recommendations) are still evolving. Personalization will be a key aspect of any consumer-oriented website or application.

The growth of personal mobile devices is dramatically changing this picture too, making it possible to capture a user’s implicit preferences and behaviors and to personalize the content accordingly. The distinction between search and recommendations is also dwindling, as search systems are smart enough these days to return personalized, recommended results.

AR: Q6. What do you mean by "meta-classification of data"? Why is it so important in the current context?

SM: When you are dealing with large volumes of data, it is important to be able to bucket the data into different “sets” that are naturally connected through some association. Said differently, we need to classify the data based on some metadata. For example, in the StubHub use case, we have lots of events and the data associated with them. But let’s say someone is trying to look for events that are “family friendly”. That’s one example of a meta-classification. How do we know which events are family friendly? What determines family-friendliness? This classification typically starts as a manual process and is then automated through machine learning.
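
As a rough illustration of the manual-then-automated flow described above, the following Python sketch starts from a handful of hand-labeled event titles and derives a naive word-count score for tagging new events as “family friendly”. Everything here is hypothetical (the sample titles, the labels and the scoring rule), and a real system would use a proper machine-learning classifier rather than this toy scorer.

```python
from collections import Counter

# Hypothetical, manually labeled events (the "manual process" stage).
LABELED = [
    ("Disney On Ice matinee for kids", True),
    ("Family circus afternoon show", True),
    ("Late night heavy metal festival", False),
    ("Adults only comedy night", False),
]

def tokenize(text):
    """Lowercase a title and split it on whitespace."""
    return text.lower().split()

def train(labeled):
    """Count word occurrences in family-friendly vs. other event titles."""
    pos, neg = Counter(), Counter()
    for title, is_family in labeled:
        (pos if is_family else neg).update(tokenize(title))
    return pos, neg

def is_family_friendly(title, pos, neg):
    """Sum (positive count - negative count) over the title's words."""
    score = sum(pos[w] - neg[w] for w in tokenize(title))
    return score > 0

pos, neg = train(LABELED)
print(is_family_friendly("Kids circus show", pos, neg))
print(is_family_friendly("Heavy metal night", pos, neg))
```

The resulting boolean can be stored as a metadata tag on each event, so that a query for “family friendly” events becomes a simple filter on that tag.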

AR: Q7. What advice would you give to Data Science students and researchers who are just starting to work in this area?

SM: My simple advice would be to be cognizant of the fact that the technology landscape in this space is changing rapidly, and to be patient and ready to adapt as appropriate. But the sky is the limit in terms of the value they are going to get from big data systems.

AR: Q8. What are your favorite books on Data Science?

SM: I don’t specifically focus on data science and algorithms per se, but more on the platform and framework aspects that enable data scientists to do their analysis. While I read some books on this, I usually get a lot more information, and the latest trends, from the Apache Hadoop and related websites.
