Interview: Sastry Malladi, StubHub on Designing Big Data Architecture for the Unknown Future
We discuss the Big Data architecture at StubHub, important factors in architecture design, the hybrid approach of using Big Data alongside traditional data warehouses, challenges, the importance of metadata and more.
By Anmol Rajpurohit on July 28, 2014 in Architecture, Challenges, Design, Hadoop, Interview, Metadata, Personalization, Recommendation, Sastry Malladi, StubHub
Sastry is a veteran technologist with two and a half decades of experience developing, leading and architecting highly scalable distributed systems. Before moving to StubHub, he led the transformation of eBay's architecture from a monolith to the distributed, scalable, service-oriented architecture that it is today. Prior to joining eBay, Sastry was co-founder and CTO of OpenGridSolutions, a founding member and architect at SpikeSource, and an architect at Oracle.
Here is my interview with him:
Anmol Rajpurohit: Q1. What are the main goals of the Big Data architecture at StubHub?
Sastry Malladi: Our vision is to make StubHub a worldwide destination for end-to-end fan experience that includes discovering, purchasing and
To do all of the above successfully, we need to analyze data coming from many sources. The big data platform is a great place to start: we run the analysis there (via what are called MapReduce jobs) and feed the results to the appropriate systems. We began our journey on the Big Data architecture last year and have made good progress so far.
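For readers unfamiliar with the term, the shape of a MapReduce job can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle and reduce phases, not StubHub's code and not the Hadoop API; the record fields (`event`, `tickets`) and the function names are assumptions made up for the example.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input records; the field names are invented for illustration.
SALES = [
    {"event": "concert", "tickets": 2},
    {"event": "baseball", "tickets": 4},
    {"event": "concert", "tickets": 1},
]


def map_phase(record):
    """Emit (key, value) pairs for a single input record."""
    yield record["event"], record["tickets"]


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values seen for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an in-memory list."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


totals = run_job(SALES)
print(totals)
```

On a real cluster the map and reduce functions run in parallel across many machines and the framework performs the grouping step; the division of work between the two functions stays the same.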
AR: Q2. What are the most important factors involved in the design of Big Data architecture?
SM: One thing we know for sure about Big Data systems is that they are still evolving (for the better), and whatever components we develop or put in place today are bound to change a year or more from now. So it is important to put an architecture in place that gives us the flexibility to adapt while still protecting our investment.
The six key aspects of the architecture that we think give us this leverage are the following (in no particular order):
- Manageability – Ability to easily manage and operationalize the system
- Open Source Compliance – Compliance with open source components and standards is key to be able to leverage many evolving open source tools
- Scalability – Ability to horizontally scale
- Adaptability: Data Import – Ability to import various kinds of data sources and formats
- Flexibility: Data Export – Ability to export the data result sets to wherever needed
- Integration with Visualization / reporting tools – Integration with either existing or new visualization tools
AR: Q3. Though the Hadoop ecosystem has matured immensely over a short span of time, many organizations including StubHub are currently taking a hybrid approach i.e. using Big Data technology as well as traditional transactional databases. Is this just an intermediate phase towards having a Big Data oriented architecture? Or are there any specific benefits in sticking to the hybrid approach? How do you see this change in the near future?
SM: While the Hadoop ecosystem has evolved immensely over the past few years and has demonstrated the power of its analysis framework (the MapReduce framework), organizations have not moved at the same pace. As a result, a hybrid approach of using existing data warehouses alongside big data platforms is commonly employed, particularly by organizations that already have data warehouse infrastructure and tools in place. Several factors contribute to this:
- While it is now possible to collect, store, manage and process the data at higher volumes and velocity in Hadoop systems, access to the resulting data by business people is still not completely streamlined. There has been great progress in terms of the tools, but not at the level of maturity that is expected.
- There is still a steep learning curve in understanding how data is represented and which kinds of queries run efficiently on Hadoop systems.
- Talent – Organizations will take time to find and hire the right talent in this space, just like any other emerging technology space.
I think that over the next 3 to 5 years, organizations will mature in their use of Hadoop technology and will slowly deprecate data warehouses for active data processing. However, the results of the data analysis may still be stored in a data warehouse for easier access by existing reporting tools. This hypothesis applies to organizations that have already begun their big data journey.
AR: Q4. What are the major challenges in processing and analyzing Big Data, particularly for an e-commerce marketplace firm?
SM: The challenges fall into two buckets: technical and operational. The technical challenges are relatively easy to deal with, while the operational challenges take time to address.
Technical challenges
- Bringing (ingesting) the required data sources into Hadoop. More often than not, these data sources are scattered and may not even be available internally. Their data formats (e.g., name-value pairs, JSON, relational) vary widely, not to mention the velocity of change.
- Data quality – How good and complete is the collected data
- Tooling to access data from Big Data systems is still evolving rapidly.
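The ingestion and data-quality points above can be made concrete with a small sketch: records arriving as name-value pairs, JSON and relational rows are normalized into one common shape, then checked for completeness. This is a minimal illustration under stated assumptions, not a description of StubHub's pipeline; every function name, separator and field name below is hypothetical.

```python
import json


def from_name_value(line, pair_sep="&", kv_sep="="):
    """Parse text such as 'user=42&event=concert' into a dict of strings."""
    record = {}
    for pair in line.split(pair_sep):
        if not pair:
            continue  # skip empty segments, e.g. from a trailing separator
        key, _, value = pair.partition(kv_sep)
        record[key.strip()] = value.strip()
    return record


def from_json(text):
    """Parse a JSON document and require it to be a single object."""
    obj = json.loads(text)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    return obj


def from_row(columns, row):
    """Pair relational column names with the values of one row."""
    if len(columns) != len(row):
        raise ValueError("column count does not match row length")
    return dict(zip(columns, row))


def missing_fields(record, required):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in required if record.get(name) in (None, "")]


# Hypothetical inputs in the three formats mentioned above.
records = [
    from_name_value("user=42&event=concert"),
    from_json('{"user": 7, "event": "baseball", "price": 35.0}'),
    from_row(("user", "event", "price"), (9, "", 20.0)),
]

for rec in records:
    print(rec, "missing:", missing_fields(rec, ["user", "event", "price"]))
```

Keeping one parser per format and a single shared record shape means a new source only needs a new parser, while the quality check stays unchanged.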
Operational challenges
- Working with Big Data systems requires different skill sets than working with traditional data warehouses.
- Educating and training people on how to write effective MapReduce jobs to get the data they want
- Maintaining multiple systems and keeping track of where the source of truth is etc.
AR: Q5. How mature are the current personalization and recommendation systems? What key trends do you observe in this area?
SM: In my opinion, recommendation systems are fairly mature, while personalization systems (which cover more than just recommendations) are still evolving. Personalization will be a key aspect of any consumer-oriented website or application.
The growth of personal mobile devices is dramatically changing this picture too, making it possible to capture a user’s implicit preferences and behaviors and to personalize the content accordingly. The distinction between search and recommendations is also dwindling, as search systems are smart enough these days to return personalized, recommended results.
AR: Q6. What do you mean by "meta-classification of data"? Why is it so important in the current context?
SM: When you are dealing with large volumes of data, it is important to be able to bucket the data into
AR: Q7. What advice would you give to Data Science students and researchers who are just starting to work in this area?
SM: My simple advice would be to be cognizant of the fact that the technology landscape in this space is changing rapidly, and to be patient and ready to adapt as appropriate. But the sky is the limit in terms of the value they are going to get from big data systems.
AR: Q8. What are your favorite books on Data Science?
SM: I don’t specifically focus on data science and algorithms per se, but more on the platform and framework aspects that enable data scientists to do their analysis. While I read some books on this, I usually get a lot more information and the latest trends from the Apache Hadoop and related websites.
Related:
- Interview: Cliff Lyon, Stubhub on Mastering the Art of Recommendation and Personalization Analytics
- Media Industry Embracing Analytics for Innovation and Competitive Edge
- Is Data Scientist the right career path for you? Candid advice