Interview: Vasanth Kumar, Principal Data Scientist, Live Nation
We discuss challenges in analyzing bursty data, real-time classification, relevance of statistics and advice for newcomers to Data Science.
Anmol Rajpurohit: Q1. As the Principal Data Scientist at Live Nation, what kind of data problems do you face? What are the inherent challenges in working with data which is bursty and of rapidly evolving nature?
Vasanth Kumar: The first project I worked on after joining the Data Science team was the classic recommendation system: a domain-specific proof-of-concept in which I applied item-based collaborative filtering, experimented with personalization at various levels of granularity, and subsequently aimed to inform users of relevant findings in real time through multiple mediums (including, but not limited to, social media). The other project on which I have spent a significant amount of time is the real-time anomaly detection/resource optimization problem, which is essentially a classification task on streaming data. That problem (i.e., real-time classification) is the first of its kind that I have had to solve. I have obviously generalized the problem statement in both cases; a good solution to each would require some tailoring to the organization's existing infrastructure and products, even if you choose to traverse the path of least resistance.
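To make the item-based collaborative filtering idea concrete, here is a minimal, hypothetical sketch (not Live Nation's actual system): item-item cosine similarities are computed from a toy user-item ratings matrix, and unrated items are scored by similarity-weighted sums of a user's existing ratings. All names and data are illustrative assumptions.

```python
import numpy as np

def item_similarity(ratings):
    """Cosine similarity between the item columns of a user-item matrix."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0  # guard against division by zero for unrated items
    normalized = ratings / norms
    return normalized.T @ normalized

def recommend(ratings, user, k=2):
    """Rank items the user has not rated by similarity-weighted scores."""
    sim = item_similarity(ratings)
    scores = sim @ ratings[user]
    scores[ratings[user] > 0] = -np.inf  # never re-recommend rated items
    return np.argsort(scores)[::-1][:k]

# toy matrix: rows = users, columns = items
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 4.0],
])
print(recommend(R, user=1, k=1))
```

In a production setting the similarity matrix would be precomputed offline and only the scoring step run per request, but the ranking logic is the same.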
Finding the optimal level of cohesion between the existing infrastructure and a new product is always a challenge and a work in progress as the product and organization evolve over time. Another common attribute across all the data problems has been the diversity of data sources, where each source demands feature engineering logic tailored to it in particular. That complexity occasionally has the potential to trickle all the way down to the bottom of your workflow.
To answer your second question, creating a stable infrastructure that can handle arbitrary high-velocity bursts of up to several million requests per minute was a challenge. Now, let's add learning to that challenge.
How does one effectively learn from data where existing features drop out and new features come in? The answer was online learning. A static model simply could not keep up with the morphing characteristics of the data; in other words, static models would become stale fairly rapidly. We built hybrid models (spanning data from multiple windows of time) that demonstrated internal consistency and good cross-validation results, but they could not keep up with the data.
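The staleness problem can be sketched with a toy online learner. Below, a single-example SGD update for logistic regression tracks a label rule that switches halfway through the stream, which is exactly what a model frozen after the first phase could not do. This is an illustrative sketch under simulated drift, not the production system:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_logistic_step(w, x, y, lr=0.1):
    """One online logistic-regression update on a single (x, y) example."""
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return w - lr * (p - y) * x

# simulate concept drift: the informative feature switches halfway through
w = np.zeros(2)
for t in range(2000):
    x = rng.normal(size=2)
    y = float(x[0] > 0) if t < 1000 else float(x[1] > 0)  # label rule changes
    w = sgd_logistic_step(w, x, y)

print(w)  # after the drift, the weight on feature 1 dominates
```

A static model fit on the first 1000 examples would keep relying on feature 0 and misclassify roughly half the later stream, while the online learner shifts its weight to feature 1.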
AR: Q2. What do you consider as the major challenges of real-time classification?
VK: There are two parts to answering this. One enormous challenge, as I mentioned earlier, has been building the framework that would plug into the existing infrastructure so that we could have an isolated environment whose primary purpose is to perform online learning on streaming data. During the prototype phase, we had always suspected that we would eventually be moving to an online learning setting, but we underestimated the need to do so and thus settled for the "offline modeling with online classification" methodology, which obviously made our engineering lives easier. However, once we realized that learning in real time is crucial for getting consistently reasonable performance, we had to invalidate a lot of the prior assumptions and start re-engineering many of the pieces.
The other big challenge is the data-modeling component. As you probably realize, building a model offline and simply using it for classification in real time is vastly different from having to do both in real time. Hadoop was sufficient for the former, but we required something like Storm to be able to move the modeling process into the real-time realm. We had to determine how every aspect of offline modeling (especially feature engineering and labeling) would translate into its online learning counterpart. We found that to be non-trivial, requiring several trade-offs and assumptions. For example, should we have dynamic confidence thresholds, since the models are evolving constantly? How do we maintain balance between classes when training? It was also evident that training examples should have different weights, but since each request is a potential stimulus, assignment of weights is non-trivial.
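The dynamic-threshold and example-weight questions can be illustrated with a toy streaming classifier (a hypothetical sketch, not the actual Storm-based implementation): each example gets a caller-supplied weight in its SGD update, and the decision threshold is derived from a sliding window of recent scores rather than fixed at 0.5, so it adapts as the model evolves.

```python
from collections import deque
import numpy as np

class StreamingClassifier:
    """Toy online linear classifier with per-example weights and a
    dynamic confidence threshold based on recently seen scores."""

    def __init__(self, dim, lr=0.05, window=200):
        self.w = np.zeros(dim)
        self.lr = lr
        self.recent = deque(maxlen=window)  # sliding window of recent scores

    def score(self, x):
        s = 1.0 / (1.0 + np.exp(-self.w @ x))
        self.recent.append(s)
        return s

    def threshold(self):
        # dynamic cutoff: flag only the top decile of recently seen scores
        return float(np.quantile(self.recent, 0.9)) if self.recent else 0.5

    def learn(self, x, y, weight=1.0):
        # weighted single-example logistic-regression SGD update
        p = 1.0 / (1.0 + np.exp(-self.w @ x))
        self.w -= self.lr * weight * (p - y) * x

# train on a simple synthetic stream where feature 0 carries the signal
rng = np.random.default_rng(1)
clf = StreamingClassifier(dim=2)
for _ in range(1000):
    x = rng.normal(size=2)
    clf.learn(x, float(x[0] > 0), weight=1.0)
```

How the weights should be assigned (by recency, by class rarity, by request importance) is precisely the non-trivial design decision described above; this sketch only shows where such a weight enters the update.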
AR: Q3. How much statistics do you use in your job? Do you think statistical knowledge is of key importance?
VK: My time at work is split between prototyping solutions to data problems (think: modeling, analytics) and subsequently engineering production code to bring the prototype to life. As you can imagine, the latter half of my job has very little to do with statistics and almost everything to do with developing code. In contrast, the former half involves researching and prototyping either new solutions or improvements upon existing solutions. To do this, I once again need to utilize the very same frameworks I mentioned earlier (but can afford to be a little sloppy with my code).
In addition, I utilize tools like R (for basic data analysis and visualizations) and occasionally other third-party libraries that provide specialized implementations of learning algorithms, so that I do not have to invest significant resources re-inventing the wheel. Once we determine that a certain technique performs well, we move on to implementing and integrating it into the product. So, if I had to put a number to it, probably 25% of my time at my job is statistics-related.
The answer to your second question is pretty straightforward, given my belief that machine learning has very deep roots in statistics. It is of extreme importance and great benefit to anyone in the field of machine learning and data analytics to have a thorough understanding of statistics and probability theory. The ability to interpret the math (and understand the assumptions) behind a learning module that I engineered into the product provides a certain level of comfort and confidence. No standard technique works and integrates magically out of the box, so the added statistical knowledge gives me a better understanding, and therefore more options, when creating a tailored application.
AR: Q4. What motivated you to work in data analytics?
VK: In short, it was a combination of passion and the right opportunities. The long version of the answer is that, like most others, I have always attempted to work on things that I am passionate about and reasonably good at. I like to believe that I have succeeded in placing myself in the right opportunities more often than not; doing what I am doing now is a consequence of that. My interest in the field is rooted in the projects that I enjoyed the most early in life: building a spam detector using naive Bayes, implementing a single-hidden-layer neural network that uses back-propagation for character recognition, and applying HMMs to identify regions of interest in genome data, to name a few.
I was persistent about what I wanted to do and kept digging deeper into the field of data mining, and I consequently ended up spending the latter part of graduate school in the UCI DataLab while working on my thesis at the same time. I also consider myself lucky to have had the opportunity to learn from and work with people who are leading experts in the field. I was exposed to the endless possibilities of learning from data in various fields, particularly medicine, which was my focus during my time as a student in the Informatics in Biology and Medicine (IBAM) program at UCI.
AR: Q5. What advice would you give to data mining students and researchers who are just starting to work in this area?
VK: First, the importance of domain-specific knowledge for any arbitrary dataset cannot be overstated. Each time I am faced with data from a new domain, the first discussion with the domain experts is always an eye-opener. In other words, every dataset is different.
Second, and this has more to do with good engineering, it is vital to start with the data at its most raw form and to understand the goal and needs of the product that you are building. It is crucial to be constantly aware of the end-to-end architecture that will utilize the model you are building. In my experience, there is a vast spectrum where at one extreme we have models that can be stripped from the overarching system with little to no impact and at the other extreme we have models that are extremely tightly integrated. Where does yours fall? There are trade-offs in each case but your product and organization should dictate what is most desirable.
Third, particularly if you are out of academia, always start with something (anything!) that works and establish a baseline. Suppress that urge to improve on the AUC until you have a product that delivers on the most basic functional needs. A/B testing in the real world is typically (read: always) a better indicator of performance and easier to interpret when it comes to justifying a "better model" ("added revenue", "improved UX", etc). On an important side note, without the proper infrastructure to perform A/B testing, model comparisons are moot.
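One standard way to put numbers behind the "better model" claim in an A/B test is a two-proportion z-test on conversion rates. The sketch below is a textbook illustration with made-up counts, not data from any real experiment:

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic and two-sided p-value comparing two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value via the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# hypothetical counts: model A converts 200/5000, model B converts 260/5000
z, p = two_proportion_z(200, 5000, 260, 5000)
print(z, p)
```

A p-value below the chosen significance level (commonly 0.05) supports the claim that model B's lift is real rather than noise; with proper randomization, that is far easier to defend to stakeholders than an offline AUC delta.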
AR: Q6. What book (or article) did you read recently and would strongly recommend?
VK: In the interest of staying relevant to the topic at hand, I would suggest an article that I came across recently in the Financial Times, titled "Big data: are we making a big mistake?" It is always good to take a step back when you are immersed in your field of work and gain a broader perspective on things. The article does a great job of highlighting the pain points of big data, reminding us that there is no silver bullet and that "N=All" (i.e., the sampled population is the entire population) is never (ever?) true regardless of how big your data may be. Therefore, one always has to treat one's models with scrutiny. Moreover, it is crucial to determine and be mindful of inherent biases when attempting to generalize observations. The importance of doing so is illustrated in the article through multiple studies from history, each with its own flaws. The article hardly takes a stance; it merely observes, with cautionary tales.