KDnuggets Interview: Juan Miguel Lavista, Microsoft Data Science Team

We discuss Randomized Controlled Experiments, common errors during A/B testing, Correlation vs. Causality, Big Data Myths and setting up realistic expectations from Big Data and more...

By Anmol Rajpurohit (@hey_anmol), Apr 30, 2014.

Juan Miguel Lavista is currently the Principal Data Scientist for the Microsoft Data Science team (DnA), where he works with a team of data scientists searching for insights in petabytes of data. Juan joined Microsoft in 2009 to work for the Microsoft Experimentation Platform (EXP), where he designed and ran randomized controlled experiments across different Microsoft properties. At Microsoft, Juan also worked as part of the Bing Data Mining team. Before joining Microsoft, Juan was the CTO and co-founder of alerts.com. Juan has two computer science degrees from the Catholic University in Uruguay and a graduate degree in Data Mining from Johns Hopkins University. He lives in Kirkland, WA, with his wife and daughter. He has been a speaker at conferences in many countries, including the US, Canada, Argentina, Colombia and Uruguay, and he was also a TEDx speaker in 2010.

Juan recently delivered a keynote at the Big Data Summit 2014. His talk, "The Good, the Bad, and the Ugly of Big Data and Data Science", focused on the myths of big data and why the important word in data science is science and not data.

Here is my interview with him:

Anmol Rajpurohit: Q1. In your keynote you advised that we should focus on the problem and not on the size of the data. But quite often, Data Scientists face the question "Here is our firm's customer data. See if you can find any insights (that were previously unknown) of business value". So, in situations where there is no well-defined problem and the task is exploratory in nature, what should one focus on?

Juan Miguel Lavista: Exploratory work is an important part of a data scientist's job. The focus of the data scientist needs to be on what we can do with this data, what questions can be answered based on this data, and what type of problem can be solved. Curiosity is key here. During this process, a data scientist will need to understand how the data was collected, possible bias, and more importantly, the quality of the data. It is important to understand that some problems will require more data than others, and that more data will also require more processing time and/or more cost. It is the job of the data scientist to understand this tradeoff.

AR: Q2. You mentioned that it is very important to "be able to run massive amounts of randomized controlled experiments". What are the challenges involved in concurrently running such large-scale experiments?

JML: On the one hand, there are system challenges: you need to be able to collect all the data in near real time and also process all the metrics. This is definitely not an easy job. The system also needs to be very flexible and able to support new instrumentation, new metric definitions, etc., and this is also challenging. On top of this, you have to make sure you have good data quality. More than 50% of the traffic on the internet is produced by bots, so filtering bot traffic is key for the experimentation system to be trustworthy. On the other hand, there are cultural challenges: you need everyone in an organization to understand the value of running controlled experiments and how to interpret the results correctly. This requires a cultural change within the organization.

AR: Q3. How do you address the concern of potential negative effect on user experience?

JML: On average, 2/3 of the ideas or features will have negative or flat impact. People who worry about experimenting do not realize that without experimentation they will be flying blind and will not know if the feature or idea is good or not. We need to ask ourselves what is better: ship something to 100% of our users without knowing if it will work, or test it with a small percentage of users for a period of time and only ship the positive ones?

2/3 of the time our ideas will fail. Fast innovation comes from failing fast and shipping only the 1/3 of ideas that are good, and experimenting is the way to find out which ideas belong to that 1/3.

Another important thing is that if the experiment is really bad and is hurting user experience, it will become statistically significant pretty fast, and at that point we can stop it either manually or automatically.
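To make the idea of a harmful experiment "becoming statistically significant" concrete, here is a minimal sketch of a two-proportion z-test on conversion counts. It is only an illustration under stated assumptions, not a description of Microsoft's experimentation system: the function name, the choice of a pooled z-test, and the example numbers are all hypothetical.

```python
import math


def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z, p_value). Uses the pooled standard error and the normal
    approximation, so it assumes reasonably large samples in both groups.
    """
    rate_control = conv_control / n_control
    rate_treatment = conv_treatment / n_treatment
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (rate_treatment - rate_control) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: the treatment converts clearly worse than control.
    z, p = two_proportion_z_test(1000, 10_000, 800, 10_000)
    print(f"z = {z:.2f}, p = {p:.2g}")
```

A negative z with a very small p-value would be the signal to stop the treatment. Note that checking the p-value repeatedly while data arrives inflates the false-positive rate, so an automated stopping rule would normally need a stricter threshold or a sequential testing method.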

AR: Q4. According to you, what are the key success factors of Randomized Controlled Experiments?

JML: I think that the most important success factor is that the organization needs to embrace a good experimentation culture. Not only understanding the value of controlled experiments, but also understanding that the majority of the time our ideas will fail and this is not a problem, but rather just part of the innovation process. The organization needs a data-driven culture and the right incentives to run experiments, making it clear to everyone in the organization that decisions are made based on data and not just hunches.

AR: Q5. What are the most common errors made during A/B testing?

JML: On the experiment design front, a frequent problem is to have the wrong success criteria. The OEC (Overall Evaluation Criterion) is the metric we use to measure success. Choosing a metric seems trivial, but it is definitely not. Having the right metric takes a lot of effort, and if it is the wrong one, it is like being lost in the jungle with the wrong map.

Another common problem is to have a bias in treatment or control, for example if one of the pages is cached but not the other one, or if there is an extra redirect. Many factors can easily introduce bias. This is why it is so important to run A/A tests beforehand to make sure the system is trustworthy.
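As a rough illustration of what an A/A test checks, the sketch below simulates many experiments in which both groups receive the identical experience and counts how often a significance test wrongly reports a difference. In an unbiased system that fraction should sit near the chosen significance level (5% here). The function names, the simulated conversion rate, and the use of a pooled two-proportion z-test are assumptions made for this example only.

```python
import math
import random


def p_value_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(num_tests=1000, users_per_arm=1000,
                           base_rate=0.10, alpha=0.05, seed=0):
    """Simulate A/A tests and return the fraction flagged as significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(num_tests):
        # Both arms draw from the same conversion rate: there is no real effect.
        conv_a = sum(rng.random() < base_rate for _ in range(users_per_arm))
        conv_b = sum(rng.random() < base_rate for _ in range(users_per_arm))
        p = p_value_two_proportions(conv_a, users_per_arm, conv_b, users_per_arm)
        if p < alpha:
            false_positives += 1
    return false_positives / num_tests


if __name__ == "__main__":
    print(f"Share of A/A tests flagged significant: {aa_false_positive_rate():.3f}")
```

If a real system flags noticeably more than about 5% of its A/A tests, something other than the feature (caching, redirects, uneven bot filtering) is probably differing between the two groups.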

AR: Q6. From the perspective of Predictive Analytics do you think high values for correlation are good enough? Identifying and establishing causality can often be a daunting task. Do you consider it worth the effort?

JML: As with many other things, “it depends” on how this information will be used. If we understand what the predictive model is saying, we usually do not have a problem. The issue arises when we take the output of a correlation-based model to be causality. At that point we have a problem.

The other problem I often see is that if we have enough data and variables, we will find correlations that are there just by coincidence. Even though those models will work fine in a testing environment, they will not have predictive power on unseen data. This is common where production data is scarce, for example in earthquake detection.
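The point about coincidental correlations can be sketched with a small simulation. Here every candidate variable is pure random noise, yet with enough candidates and few observations the best one looks strongly correlated with the target, and that correlation disappears on fresh data. All names and sizes below (1000 variables, 20 observations, 200 holdout points) are hypothetical choices for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spurious_correlation_demo(num_features=1000, train_size=20,
                              holdout_size=200, seed=0):
    """Return (best in-sample r, r of the same noise variable on fresh data)."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(train_size)]

    # Search many unrelated noise variables for the strongest correlation.
    best_r = 0.0
    for _ in range(num_features):
        feature = [rng.gauss(0, 1) for _ in range(train_size)]
        r = pearson_r(feature, target)
        if abs(r) > abs(best_r):
            best_r = r

    # The "winning" variable is still noise, so new observations of it and of
    # the target are again independent draws.
    new_feature = [rng.gauss(0, 1) for _ in range(holdout_size)]
    new_target = [rng.gauss(0, 1) for _ in range(holdout_size)]
    return best_r, pearson_r(new_feature, new_target)


if __name__ == "__main__":
    train_r, holdout_r = spurious_correlation_demo()
    print(f"best in-sample r = {train_r:.2f}, r on fresh data = {holdout_r:.2f}")
```

The design mirrors the situation described above: selecting the best of many variables on a small sample rewards chance, which is why a correlation found this way needs to be checked on data that played no part in the selection.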

AR: Q7. Besides the 5 myths you shared at Big Data Innovation Summit, any other Big Data myths that you would like to share? Refer: Highlights of Keynote Speeches on Day 1 of Big Data Innovation Summit 2014

JML: I think myth number 6 would be that people think that getting value out of data is something new. Statisticians have been getting value out of data for generations now, to the point that some of them probably think that the whole idea of the data scientist is just identity theft! Galileo, Newton, etc. all used observations to deduce models about the world. Fisher was running controlled experiments in agriculture in the 1920s. In 1854, there was an outbreak of cholera in London and people believed cholera was airborne. John Snow, using data on a map, devised a theory that cholera was spread through the ingestion of polluted water, and was able to identify a water pump as the source of the disease.

AR: Q8. What do you consider the primary reasons behind the Big Data hype? Amid such hype, how can people gain a true understanding of Big Data, in order to set up realistic expectations and a pragmatic implementation approach?

JML: There are many success stories of using data, from search engines to recommendation engines, Moneyball, Nate Silver, etc.

The interesting thing about how we ended up with this hype is that it is another example of the wrong use of data: people see that “Big Data” is correlated with success and conclude that “Big Data” implies success, which is of course wrong. Because of this, when projects like Google Flu fail, people start thinking that big data is useless, which is also wrong.

There is no question about the value of using data, but what people need to understand is that getting value out of data is hard and requires a lot of effort. They also need to understand that many projects or models will fail and that this is just normal.

AR: Q9. If you were a fresher starting in the Big Data industry today, how would you shape your career?

JML: I would create a strong foundation in math and statistics. This is fundamental for any data scientist. Also, start playing with data as early as you can. It is much easier now: there are a lot of great data sources that you can download for free.

AR: Q10. What are your favorite books or blogs on Big Data?

JML: SimplyStatistics.org from Jeff Leek, Roger Peng, and Rafa Irizarry is a great blog. I also like R-bloggers and Occam’s Razor from Avinash Kaushik, as well as the news section from KDnuggets. I find that your selection of stories is very good.

On books, Bayesian Reasoning and Machine Learning from David Barber, Learning from Data from Abu-Mostafa, and How to Measure Anything from Douglas Hubbard.