Predictive Analytics Innovation Summit 2014 London: Day 2 Highlights

Highlights from the presentations by Predictive Analytics leaders from Spotify, ING, Quintiles, and Riot Games on day 2 of Predictive Analytics Innovation Summit 2014 in London, UK.

As the amount of data that companies can effectively collect, store and analyse increases, organisations face new challenges in making use of this vast resource. Big Data offers all companies the opportunity to transform their organisation into a data-driven culture, promising a more efficient business and a more informed decision-making process.

While data analytics and modeling can offer an accurate picture of what is happening now, predictive analytics goes further, bringing information together to offer an accurate prediction of future action. Knowing how customers or markets will behave before they do is a new opportunity for companies, and those able to capitalize on it before their competitors gain a crucial advantage in driving success.

The Predictive Analytics Innovation Summit (May 14 & 15, 2014) was organized by the Innovation Enterprise in London, UK. The summit brought together Analytics leaders from various industries for interactive sessions and thought-provoking discussion on how they are using Analytics to gain a clearer picture of their customers, market and organisation as a whole.

We provide here a summary of selected talks along with the key takeaways.

Highlights from Day 1.

Here are highlights from Day 2 (Wednesday, May 15, 2014):

Ali Sarrafi, Product Owner, Spotify gave an intriguing talk on "Managing Experiments at Spotify". Spotify strives for team autonomy and independence: no team should be blocked by others, and each should be able to move as fast as it can. This autonomy has proven to be a challenge for managing a centralized and coordinated experimentation infrastructure and analysis. He shared his story of setting up the experimentation infrastructure at Spotify and how he handled experimentation in a complex multi-platform environment.

Spotify has over 40 autonomous teams working on features across more than 7 platforms, and over 3,000 source repositories. It is important to align on metrics rather than on the actual tests. Metrics undergo their own evolution, so patience and determination are key to identifying the right metrics and the best way to use them. Many organizations fail to treat functionality as a metric, which is a serious blunder. He also talked about how to find the right balance between development speed and proper experimentation in a data-driven culture. In conclusion, he emphasized that:
  • Metric alignment is the most effective way of making sure the overall user experience does not suffer (due to product experimentation)
  • Education and Automation are key for adoption
  • People as a service is the most effective way of reaching out to teams
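Sarrafi did not share implementation details, but one common way to coordinate experimentation across autonomous, multi-platform teams is deterministic hash-based bucketing: every client computes the same variant assignment independently, with no central lookup. A minimal sketch (the experiment name, user id, and variant labels here are purely illustrative, not Spotify's):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically map a user to a variant for a given experiment.

    Hashing (experiment, user_id) means every platform and client
    computes the same assignment independently -- no central
    coordination service is needed, preserving team autonomy.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always lands in the same variant for a given experiment,
# while assignments across different experiments are independent.
variant = assign_variant("user-42", "new-home-screen", ["control", "treatment"])
```

Because the assignment is a pure function of the inputs, teams can ship experiments on any platform and still produce consistent, joinable metrics downstream.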

Natalino Busa, Big Data Technologist, ING shared some great Big Data lessons in his talk "Big Data Solutions for Marketing Research at ING Retail Netherlands". The retail banking market demands, now more than ever, staying close to customers and carefully understanding which services, products, and wishes are relevant for each customer at any given time. This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks.

There is an immense need to humanize data - through storytelling, visualization and other means - in order to make it truly usable. Data is the fabric of our lives, so let's give more meaning and context to data. He explained the human expectations from data through the well-known Maslow's Hierarchy of Needs, in the context of retail banking customers. He mentioned that the top 3 tasks of a data scientist are: dimensionality reduction, clustering & segmentation, and predictive analytics. In data science, it is important to keep it scientific (by cross-validating models and keeping it measurable), and to play with it (creating new features and exploring the available data).
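To make the "keep it scientific" point concrete: cross-validation needs no special framework and can be sketched in a few lines. A minimal pure-Python k-fold split (the toy model and data used in the usage note are invented for illustration, not ING's):

```python
def kfold_indices(n: int, k: int):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every record appears in exactly one test fold, so the model is
    always evaluated on data it was not fitted on.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_val_score(model_fit, model_predict, X, y, k=5):
    """Average accuracy over k folds -- 'keeping it measurable'."""
    scores = []
    for train, test in kfold_indices(len(X), k):
        params = model_fit([X[i] for i in train], [y[i] for i in train])
        preds = [model_predict(params, X[i]) for i in test]
        correct = sum(p == y[i] for p, i in zip(preds, test))
        scores.append(correct / len(test))
    return sum(scores) / k
```

Any fit/predict pair plugs in: for instance, a trivial majority-class baseline (`fit` returns the most common training label, `predict` ignores its input) gives the floor against which a real model must improve.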

The goal should be to earn the customer's trust by understanding them deeply. The key challenges are that there is not much time to react (customers want everything NOW, with no more room for latency), and there is a lot of information to process (a larger context is required for deep learning). By using Hadoop and open source statistical languages and tools such as R and Python, firms can execute a variety of machine learning algorithms and scale them out on a distributed computing framework. He described the following as key lessons learned:
  • Mix and match of technologies is a good thing
  • Fast Data must complement Big Data
  • Data Science takes time to figure out

Vladimir Anisimov, Senior Strategic Biostatistics Director, Quintiles delivered an insightful talk on "Predictive Analytic Modelling of Clinical Trial Operations". The drug development process is highly complicated and costly. The majority of existing tools in pharma companies still use ad-hoc simplified or deterministic models, which leads to inefficient design, under-powered studies, extra costs and drug waste. Efficient design and forecasting of clinical trial operations require predictive analytic techniques that account for major uncertainties in input data, the stochasticity of enrolment & events, and the hierarchic structure of operational characteristics. Next, he discussed some innovative statistical techniques for predictive modelling: patient enrollment, trial/site performance, risk-based monitoring, and operational characteristics & costs.

He discussed innovative analytic techniques for predictive modelling of operational processes in late-stage clinical trials. Patient enrollment modelling forms the underlying methodology; as the next stage, more complicated processes on top of enrollment are modeled using hierarchic evolving processes. He developed a technique for evaluating predictive distributions that admits closed-form solutions for many practical scenarios, so Monte Carlo simulation is not required. The developed tools are applied to dynamic modelling of the number of patients in a trial, events in event-driven trials, operational costs, and site risk-based monitoring. Finally, he shared some case studies which involved novel predictive modeling solutions built using C#, R and RExcel.
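The talk did not include code, but the closed-form idea can be illustrated with the Poisson-gamma enrollment model Anisimov is known for: counts given a site's rate are Poisson, the rate varies across sites as a Gamma distribution, and marginalising the rate yields a negative binomial predictive distribution with no simulation needed. A minimal sketch in Python (rather than the C#/R stack mentioned; the parameter values are purely illustrative):

```python
import math

def enrollment_pmf(k: int, alpha: float, beta: float, t: float) -> float:
    """P(K = k): patients enrolled by time t > 0 when the site's rate is
    Gamma(alpha, beta) and counts given the rate are Poisson(rate * t).

    Marginalising the rate gives a negative binomial distribution in
    closed form -- no Monte Carlo simulation required.
    """
    p = beta / (beta + t)  # negative binomial "success" probability
    log_pmf = (math.lgamma(k + alpha) - math.lgamma(alpha)
               - math.lgamma(k + 1)
               + alpha * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

def enrollment_mean(alpha: float, beta: float, t: float) -> float:
    """Closed-form expected enrollment by time t: alpha * t / beta."""
    return alpha * t / beta
```

Having the full predictive distribution in closed form, rather than just a point forecast, is what allows quantifying the risk of under-enrollment for a given trial design.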

Peter Tillotson, Research Data Scientist, Riot Games talked about the problem of duplicates and his approach to solving it, in his talk "Data Finds Data – The Rest is Math". The success of Riot Games is due in no small part to its team’s ability to respond to changing landscapes. Achieving this level of adaptability requires talented people with a vast amount of responsibility and ownership, an approach that is common across successful companies in various industries. Modular and agile development enables teams to make rapid progress and adapt to changes, but for a data scientist, it can lead to a painful truth: incompatible data models, each with their own way of identifying a game player.

In its start-up days, the firm found MySQL and Excel good enough for its analytical needs. But after witnessing strong growth, these basic tools were no longer able to handle the workload, and the firm switched to Hadoop, since it was cost-effective, scalable, and open-source. Peter’s Big Data challenge was de-duplication, given that databases were using different IDs for the same player. Player interactions are complex, yet they need to be understood well in order to uniquely identify the player, which is the first step towards providing a great gaming experience.

Duplicates are a common problem that arises for various reasons. Data schemas tend to evolve, leading to duplicates. Sometimes it is desirable to keep data separate, for security or performance reasons, or simply to decouple project dependencies. In other words, the challenge is that whenever a new record is observed, we need to determine whether it matches anything we have seen so far, i.e. the historical records in the database. This can be solved by a "Like Routing" process for identifying the entity using nGram partitions. Describing the technical part of the solution, he explained the benefits of using Hadoop, Accumulo (a distributed column store) and intelligent keys.
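The internals of Riot's solution were not published, but the core idea of n-gram partitioning can be sketched: route each record to the partitions named by its character n-grams, so that a new record only needs to be compared against historical records sharing at least one n-gram — the data "finds" its own match candidates. A minimal in-memory illustration (the player names, record ids, and n-gram size are invented; the real system distributes these partitions over Accumulo):

```python
from collections import defaultdict

def ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of a normalised (lowercased, stripped) string."""
    text = text.lower().strip()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

class NgramIndex:
    """Route each record to the partitions named by its n-grams, so a
    lookup only touches records that share at least one n-gram with
    the query -- instead of scanning the whole history."""

    def __init__(self, n: int = 3):
        self.n = n
        self.partitions = defaultdict(set)  # n-gram -> set of record ids
        self.records = {}                   # record id -> stored name

    def add(self, record_id: str, name: str):
        self.records[record_id] = name
        for gram in ngrams(name, self.n):
            self.partitions[gram].add(record_id)

    def candidates(self, name: str) -> set:
        """Ids of stored records sharing any n-gram with `name`."""
        found = set()
        for gram in ngrams(name, self.n):
            found |= self.partitions[gram]
        return found
```

A final, more expensive similarity check (edit distance, shared attributes, and so on) then runs only over the small candidate set, which is what makes de-duplication tractable at scale.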