H2O World 2015 – Day 2 Highlights

Highlights from talks delivered by machine learning experts from H2O.ai, Jawbone, Stanford, Quora and PayPal at H2O World, held in Mountain View.

Xavier Amatriain, VP of Engineering at Quora, talked about machine learning at Quora. After a brief overview of the product and its mission, Xavier explained that Quora cares about three dimensions that must go hand in hand: relevance, quality and demand. He then described the data ecosystem at Quora.

In the context of machine learning, he described the problem of "Answer Ranking", which depends on several factors: a good answer is truthful, reusable, well-formatted, and provides an explanation. Data scientists work to translate these dimensions into features: features that relate to the quality of the text itself; interaction features (upvotes/downvotes, clicks, comments); and user features (expertise in the topic, etc.). He also described the problem of ranking in the feed.

Machine learning is used here in a personalized learning-to-rank approach, and feature engineering is very important. He also discussed various other challenges and how Quora is using ML to solve those complex problems. He noted that Quora has not only big data but also rich data, so the algorithms need to understand and optimize complex aspects such as quality, interestingness, and user experience.
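To make the idea of turning answer dimensions into ranking features more concrete, here is a minimal pointwise scoring sketch. It is not Quora's method: the `Answer` fields, the smoothed vote ratio, and the hand-picked `WEIGHTS` are all hypothetical, and a real learning-to-rank system would learn the scoring function from data rather than fix the weights by hand.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    answer_id: str
    text_quality: float      # hypothetical text-quality score in [0, 1]
    upvotes: int             # interaction feature
    downvotes: int           # interaction feature
    author_expertise: float  # hypothetical user feature in [0, 1]


# Hypothetical weights for illustration only; a trained model would replace these.
WEIGHTS = {"text_quality": 2.0, "vote_ratio": 1.5, "author_expertise": 1.0}


def features(answer):
    """Map one answer to a dictionary of numeric features."""
    total_votes = answer.upvotes + answer.downvotes
    # Smoothed upvote ratio: an answer with no votes gets a neutral 0.5.
    vote_ratio = (answer.upvotes + 1) / (total_votes + 2)
    return {
        "text_quality": answer.text_quality,
        "vote_ratio": vote_ratio,
        "author_expertise": answer.author_expertise,
    }


def score(answer):
    """Weighted sum of the features (a simple linear pointwise scorer)."""
    feats = features(answer)
    return sum(WEIGHTS[name] * feats[name] for name in WEIGHTS)


def rank(answers):
    """Return answers sorted from highest to lowest score."""
    return sorted(answers, key=score, reverse=True)


answers = [
    Answer("a", text_quality=0.2, upvotes=1, downvotes=5, author_expertise=0.1),
    Answer("b", text_quality=0.9, upvotes=40, downvotes=2, author_expertise=0.8),
    Answer("c", text_quality=0.5, upvotes=0, downvotes=0, author_expertise=0.5),
]
ranked_ids = [a.answer_id for a in rank(answers)]
```

Personalization would enter through additional features that depend on the reader (for example, the reader's interest in the topic), which this sketch leaves out.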

Nachum Shacham, Principal Data Scientist at PayPal, delivered a talk on "Data Science with Big Data in a Corporate Environment - Tasks, Challenges, and Tradeoffs". Giving an overview of data science, he noted that there is a big gap, and even a conflict, between big data and machine learning. Machine learning prefers a very neat data set with known statistical properties and little or no noise in order to fit a decent model. Big data, however, comes from multiple sources and is often sparse, redundant, or of unknown quality. A lot of work on the data is needed before it is good enough to serve as input to a model.

This work involves data sourcing, data exploration, and data munging. He discussed predictive models for serving customers, used to understand which customers should be targeted. Some of the real-life problems encountered are unbalanced training datasets, latency of results (real-time vs. batch), and the fact that precision and recall carry different costs in different applications. Data exploration is all about knowing what is in the data and fitting the pieces together. Data munging involves reshaping the data by making decisions about categorical variables, irrelevant or redundant values, skewed distributions, key joins, etc. He shared some great data munging packages and functions. Feature engineering is essential for transforming raw data so that it better represents the problem to the model. Under model tuning, he discussed interpretability vs. accuracy.
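The point that precision and recall carry different costs can be illustrated with a small sketch. The labels, scores, candidate thresholds, and cost values below are made up for illustration and do not come from the talk; the code simply picks the decision threshold that minimizes a total cost built from false positives and false negatives on an unbalanced toy dataset.

```python
def confusion_counts(y_true, scores, threshold):
    """Count true/false positives and negatives at a given score threshold."""
    tp = fp = fn = tn = 0
    for label, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision and recall, defined as 0.0 when the denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def best_threshold(y_true, scores, cost_fp, cost_fn, candidates):
    """Pick the candidate threshold with the lowest total misclassification cost."""
    def total_cost(threshold):
        _, fp, fn, _ = confusion_counts(y_true, scores, threshold)
        return cost_fp * fp + cost_fn * fn
    return min(candidates, key=total_cost)


# Toy unbalanced data: 8 negatives, 2 positives (illustrative values only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.55, 0.9]
candidates = [0.3, 0.5, 0.8]

# When a missed positive is expensive, a lower threshold (higher recall) wins.
recall_heavy = best_threshold(y_true, scores, cost_fp=1, cost_fn=10, candidates=candidates)
# When a false alarm is expensive, a higher threshold (higher precision) wins.
precision_heavy = best_threshold(y_true, scores, cost_fp=10, cost_fn=1, candidates=candidates)
```

With the same model scores, the two cost settings select different thresholds, which is why the right operating point depends on the application rather than on the model alone.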

He emphasized that we need to take care of all the little pieces. Otherwise, we might get delayed, incomplete, or irrelevant results. The tasks that stay in the shadows are important because they determine whether even the best algorithm will work or fall by the wayside.

Highlights from Day 3 will be published soon.