H2O World 2015 – Day 1 Highlights

Highlights from talks and tutorials delivered by machine learning experts at H2O World 2015 held in Mountain View.

The Machine Learning community gathered last week (Nov 9-11) at the Computer History Museum in Mountain View for a very successful conference, H2O World 2015. The event brought together a community of H2O users, customers and experts to share their knowledge and discuss challenges across domains such as data science and artificial intelligence.

H2O is the leading open source machine learning platform for smarter applications. H2O.ai was selected as a Gartner Cool Vendor in Data Science for 2015.

On the very first day of the conference, the H2O.ai team was excited to announce the close of a $20 million Series B funding round. H2O.ai will use its Series B capital to grow the company's sales, marketing and customer success teams and to support exponential growth in customers and community. Over the course of three days, there were many great tutorials and talks, some of them from widely recognized machine learning experts.

The first day of the conference focused on imparting knowledge to the attendees. Here are the highlights from day 1:

The conference started with a welcome note from Sri Ambati, CEO and Co-founder. Sri shared the journey of the company since its birth. About 5,000 companies and 25,000 data scientists have adopted H2O.

Erin LeDell, Data Scientist, H2O.ai, delivered the first talk, "Introduction to Data Science". Talking about the term "Data Science", she mentioned that its first occurrence was in 1996, during a conference in Japan. Data Science consists of three major steps: Problem Formulation, Data Processing and Machine Learning. She also spoke about the essential skills of a data scientist: Math & Statistics, Programming & Database, Domain Knowledge & Soft Skills, and Communication & Visualization. She shared a LinkedIn survey of Data Scientists, which stated that the number of Data Scientists has doubled over the last 4 years and listed the top 5 skills reported by Data Scientists:

  1. Data Analysis
  2. R
  3. Python
  4. Data Mining
  5. Machine Learning

She emphasized that companies should give up on hiring a data science unicorn and instead focus on forming data science teams. Data Science teams should consist of Data Analysts, Data Engineers and Data Scientists. Talking about Data Science tools, she mentioned that we are headed towards language-agnostic Data Science, where friendly APIs connect to powerful data processing engines. She briefly talked about Machine Learning, Deep Learning and the problems that can be solved using them.

Mark Landry, Competition Data Scientist & Product Manager, H2O.ai, shared the following "Top 10 Data Science Pitfalls":

  1. Train vs. Test Error: A bad partition of the data causes overfitting.
  2. Train vs. Test vs. Valid: The validation set should be used strictly for model tuning. There is no general rule for how to partition the data.
  3. Model Performance: Choosing the right performance metric is crucial.
  4. Class Imbalance: Manually up-sample the minority class or down-sample the majority class, either by duplicating or dropping rows or by using row weights. The SMOTE algorithm can also be used.
  5. Categorical Data: Too many categories cause problems. Reduce them with a sensible higher-level mapping of categories if some hierarchical knowledge about the data is available.
  6. Missing Data: The right treatment depends on the data.
  7. Outliers / Extreme Values: Remove the observation, apply a transformation to reduce its impact, choose a more robust loss function, or impose a constraint on the data range.
  8. Data Leakage: Understand the problem and the data, and scrutinize model feedback, e.g. relative influence.
  9. Useless Models: Understand the problem.
  10. No Free Lunch: No single algorithm is the best. Try several algorithms and observe their relative performance and the characteristics of your data.
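
As a rough illustration of the partitioning and class-imbalance pitfalls above, here is a minimal sketch in plain Python. It is not code from the talk: the function names `three_way_split` and `balancing_row_weights`, the 60/20/20 split and the inverse-frequency weighting formula are all assumptions chosen for the example.

```python
import random
from collections import Counter


def three_way_split(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle rows and split them into train, validation and test lists.

    The fractions are illustrative only; there is no general rule for them.
    """
    if train_frac <= 0 or valid_frac <= 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]  # use only for model tuning
    test = shuffled[n_train + n_valid:]          # touch only for the final estimate
    return train, valid, test


def balancing_row_weights(labels):
    """Return one weight per row, inversely proportional to its class frequency.

    With these weights every class contributes the same total weight,
    which is one alternative to duplicating minority-class rows.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return [total / (n_classes * counts[label]) for label in labels]


# Hypothetical data: 90 negative rows and 10 positive rows.
labels = [0] * 90 + [1] * 10
train, valid, test = three_way_split(range(len(labels)))
weights = balancing_row_weights(labels)
```

The weights could then be passed to any learner that accepts per-row weights. Whether weighting, resampling or SMOTE works best depends on the data, in line with the "No Free Lunch" point.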

Arno Candel, Chief Architect, H2O.ai, gave a tutorial on "Deep Learning". Deep Learning learns a hierarchy of non-linear transformations. It got a boost in the past decade from faster hardware and algorithmic advances. Deep learning has some strengths (it is non-linear, robust to correlated features, conceptually simple, etc.) and some weaknesses (it is slow to train, slow to score, prone to overfitting, etc.). The benefits of the H2O ecosystem include scalability to massive datasets on large clusters, full parallelization, auto-generated low-latency Java scoring code, and easy deployment. He led a hands-on tutorial on how to build, checkpoint and cross-validate a deep learning model.
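
To make "a hierarchy of non-linear transformations" concrete, the following is a minimal sketch of a feed-forward pass in plain Python. It is not H2O's implementation or API: the layer sizes, the random weight initialization and the helper names (`dense_layer`, `init_layer`, `forward`) are assumptions made for illustration, and no training is performed.

```python
import math
import random


def relu(z):
    """Rectified linear unit, a common hidden-layer non-linearity."""
    return max(0.0, z)


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """One layer: an affine map followed by an element-wise non-linearity."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (an arbitrary choice for this sketch)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(x, layers):
    """Feed x through each layer in turn; every layer transforms the previous output."""
    activations = x
    for weights, biases, activation in layers:
        activations = dense_layer(activations, weights, biases, activation)
    return activations


rng = random.Random(0)
sizes = [4, 8, 8, 1]  # 4 inputs, two hidden layers of 8 units, 1 output
layers = []
for i in range(len(sizes) - 1):
    weights, biases = init_layer(sizes[i], sizes[i + 1], rng)
    is_output = i == len(sizes) - 2
    layers.append((weights, biases, sigmoid if is_output else relu))

output = forward([0.5, -1.2, 3.0, 0.0], layers)  # a single value between 0 and 1
```

Stacking layers this way is what lets the model build higher-level features from lower-level ones. Training (adjusting the weights from data) is the expensive part and is omitted here.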

There were multiple other hands-on training sessions on topics such as Python Pipelines, Gradient Boosting Method and Random Forest, Ensembles, Building Smart Applications, etc.

Day 2 highlights