H2O World 2015 – Day 1 Highlights
Highlights from talks and tutorials delivered by machine learning experts at H2O World 2015 held in Mountain View.
H2O is the leading open source machine learning platform for smarter applications. H2O.ai was selected as a Gartner Cool Vendor in Data Science for 2015.
On the very first day of the conference, the H2O.ai team was excited to announce the close of a $20 million Series B funding round. H2O.ai will use its Series B capital to grow the company's sales, marketing, and customer success teams and to support exponential growth in customers and community. Over the course of the three days there were many great tutorials and talks, several of them delivered by well-recognized machine learning experts.
The first day of the conference focused on imparting knowledge to the attendees. Here are the highlights from day 1:
The conference started with a welcome note from Sri Ambati, CEO and Co-founder, who shared the company's journey since its founding. About 5,000 companies and 25,000 data scientists have adopted H2O.
Erin LeDell, Data Scientist at H2O.ai, delivered the first talk, "Introduction to Data Science", which covered:
- Data Analysis
- R
- Python
- Data Mining
- Machine Learning
Mark Landry, Competition Data Scientist & Product Manager at H2O.ai, shared the following "Top 10 Data Science Pitfalls":
- Train vs. Test Error: A bad partition of the data causes overfitting.
- Train vs. Test vs. Valid: The validation set should be used strictly for model tuning; there is no general rule for how to partition the data (see the splitting sketch after this list).
- Model Performance: Choosing the right performance metric is crucial (see the metric sketch after this list).
- Class Imbalance: Manually up-sample the minority class (e.g., by duplicating rows) or down-sample the majority class, or use row weights; the SMOTE algorithm can also be used (see the imbalance sketch after this list).
- Categorical Data: Too many categories can be a problem; reduce them with a sensible higher-level mapping if you have hierarchical knowledge about the data (see the mapping sketch after this list).
- Missing Data: The right treatment depends on the data.
- Outliers / Extreme Values: Remove the observations, apply a transformation to reduce their impact, choose a more robust loss function, or impose constraints on the data range (see the clipping sketch after this list).
- Data Leakage: Understand the problem and the data; scrutinize model feedback, e.g., relative influence.
- Useless Models: Understand the problem
- No Free Lunch: No single algorithm is best; try several algorithms and observe their relative performance and the characteristics of your data.
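The data-partition pitfalls are easiest to see in code. Here is a minimal sketch, not from the talk, using scikit-learn and synthetic data (the 60/20/20 split ratios are an arbitrary assumption): the test set is held out first and touched only for the final error estimate, while the validation set is reserved for model tuning.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Carve off a held-out test set first, then split the remainder into
# train and validation. The validation set is used only for tuning;
# the test set is used only once, for the final performance estimate.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
```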
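On the metric pitfall, a quick illustration (again scikit-learn and synthetic data, not material from the talk) of why accuracy can mislead on an imbalanced problem while AUC gives a more honest picture:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# A 95/5 imbalanced problem: accuracy can look high even when the
# model rarely finds the minority class, so AUC is worth checking too.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```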
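For class imbalance, a small sketch of two of the remedies mentioned, row weights and up-sampling by duplicating rows (the use of scikit-learn and a plain logistic regression is an assumption of this illustration; SMOTE, available in the separate imbalanced-learn package, is omitted here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Heavily imbalanced synthetic data (about 5% positives).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Option 1: weight rows instead of duplicating them.
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: up-sample the minority class by duplicating its rows.
minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]
upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
idx = np.concatenate([majority, upsampled])
upsampled_model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
```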
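For high-cardinality categorical data, a minimal pandas sketch of a higher-level mapping (the product-code hierarchy here is hypothetical):

```python
import pandas as pd

# A high-cardinality categorical column: individual product codes.
df = pd.DataFrame({"product": ["A13", "A27", "B04", "B91", "C55", "A02"]})

# Hypothetical hierarchy: the leading letter is the product family,
# so mapping each code to its family collapses many levels into a few.
df["product_family"] = df["product"].str[0]
print(df["product_family"].value_counts())
```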
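Finally, for extreme values, one of the listed options, constraining the data range, can be as simple as clipping to percentiles (a NumPy illustration with made-up numbers, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# A numeric feature with a few injected extreme values.
x = np.concatenate([rng.normal(1.0, 0.2, size=200), [25.0, -18.0]])

# One option from the list: constrain the data range by clipping to the
# 1st/99th percentiles so the extremes cannot dominate a squared-error loss.
# Dropping rows, transforming the feature, or a robust loss are alternatives.
low, high = np.percentile(x, [1, 99])
x_clipped = np.clip(x, low, high)
print(x.max(), x_clipped.max())  # the 25.0 outlier is pulled back toward the bulk
```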
There were several other hands-on training sessions on topics such as Python pipelines, Gradient Boosting Machines and Random Forests, ensembles, and building smart applications.