Surfing the Big Data Wave at H2O World

The recent H2O World event showcased H2O, an open-source, scalable machine learning platform for the cloud, aimed at people who are familiar with R but limited by its scalability. H2O can run on Hadoop and also on Apache Spark.

By Arun Swami, special to KDnuggets, Nov 2014.

0xdata provides software that lets data scientists quickly and easily run machine learning models at scale. The intended audience seems to be people who are familiar with R but limited by its scalability. Using H2O, data scientists can distribute machine learning algorithms over a cluster. Not all machine learning algorithms are currently supported, but the supported list is quite impressive. Many R functions and structures are supported, but H2O will likely never be a clone of R. 0xdata seems to use a freemium model: the basic software is free and open source, and enterprises can choose to buy a premium license that provides them with 24/7 support, help with optimizing and scaling clusters, etc. For details, please refer to the company Web site.

H2O World took place on November 18-19 in Mountain View, CA at the Computer History Museum. This is a report on Day 1, primarily devoted to sessions that were significantly “hands on” to help attendees get a feel for how they could use the H2O suite of tools.

The day started with an introduction to the company and its mission by Sri Ambati (CEO, Co-founder). The rationale for H2O can be summarized by:

  • Faster: Minutes vs. hours/days
  • Bigger: Bigger dataset / Cluster Mode
  • Better: Ease of Sampling and Feature Selection

Cliff Click (CTO, Co-founder) gave a high-level presentation of the architecture. He showed how data and computation are distributed (the platform is written in Java). According to him, 100GB datasets can be handled easily, and the team is moving towards handling 1TB datasets. Analysis can be run using either a Web UI or RStudio.

Amy Wang gave a fast-paced tutorial on running different machine learning models on H2O. H2O ships with a number of models out of the box that can run on a distributed cluster, and more are being added. For many models, only some of the features are supported. For example, Generalized Linear Models do not support weights, and Gradient Boosting Machines (GBM) do not support different loss functions.

Tom Kraljevic (VP Engineering) gave a talk on Using H2O in Big Data Environments. H2O can be run on Hadoop (YARN), and there is a project (Sparkling Water) to run H2O as an application on top of Spark. The team plans to interoperate with Spark MLlib.

Arno Candel gave a comprehensive tutorial on doing deep learning using H2O. Free booklets on using R and on running deep learning are available at

Arno Candel also gave an interesting talk on using auto-encoders for anomaly detection. In this case, the auto-encoder is a deep learning model where the number of input neurons is the same as the number of output neurons. The model learns to approximate the identity function through a hidden layer that has far fewer neurons than the input or output layer, so inputs that it reconstructs poorly can be flagged as anomalies. He also talked, at a superficial level, about using H2O for feature engineering.
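To make the auto-encoder idea concrete, here is a minimal sketch in plain Python. It is not H2O's API and not the model shown in the talk: it assumes a toy setup with three inputs, a single linear hidden neuron with tied encoder/decoder weights, and synthetic data, all chosen only for illustration. The helper names (train, reconstruction_error, make_normal_data) are hypothetical. The network is trained on "normal" points only, and the reconstruction error is then used as the anomaly score.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reconstruction_error(w, x):
    """Squared error between x and its reconstruction through the bottleneck."""
    h = dot(w, x)                      # encode: 3 inputs -> 1 hidden value
    x_hat = [wi * h for wi in w]       # decode: 1 hidden value -> 3 outputs
    return sum((xi - ri) ** 2 for xi, ri in zip(x, x_hat))


def train(data, epochs=30, lr=0.02, seed=0):
    """Stochastic gradient descent on the squared reconstruction error.

    Assumption: tied weights (the decoder reuses the encoder vector w),
    which keeps the sketch short; real deep auto-encoders use separate,
    nonlinear layers.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(len(data[0]))]
    for _ in range(epochs):
        for x in data:
            h = dot(w, x)
            r = [xi - wi * h for xi, wi in zip(x, w)]   # residual x - x_hat
            rw = dot(r, w)
            # Gradient of ||x - w (w.x)||^2 w.r.t. w is -2 (h r + (r.w) x).
            w = [wi + 2 * lr * (h * ri + rw * xi)
                 for wi, ri, xi in zip(w, r, x)]
    return w


def make_normal_data(n=200, seed=1):
    """Synthetic 'normal' points: close to a single line in 3-D space."""
    rng = random.Random(seed)
    direction = (1.0, 2.0, -1.0)
    norm = 6 ** 0.5
    data = []
    for _ in range(n):
        t = rng.uniform(-2.0, 2.0)
        data.append([t * d / norm + rng.gauss(0.0, 0.05) for d in direction])
    return data


normal_data = make_normal_data()
weights = train(normal_data)

normal_scores = [reconstruction_error(weights, x) for x in normal_data]
threshold = max(normal_scores)          # simple cut-off taken from training data

anomaly = [1.0, -2.0, 1.0]              # a point far from the learned line
anomaly_score = reconstruction_error(weights, anomaly)

print("largest normal score:", threshold)
print("anomaly score:", anomaly_score)
print("flagged as anomaly:", anomaly_score > threshold)
```

Taking the threshold as the largest error seen on the training data is just one simple choice; in practice a percentile of the training errors is a common alternative.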

Photo: Sri Ambati and Arno Candel on stage.

Yan Zou and Vijay Iyengar talked about using H2O for Marketing and CRM. They also announced a competition based on the KDD Cup 1998 data set, in which participants would use H2O and be ranked on how much better they performed than the baseline. The contest details will be posted on

Bio: Arun Swami is a Bay Area entrepreneur and tech leader who has created innovative systems using text mining, ranking algorithms, heuristic approaches, data mining, personalization technology, database algorithms and optimization algorithms. Arun was a key member of the team that started IBM's research in data mining and has published seminal work in this area. His classic data mining paper with Rakesh Agrawal, "Mining Association Rules", is ranked among the most cited CS papers.