Interview: Arno Candel, on How to Quick Start Deep Learning with H2O

We discuss H2O use cases, resources to start using H2O for Deep Learning, evolution of High Performance Computing (HPC) and the future of HPC.

Dr. Arno Candel is a Physicist & Hacker at H2O. Prior to that, he was a founding Senior MTS at Skytree, where he designed and implemented high-performance machine learning algorithms. He has over a decade of experience in high-performance computing and had access to the world’s largest supercomputers as a Staff Scientist at SLAC National Accelerator Laboratory, where he participated in U.S. DOE scientific computing initiatives and collaborated with CERN. Arno has authored dozens of scientific papers and is a sought-after conference speaker.

He holds a PhD and Masters summa cum laude in Physics from ETH Zurich. Arno was named 2014 Big Data All-Star by Fortune Magazine.

The first part of the interview was published earlier.

Here is the second part of my interview with him:

Anmol Rajpurohit: Q4. What are your favorite H2O use cases?

Dr. Arno Candel: H2O is one of the fastest growing machine learning and data science communities, with over 10,000 new installations last year. We sold out our first Community Conference, H2O World, where customers and users presented their use cases. There were some fun ones we didn’t even know about. Being open source means that we typically hear from customers around the time they go into production.

For example, Cisco built a Propensity to Buy Model Factory using H2O. Paypal uses H2O for their Big Data Analytics initiatives and H2O Deep Learning for Fraud Detection. Ebay deploys H2O on their data science clusters with Mesos. ShareThis uses H2O for Conversion Estimation in Display Advertising to predict performance indicators such as CPA, CTR and RPM. MarketShare uses H2O to generate marketing plans and What-If scenarios for their customers. Vendavo is using it to build Pricing Engines for products, and Trulia uses it to find fixer-uppers in luxury neighborhoods. Some retailers and insurance companies are using it for nationwide modeling and prediction of demand, to manage just-in-time inventories and recommendations.

H2O Deep Learning is also being used for churn prediction, Higgs particle discovery (following a recent Nature paper), predicting the quality of Bordeaux vintages based on the local weather history, and many more use cases in healthcare, financial markets (time series) and insurance verticals.

H2O makes it easy to get great results with minimal effort, thanks to its user-friendly web interface, its high performance and scalability, and its built-in automation and model-tuning options. For example, we currently share the world record on the classic MNIST dataset with 0.83% test set error (for models without distortions, convolutions, or unsupervised learning), obtained with a simple one-line command from R. H2O also allows you to get fancy, and we have provided starter R scripts for various Kaggle challenges that beat the existing benchmarks. We recently hosted some of the world’s best Kagglers, who shared some of their secrets on competitive data science.

AR: Q5. What are the best resources to quick-start exploring H2O and engage with the community?

AC: You can literally get started in less than a minute. For R users, H2O is delivered as a simple R package on CRAN. Or you can download H2O and start it with the “java -jar h2o.jar” command and point your web browser to http://localhost:54321. Simply run that same command on different machines simultaneously for multi-node operation (or specify a flat file with IP addresses and ports). A similarly simple single command line is used for launching H2O as a Hadoop job. The H2O website has links to many resources, and I'd recommend our documentation, the H2O World tutorials and scripts, as well as the H2O Git Books (there’s one on H2O Deep Learning). We also have great video recordings of past events and presentation slides. I would also encourage the community to join our meetups and events. We typically have at least one meetup per week, and our brand-new office in Mountain View has ample space for meetups!
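The startup steps described above can be sketched as a short shell session. This is a hedged recipe, not an exact transcript: it assumes h2o.jar has already been downloaded from the H2O website, and the flat-file flag name follows H2O's documented launch options; the file name flatfile.txt is illustrative.

```shell
# Start a single H2O node (assumes h2o.jar is in the current directory),
# then point a web browser at http://localhost:54321
java -jar h2o.jar

# Multi-node operation: run the same command on each machine simultaneously,
# or list all nodes in a flat file of IP:port pairs and pass it at startup
# (flag name as per H2O's documentation):
#   java -jar h2o.jar -flatfile flatfile.txt
```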

We also created an H2O user forum to answer community questions. Our bug tracking system is public, and you can send questions, requests or comments to customer support. Lastly, you can follow H2O or me on Twitter.

AR: Q6. How do you see the evolution of High Performance Computing (HPC) in the last decade? Where do you see it headed in the future?

AC: The overall system scalability and power usage are key metrics for HPC systems. A decade ago, the fastest HPC systems were basically tens of thousands of today's iPads clustered together via custom networks. Today, the fastest systems have the equivalent of 100,000 powerful workstations with top-notch GPUs connected through even faster interconnects.

The HPC community has had some time to figure out how to take advantage of massively distributed systems, but now it needs to adapt to new programming models that blend CPUs and GPUs (or accelerators) together. Hiding some of the system complexities such as complicated memory hierarchies or fault tolerance issues from the average application programmer will help this transition.

The key to high performance is to reduce data movement of any sort as much as possible: hard drives, networks and even main memory have sped up at a much slower rate than the number-crunching processing units. Also, latencies have consistently lagged bandwidths across the board.

The thinking now is “compute is free”, “memory access is expensive”, “network or disk access is really painful”, and “if you must send data around, send large chunks and avoid random access patterns”.

Luckily, there have been some exciting algorithmic improvements towards reducing communication overhead for some standard linear algebra problems recently.

The next big frontier in HPC is called Exascale: 1 exaFLOPS (one trillion operations per microsecond) - roughly the equivalent of the raw processing power of a human brain. With today’s technology, it would require a dedicated power plant for its ~200 megawatt power usage. We expect that such a system can become cost-effective enough to operate by the end of this decade.
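The exaFLOPS figure above is easy to verify with a line of arithmetic; a minimal check in Python:

```python
# Sanity-check the exascale arithmetic: 1 exaFLOPS is 10^18 operations
# per second, and there are 10^6 microseconds in a second.
exa_flops = 10**18
microseconds_per_second = 10**6

# Operations completed in a single microsecond:
ops_per_microsecond = exa_flops // microseconds_per_second

print(ops_per_microsecond)  # 1000000000000, i.e. one trillion
```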

Note that we already store more than 10,000 exabytes (10 billion terabytes, or more than 1 terabyte per human) on the internet today, and we’re going to have even more stored data per available processing unit a decade from now. It will be interesting to see what the term Big Data stands for then...
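The storage figures can be checked the same way; the world-population value below is my own rough assumption for the era, used only to confirm the "more than 1 terabyte per human" claim:

```python
# Rough check of the storage figures: 1 exabyte = 10^6 terabytes.
exabytes_stored = 10_000
tb_per_exabyte = 10**6

total_tb = exabytes_stored * tb_per_exabyte  # 10^10 TB = 10 billion terabytes

world_population = 7_000_000_000  # rough figure for the period (assumption)
tb_per_person = total_tb / world_population

print(total_tb, tb_per_person)  # 10 billion TB, roughly 1.4 TB per person
```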

The third and last part of the interview.