Interview: Arno Candel, H2O.ai on How to Quick Start Deep Learning with H2O
We discuss H2O use cases, resources to quick-start Deep Learning with H2O, the evolution of High Performance Computing (HPC) over the last decade, and where HPC is headed.
Arno Candel holds a PhD and a Master's degree summa cum laude in Physics from ETH Zurich. He was named a 2014 Big Data All-Star by Fortune Magazine.
First part of the interview.
Here is the second part of my interview with him:
Anmol Rajpurohit: Q4. What are your favorite H2O use cases?
Arno Candel: For example, Cisco built a Propensity to Buy Model Factory using H2O. PayPal uses H2O for their Big Data Analytics initiatives and H2O Deep Learning for Fraud Detection. eBay deploys H2O on their data science clusters with Mesos. ShareThis uses H2O for Conversion Estimation in Display Advertising to predict performance indicators such as CPA, CTR and RPM. MarketShare uses H2O to generate marketing plans and What-If scenarios for their customers. Vendavo uses it to build Pricing Engines for products, and Trulia to find fixer-uppers in luxury neighborhoods. Some retailers and insurance companies use it for nationwide modeling and prediction of demand to manage just-in-time inventories and recommendations.
H2O Deep Learning is also being used for churn prediction, Higgs particle discovery (following a recent Nature paper), predicting the quality of Bordeaux vintages based on the local weather history, and many more use cases in healthcare, financial markets (time series) and insurance verticals.
H2O really makes it easy to get great results with minimal effort, thanks to its user-friendly web interface, its high performance and scalability, and the built-in automation and model tuning options. For example, we currently share the world record on the classic MNIST dataset with 0.83% test set error (for models without distortions, convolutions, or unsupervised learning), obtained with a simple one-line command from R. H2O also allows you to get fancy: we have provided starter R scripts for several Kaggle challenges that beat the existing benchmarks, and we recently hosted some of the world's best Kagglers, who shared some of their secrets on competitive data science.
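To make the "one-line command from R" concrete, here is a minimal sketch of training an H2O deep learning model on MNIST from R. The CSV paths are placeholders, the hidden-layer sizes are illustrative rather than the published record-setting configuration, and the argument names follow the current h2o R API, which has changed across H2O versions.

```r
library(h2o)
h2o.init()

# MNIST as CSVs with 784 pixel columns and the label in column 785 (assumed layout).
train <- h2o.importFile("path/to/mnist_train.csv")
test  <- h2o.importFile("path/to/mnist_test.csv")
train[, 785] <- as.factor(train[, 785])   # treat the label as a categorical target
test[, 785]  <- as.factor(test[, 785])

# Essentially a single call: a multi-layer rectifier network with dropout.
model <- h2o.deeplearning(x = 1:784, y = 785,
                          training_frame   = train,
                          validation_frame = test,
                          activation = "RectifierWithDropout",
                          hidden = c(1024, 1024, 2048),   # illustrative sizes only
                          epochs = 10)

h2o.performance(model, newdata = test)   # confusion matrix and test set error
```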
AR: Q5. What are the best resources to quick-start exploring H2O and engage with the community?
AC: You can literally get started in less than a minute. For R users, H2O is delivered as a simple R package on CRAN. Or you can download H2O directly and launch it from the command line.
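For example, a minimal R quick-start might look like the sketch below (assuming the current CRAN release of the h2o package; the CSV path is just a placeholder):

```r
# Install the h2o package from CRAN and load it.
install.packages("h2o")
library(h2o)

# Start (or connect to) a local H2O cluster on the JVM; nthreads = -1 uses all cores.
h2o.init(nthreads = -1)

# Confirm the cluster is up; h2o.importFile() would then load your data, e.g.
#   df <- h2o.importFile("path/to/your_data.csv")
h2o.clusterInfo()
```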
We also created an H2O user forum to answer community questions. Our bug tracking system is public, and you can send questions, requests or comments to customer support (support@0xdata.com). Lastly, you can follow H2O or me on Twitter.
AR: Q6. How do you see the evolution of High Performance Computing (HPC) in the last decade? Where do you see it headed in the future?
AC: The overall system scalability and power usage are key metrics for HPC systems. A decade ago, the fastest HPC systems were basically tens of thousands of today's iPads clustered together via custom networks. Today, the fastest systems have the equivalent of 100,000 powerful workstations with top-notch GPUs connected through even faster interconnects.
The HPC community has had some time to figure out how to take advantage of massively distributed systems, but now it needs to adapt to new programming models that blend CPUs and GPUs (or accelerators) together. Hiding some of the system complexities such as complicated memory hierarchies or fault tolerance issues from the average application programmer will help this transition.
The key to high performance is to reduce data movement of any sort as much as possible: hard drives, networks and even main memory have been speeding up at a much slower rate than the number-crunching processing units. Also, latencies have consistently lagged bandwidths across the board.
The thinking now is “compute is free”, “memory access is expensive”, “network or disk access is really painful”, and “if you must send data around, send large chunks and avoid random access patterns”.
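The "avoid random access patterns" point is easy to see for yourself. The sketch below (plain R, nothing H2O-specific; the vector size is arbitrary) sums the same numbers sequentially and through a random permutation: the arithmetic is identical, only the memory access pattern differs.

```r
n   <- 5e7                      # 50 million doubles (~400 MB)
x   <- runif(n)
idx <- sample.int(n)            # a random permutation of 1..n

system.time(sum(x))             # sequential scan: prefetch- and cache-friendly
system.time(sum(x[idx]))        # random gather before summing: far more data movement
```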
Luckily, there have recently been some exciting algorithmic improvements that reduce communication overhead for some standard linear algebra problems.
The next big frontier in HPC is called Exascale: 1 exaFLOPS (10^18 floating-point operations per second, or one trillion operations per microsecond), roughly the equivalent of the raw processing power of a human brain. With today's technology, such a system would require a dedicated power plant for its roughly 200 megawatts of power consumption. We expect that such a system can become cost-effective enough to operate by the end of this decade.
Note that we already store more than 10,000 exabytes (10 billion terabytes, or more than 1 terabyte per human) on the internet today, and we’re going to have even more stored data per available processing unit a decade from now. It will be interesting to see what the term Big Data stands for then...
The third and last part of the interview.