Interview: Arno Candel, H2O.ai on the Basics of Deep Learning to Get You Started
We discuss how Deep Learning differs from other Machine Learning methods, the unique characteristics and benefits of Deep Learning, and the key components of the H2O architecture.
Arno Candel holds a PhD and a Master's degree summa cum laude in Physics from ETH Zurich, and was named a 2014 Big Data All-Star by Fortune Magazine.
Here is my interview with him:
Anmol Rajpurohit: Q1. How do you define Deep Learning? How do you differentiate it from the rest of Machine Learning technologies?
Arno Candel: Compared to more interpretable Machine Learning techniques such as tree-based methods, conventional Deep Learning (using stochastic gradient descent and back-propagation) is a rather “brute-force” method that optimizes lots of coefficients (it is a parametric method), starting from random noise, by continuously looking at examples from the training data. It follows the basic idea of “(good) practice makes perfect” (similar to a real brain), without any strong guarantees on the quality of the model.
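To make that "practice makes perfect" loop concrete, here is a minimal, self-contained sketch of stochastic gradient descent with back-propagation on a toy one-hidden-layer network. It is plain NumPy, not H2O's implementation, and the task and settings are made up purely for illustration.

```python
# Weights start as random noise and are nudged a little for every training
# example seen, with no guarantee of reaching the best possible model.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic task: learn y = sin(x) from noisy samples.
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)

# One hidden layer with 32 tanh units; all weights initialized to random noise.
W1, b1 = rng.normal(scale=0.5, size=(1, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.5, size=(32, 1)), np.zeros(1)
lr = 0.05

for epoch in range(50):
    for i in rng.permutation(len(X)):            # stochastic: one example at a time
        x_i, y_i = X[i:i+1], y[i:i+1]
        h = np.tanh(x_i @ W1 + b1)               # forward pass
        pred = h @ W2 + b2
        err = pred - y_i
        # back-propagation: chain rule from the squared error to each weight
        dW2, db2 = h.T @ err, err.ravel()
        dh = (err @ W2.T) * (1 - h ** 2)
        dW1, db1 = x_i.T @ dh, dh.ravel()
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

final_pred = np.tanh(X @ W1 + b1) @ W2 + b2
print("final mean squared error:", float(np.mean((final_pred - y) ** 2)))
```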
Today’s typical Deep Learning models have thousands of neurons and learn millions of free parameters (connections between neurons), yet they do not even rival a fruit fly’s brain in terms of neuron count (~100,000). The most advanced dedicated Deep Learning systems learn tens of billions of parameters, which is still about 10,000x fewer than the number of neural connections in a human brain.
However, even some remarkably small Deep Learning models already outperform humans in many tasks, so the space of Artificial Intelligence is definitely getting more interesting.
AR: Q2. What characteristics enable Deep Learning to deliver such superior results for standard Machine Learning problems? Is there a specific subset of problems for which Deep Learning is more effective than other options?
AC: Deep Learning is really effective at learning non-linear derived features from the raw input features, unlike standard Machine Learning methods such as linear or tree-based models. For example, if age and income are the two features used to predict spending, then a linear model would greatly benefit from manually splitting the age and income ranges into distinct groups, while a tree-based model would learn to automatically dissect the two-dimensional space.
A Deep Learning model builds hierarchies of (hidden) derived non-linear features that get composed to approximate arbitrary functions such as sqrt((age-40)^2+0.3*log(income+1)-4) with much less effort than with other methods. Traditionally, data scientists perform many of these transformations explicitly based on domain knowledge and experience, but Deep Learning has been shown to be extremely effective at coming up with those transformations, often outperforming standard Machine Learning models by a substantial margin.
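As an illustration of that point, the following sketch (using scikit-learn rather than H2O, purely for brevity) fits a plain linear model and a small neural network to synthetic data generated from the example function above; the data ranges and model settings are all assumptions.

```python
# A linear model on raw age/income struggles with the non-linear target,
# while a small neural network learns the needed derived features on its own.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
age = rng.uniform(18, 80, 20000)
income = rng.uniform(10_000, 200_000, 20000)

# Target from the example function, clipped at zero so the sqrt stays real.
inner = (age - 40) ** 2 + 0.3 * np.log(income + 1) - 4
y = np.sqrt(np.clip(inner, 0, None))

X = np.column_stack([age, income])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(64, 64),
                                 max_iter=500, random_state=0)).fit(X_tr, y_tr)

print("linear R^2:", round(linear.score(X_te, y_te), 3))  # struggles on raw features
print("MLP R^2:   ", round(mlp.score(X_te, y_te), 3))     # learns the non-linear mapping
```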
Deep Learning is also very good at predicting high-cardinality class memberships, such as in image or voice recognition problems, or in predicting the best item to recommend to a user. Another strength of Deep Learning is that it can also be used for unsupervised learning where it just learns the intrinsic structure of the data without making predictions (remember the Google cat?). This is useful in cases where there are no training labels, or for various other use cases such as anomaly detection.
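For the unsupervised/anomaly-detection use case, a rough sketch of the idea looks like this: train a small network to reconstruct its own inputs (no labels involved) and flag the rows it reconstructs poorly. Again this is plain scikit-learn/NumPy for illustration, not H2O's implementation, and the synthetic data is made up.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# "Normal" rows live near a 3-dimensional subspace of the 10 input columns.
latent = rng.normal(size=(2000, 3))
mixing = rng.normal(size=(3, 10))
normal = latent @ mixing + 0.1 * rng.normal(size=(2000, 10))
# A handful of anomalous rows that ignore that structure.
outliers = rng.normal(0, 3, size=(20, 10))
data = np.vstack([normal, outliers])

# Autoencoder-style model: squeeze 10 inputs through a 3-unit bottleneck
# and learn to reproduce the inputs.
scaler = StandardScaler().fit(normal)
auto = MLPRegressor(hidden_layer_sizes=(3,), activation="identity",
                    max_iter=2000, random_state=0)
auto.fit(scaler.transform(normal), scaler.transform(normal))

# Rows with large reconstruction error do not fit the learned structure.
recon_error = ((auto.predict(scaler.transform(data))
                - scaler.transform(data)) ** 2).mean(axis=1)
threshold = np.percentile(recon_error, 99)
print("rows flagged as anomalous:", int((recon_error > threshold).sum()))
```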
AR: Q3. What are the key components of H2O architecture? What are the unique advantages of using H2O for Deep Learning pursuits?
AC: H2O is designed to process large datasets (e.g., from HDFS, S3 or NFS) at FORTRAN speeds, using a highly efficient (fine-grain) in-memory implementation of the famous MapReduce paradigm with built-in lossless columnar compression (which often beats gzip on disk). H2O doesn’t require Hadoop, but it can be launched on Hadoop clusters via MRv1, YARN or Mesos for seamless data ingest from HDFS.
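A typical workflow with the h2o Python client looks roughly like the sketch below; the host, port and HDFS path are placeholders, and the Hadoop launch command in the comment should be checked against the documentation for your H2O version.

```python
# Once an H2O cluster is running (standalone, or launched on Hadoop with
# something like `hadoop jar h2odriver.jar -nodes 3 -mapperXmx 6g -output hdfsOut`,
# per the H2O-on-Hadoop docs), the Python client attaches to it and pulls data
# straight from HDFS into H2O's compressed, distributed in-memory frames.
import h2o

h2o.init(ip="h2o-cluster.example.com", port=54321)   # attach to the running cluster (placeholder host)
frame = h2o.import_file("hdfs://namenode/data/transactions.csv")  # distributed ingest (placeholder path)
frame.describe()                                      # summary computed in-cluster, not on the client
```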
H2O and its methods are also backed by venture capital and some of the most knowledgeable experts in Machine Learning: Stanford professors Trevor Hastie, Rob Tibshirani and Stephen Boyd. Other independent mentors include Java API expert Josh Bloch and John Chambers, creator of the S language and member of R-core. We’ve literally spent days discussing algorithms, APIs and code together, which is a great honor and privilege. Of course, customers and users from the open source community are constantly validating our algorithms as well.
For H2O Deep Learning, we put lots of little tricks together to make it a very powerful method right out of the box. For example, it features automatic adaptive weight initialization, automatic data standardization, expansion of categorical data, automatic handling of missing values, automatic adaptive learning rates, various regularization techniques, automatic performance tuning, load balancing, grid search, N-fold cross-validation, checkpointing, and different distributed training modes on clusters for large datasets. And the best thing is that the user doesn’t need to know anything about Neural Networks, and there are no complicated configuration files. It’s just as easy to train as a Random Forest, and it simply makes predictions for supervised regression or classification problems. For power users, there are also quite a few (well-documented) options that enable fine control of the learning process. By default, H2O Deep Learning will fully utilize every single CPU core on your entire cluster and is highly optimized for maximum performance.
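As a rough sketch of that "out of the box" workflow with the h2o Python client (the dataset path and column names are hypothetical; class and parameter names follow h2o-3's documented API):

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
churn = h2o.import_file("churn.csv")             # hypothetical dataset
churn["churned"] = churn["churned"].asfactor()   # treat the target as a class label

# Defaults handle standardization, categorical expansion, missing values,
# adaptive learning rates, etc.; nfolds adds N-fold cross-validation.
model = H2ODeepLearningEstimator(hidden=[200, 200], epochs=10, nfolds=5)
model.train(x=[c for c in churn.columns if c != "churned"],
            y="churned", training_frame=churn)

print(model.model_performance(xval=True).auc())  # cross-validated AUC
preds = model.predict(churn)                     # class probabilities + predicted label
```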
I share our CEO and co-founder SriSatish Ambati’s vision that a whole ecosystem of smart applications can emerge from these recent advances in machine intelligence and fundamentally enrich our lives.
Second part of the interview.