Exclusive Interview with Alexander Gray, Skytree CEO: Fast, Automated, Machine Learning Software for Free?

We discuss how Skytree compares with the competition, how it performs relative to expert Data Scientists, how Skytree AutoModel compares to Deep Learning, and more.

Skytree recently released free, easy-to-use machine learning interfaces, which include a Python SDK, a REST API, and a GUI, as well as a Unix command line interface.

This is a very interesting development for Data Scientists, and I recently had a chance to discuss it with Alexander Gray, CEO of Skytree, which he co-founded in 2010. Until recently he served as an Associate Professor at Georgia Tech. His research aims to scale up all of the major practical methods of machine learning to massive datasets, as well as to develop new statistical methodology and theory, and he has developed a number of current state-of-the-art algorithms for several key problems.

Alexander began working with massive scientific datasets in 1993 at the NASA JPL Machine Learning Systems Group. High-profile applications of his large-scale ML algorithms have been described in Science and Nature, including contributions to work selected by Science as the Top Scientific Breakthrough of 2003. He has won or been nominated for a number of best paper awards in statistics and data mining, is a recipient of the NSF CAREER Award, and is a National Academy of Sciences Kavli Scholar. He received his PhD in Computer Science from CMU.

Here is my interview.

Gregory Piatetsky: There is a lot of demand for Automating Data Science, which is a more scalable approach to solving the shortage of Data Scientists. However, you have a lot of competition from companies and tools like DataRobot, Automatic Statistician, KXEN (now part of SAP), and others. How is the Skytree product better than, and different from, the competition?

Alexander Gray: As you know, I spent a number of years at NASA's Jet Propulsion Lab, in its pioneering Artificial Intelligence and Machine Learning Systems groups. Even as long ago as then (over two decades ago), we had built systems which automated data science -- some successful, some less so. I'm glad to see newcomers make attempts at this very large and difficult problem. In my humble experience though, a number of important aspects are needed, which informs our philosophy and approach. Data scientists are smart, and don't ultimately want to close their eyes and hope the automation will do everything perfectly. Our (patent-pending) approach offers a smooth continuum between full control and full automation. Because true automation of ML is non-trivial, some of the approaches you mentioned focus on a niche in terms of techniques -- for example Gaussian processes or linear methods. Our approach is to offer a broad range of more general techniques to cover the kinds of ML problems that most people have, including both parametric and nonparametric methods.

GP: How does Skytree AutoModel perform relative to expert Data Scientists? For example, what would its results be in some Kaggle competitions or KDD Cups?

AG: In our own extensive experiments, the results produced by our AutoModel capability are generally indistinguishable from those that our own expert data scientists or our clients can get, if not better, and in less time. Mathematically speaking, if you let it run long enough, it will keep exploring more and more of the space of possible models, and the meta-algorithms are such that the models will keep improving until they eventually reach the best possible error the data can support (i.e. the Bayes error in a classification problem).

Fig 1: Skytree Automodel (GBT) performance (Gini score) as the number of automated experiments increases along the x axis. It explores over the space of ML methods - in this case GBT (gradient boosted trees), GLMC (generalized linear model classifier), and SVM (linear in this case) - as well as the parameter space within each method.

Thus, it's almost inevitable that the results will meet or exceed what even an expert can achieve. We see this in our validations on past competition data. Now that we've released a free version of some of our capabilities, we'll be looking to the community to give us feedback on their experience with the free software in competitions, which will help us to keep improving it and hopefully raising the bar on what it takes to win competitions.
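
To make the idea of exploring both the space of methods and the parameter space within each method more concrete, here is a minimal sketch of a plain random search. It is not Skytree's (patent-pending) AutoModel algorithm, whose details are not given in the interview. The method names come from the figure caption; the parameter names, their ranges, and the `toy_evaluate` scoring function are hypothetical stand-ins for real model training and validation.

```python
import random

# Method names are taken from the figure caption. The parameter names and
# ranges below are illustrative assumptions, not Skytree settings.
SEARCH_SPACE = {
    "GBT": {"num_trees": (10.0, 500.0), "learning_rate": (0.01, 0.3)},
    "GLMC": {"regularization": (0.0001, 10.0)},
    "SVM": {"regularization": (0.0001, 10.0)},
}


def sample_config(rng):
    """Pick a method at random, then sample each of its parameters uniformly."""
    method = rng.choice(sorted(SEARCH_SPACE))
    params = {
        name: rng.uniform(lo, hi)
        for name, (lo, hi) in SEARCH_SPACE[method].items()
    }
    return method, params


def toy_evaluate(method, params):
    """Synthetic score in [0.5, 1.0] that peaks at the middle of each range.

    A real system would train the model and return a validation metric
    (for example a Gini score) here instead.
    """
    ranges = SEARCH_SPACE[method]
    penalty = sum(
        abs(params[name] - (lo + hi) / 2) / (hi - lo)
        for name, (lo, hi) in ranges.items()
    ) / len(ranges)
    return 1.0 - penalty


def random_search(evaluate, num_experiments, seed=0):
    """Run experiments and record the best score seen after each one."""
    rng = random.Random(seed)
    best = None
    history = []
    for _ in range(num_experiments):
        method, params = sample_config(rng)
        score = evaluate(method, params)
        if best is None or score > best[0]:
            best = (score, method, params)
        history.append(best[0])
    return best, history


if __name__ == "__main__":
    best, history = random_search(toy_evaluate, num_experiments=50)
    print("best method:", best[1], "score:", round(best[0], 3))
```

Because `history` stores the best score so far, it can never decrease as more experiments are run, which is the general shape a best-so-far curve plotted against the number of experiments would have. Uniform random sampling is only the simplest possible strategy; the meta-algorithms described in the interview are presumably more sophisticated.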

GP: How does it compare vs Deep Learning methods?

AG: Deep learning is powerful, but ultimately just one of many possible nonparametric machine learning methods. In our experience, it happens that for the majority of common business use cases, deep learning doesn't end up being the best overall choice, but that is the subject of a different interview... AutoModel's suite of methods will ultimately include deep learning as well. Then users will not have to keep wondering when deep learning is the best approach and when it isn't.

GP: How long will it take until Deep Learning approaches are also automated?

AG: Our approach to automation extends easily to deep learning, so stay tuned for this down the line.

GP: Why will data scientists eventually want to upgrade to the full version?

AG: The free community edition of our software is a single-user desktop version (single personal computer) that can be used to train an ML model on up to 100 million data elements (number of data elements = rows x columns). Support for the free version is via our Skytree user community, which will also be monitored by Skytree experts for assistance. The first free version, which we put out just last week, includes our command line interface but not, initially, our graphical user interface or Python SDK; we are planning on releasing free versions of those as well in March. The paid-for version has the data cap lifted. It also can be deployed in a shared, distributed/cluster environment. The paid-for version includes direct support from Skytree with SLAs. So, data scientists will want to upgrade if they are building models from larger data sets (over 100 million elements), want to use a single shared instance, want to use the multi-node distributed capability, and/or want more dedicated customer support.
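
As a quick illustration of the data cap arithmetic described above (data elements = rows x columns, up to 100 million in the free edition), here is a small sketch. The function names are hypothetical and are not part of any Skytree interface; only the formula and the 100 million figure come from the interview.

```python
# Cap quoted in the interview for the free community edition.
FREE_EDITION_CAP = 100_000_000


def data_elements(rows, columns):
    """Number of data elements as defined in the interview: rows x columns."""
    return rows * columns


def fits_free_edition(rows, columns, cap=FREE_EDITION_CAP):
    """True if a table of this shape is within the 'up to 100 million' cap."""
    return data_elements(rows, columns) <= cap


if __name__ == "__main__":
    # 1 million rows by 50 columns is 50 million elements.
    print(fits_free_edition(1_000_000, 50))
    # 5 million rows by 50 columns is 250 million elements.
    print(fits_free_edition(5_000_000, 50))
```

For example, a table with 1 million rows and 50 columns has 50 million elements and is within the cap, while 5 million rows by 50 columns (250 million elements) is over it.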