Glimpses & Impressions: Strata Silicon Valley AI + ML Review – Part One

Read some impressions from a visit to Strata Silicon Valley in March. The focus is on integration of data science and machine learning tools, as well as the simplification of related processes.

By Zhang Xiatian, Chief Data Scientist, TalkingData.

At the end of March, I had the good fortune of visiting Silicon Valley to attend Strata + Hadoop World 2016. As it was my first visit to Silicon Valley, I made pilgrimages to more than a dozen technology companies. At the Strata conference, I made it a priority to talk with machine learning and artificial intelligence technology companies. Two things from this trip left the deepest impressions on me: 1. I realized that start-up companies are the main driving force behind the flourishing field of data science tools. The old guard, the vendors that have long been involved in the associated fields, has fallen behind in terms of product function, design philosophy, and user interface. 2. Even though we only visited two AI technology companies (NovuMind and Numenta), we came away with a great deal of insight. We were able to observe firsthand Dr. Ren Wu’s quest to take deep learning to new heights, as well as the development of the non-DL neural network model known as hierarchical temporal memory. This article expands upon these impressions.

Strata + Hadoop World

I. Trends in data science platform development

The rapid growth of big data and the demand for data science have ushered in higher standards for data science tools and platforms. Commercial statistical analysis and modeling tools such as SAS and SPSS can no longer satisfy market demands. These traditional RDBMS-based commercial tools have lost their relative advantage as data volumes have swelled and Hadoop has proliferated. At the same time, in a market that has seen a constant stream of new machine learning algorithms and tools, commercial software systems have been too slow to adapt and evolve. New open source software has gained wide acceptance over the old analysis and modeling tools. However, these new tools all have their own focuses, strong points, and drawbacks. As a result, data science professionals face the challenge of getting the most out of each of them. Given that there is a relative shortage of qualified data science professionals in the current job market, it’s worth our while to examine how to use these new tools effectively and make them more accessible to users. Many data science platform startups have attempted to answer this very question.

During the Strata conference, I compared notes with representatives from a number of companies offering data science platforms, including H2O, Domino Data Lab, DataRobot, Skytree, Anaconda, SAS, and Dato. I viewed several product demos and saw many interesting things, from which I identified five current trends in data science platform development:

  1. data science tools integration,
  2. making the analytical modeling process more efficient,
  3. simplified model deployment,
  4. managing modeling and experiment results,
  5. introducing cooperation mechanisms.

In the following sections, I will explain these trends in greater detail.

1.1 Integration of different data science tools.

Data science tools

As there are already many open source and commercial data science tools on the market, many data science platforms have no desire to reinvent the wheel. Instead, they aim to integrate these different tools into a single platform. Domino Data Lab, DataRobot, and H2O are all examples of this kind of approach. Even though H2O does offer its own algorithm library, its platform is still compatible with Spark’s MLlib. DataRobot is compatible with H2O, Spark MLlib, Python and R. Domino Data Lab does integration best because, aside from many open source tools, it also integrates a number of commercial tools and platforms, including Matlab and SAS. These platforms generally achieve integration in one of two ways. The first is to wrap the tools’ models into packages and turn them into built-in modules within the platform. The second is to offer notebooks and let users work with these tools by coding directly on the platform (which usually supports Python, R and Spark). Both approaches allow us to switch easily back and forth between different tools on a single platform while doing analytical modeling.

I believe that Domino Data Lab has the most advanced technology for platform integration. It uses Docker containers to host the different data science tools, an approach that has proven to be both flexible and extensible. Furthermore, a container can be as fine-grained as a single model training or test run, which means that we can easily use many different tools during a single project.

1.2 High Efficiency in Analytical Modeling

Currently we have identified three technological trends that are associated with highly efficient analytical modeling: visual pipelines, automatic modeling, and powerful visualization of data and models.

With a visual pipeline, we can quickly assemble an analysis workflow by simple drag-and-drop, from data import and data processing through modeling to the final steps. Of course, this is hardly a new invention; SPSS and SAS have been able to do so for years. The idea has been widely adopted by current data science tools, which suggests that it contributes significantly to a more efficient analytical modeling process.

DataRobot's automatic model training is often touted as its main selling point. After choosing a data set, a DataRobot user is only a single click away from building the best model the platform can find. Perhaps this automation is the reason behind DataRobot's name. For a given problem, DataRobot will attempt multiple algorithms and configurations at once, and it can simultaneously train and test more than 1,000 models to determine which one of them is the best. Skytree has applied for a patent for its AutoModel, which has similar abilities to DataRobot but to a lesser degree. Skytree's people told me that Skytree's automation process is not about performing an exhaustive search. On the contrary, it is based on search strategies that prune search paths. Like the automatic mode on cameras that allows the average Joe to take decent photos, automated modeling can make analytical modeling much easier and more efficient, making data science tools much more accessible to non-professionals. At the same time, it also offers a lot to data scientists by enabling them to quickly compare different algorithms, narrow down potential model choices, and make further adjustments based on their selections.
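To make the idea of automatic modeling concrete, here is a minimal sketch in plain Python. It is not DataRobot's or Skytree's implementation; the synthetic data, the three toy model families, and every function name are assumptions chosen purely for illustration. It fits each candidate on a training split, scores it on a validation split, and keeps the one with the lowest error.

```python
import random

def make_data(n, seed=0):
    """Synthetic 1-D regression data, y = 3x + noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0, 1)) for x in xs]

def fit_mean(train):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_linear(train):
    """Ordinary least squares for a single feature."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def fit_knn(k):
    """Return a fitting function for a k-nearest-neighbour regressor."""
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
            return sum(y for _, y in nearest) / len(nearest)
        return predict
    return fit

def mse(model, data):
    """Mean squared error of a fitted model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def auto_select(candidates, train, valid):
    """Fit every candidate, score it on the validation set, return the best."""
    scores = {name: mse(fit(train), valid) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

data = make_data(200)
train, valid = data[:150], data[150:]
candidates = {
    "mean": fit_mean,
    "linear": fit_linear,
    "knn_1": fit_knn(1),
    "knn_5": fit_knn(5),
    "knn_25": fit_knn(25),
}
best_name, scores = auto_select(candidates, train, valid)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:>7}: validation MSE = {score:.3f}")
print("selected:", best_name)
```

This loop is an exhaustive search over a handful of candidates. A pruning strategy of the kind Skytree described would instead drop unpromising candidates early, for example after scoring them on a small sample, rather than fully training all of them.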

To be fair, automation of model training or model selection is not a new topic; academics have been researching it for some time. However, up until now, the biggest obstacle standing in the way of its application has been procuring sufficient computing resources. It would take far too long for the older data science tools to train more than 1,000 models for a single problem the way DataRobot can today. DataRobot and Skytree can support rapid automatic model training and selection because of parallel computing; thanks to big data technology, they can assign a large number of model training tasks to run simultaneously on different nodes.
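The fan-out pattern described here can be sketched with Python's standard `concurrent.futures` module. This is only an illustration under stated assumptions: a local thread pool stands in for cluster nodes, the k-nearest-neighbour configurations and synthetic data are hypothetical, and nothing here reflects how DataRobot or Skytree actually schedule work.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Synthetic data shared by all training tasks (illustrative only).
_rng = random.Random(42)
DATA = []
for _ in range(120):
    x = _rng.uniform(0, 10)
    DATA.append((x, 2.0 * x + _rng.gauss(0, 1)))
TRAIN, VALID = DATA[:90], DATA[90:]

def train_and_score(k):
    """One independent task: build a k-NN regressor, return (k, validation MSE)."""
    def predict(x):
        nearest = sorted(TRAIN, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    error = sum((predict(x) - y) ** 2 for x, y in VALID) / len(VALID)
    return k, error

def parallel_search(configs, workers=4):
    """Submit every configuration to a worker pool and pick the best result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        results = list(pool.map(train_and_score, configs))
    best = min(results, key=lambda r: r[1])
    return best, results

configs = list(range(1, 41))  # 40 hypothetical model configurations
(best_k, best_error), results = parallel_search(configs)
print(f"evaluated {len(results)} models; best k = {best_k}, MSE = {best_error:.3f}")
```

Because each task is independent, the same structure scales from a thread pool to separate processes or machines. In CPython, threads will not speed up CPU-bound work like this toy example, so a real system would distribute the tasks across processes or cluster nodes, as the platforms above do.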

Efficiency also improves through powerful visualization. First, there is the visualization of data, which goes beyond plotting basic feature data. Visualizing the relationships among feature variables makes it easy for data scientists to perform feature selection and data processing. Even more powerful is the visualization of model testing results and model interpretation, as it allows one to see important information such as a model's accuracy and feature importance. It's only fitting that DataRobot calls this function "Model X-Ray": like x-ray machines, visualization sheds light on black box models.
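One common way to compute the kind of feature importance that such displays show is permutation importance: shuffle one feature column at a time and measure how much the model's error grows. The sketch below uses that technique as a stand-in; it is not a description of how Model X-Ray works, and the `black_box` function, the data, and the text bar chart are all hypothetical.

```python
import random

def black_box(row):
    """Stand-in for a trained model: a fixed hypothetical function, not a real fit."""
    return 4.0 * row[0] + 0.5 * row[1]  # deliberately ignores row[2]

def mse(model, rows, targets):
    """Mean squared error of the model over the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, seed=0):
    """Error increase when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(model, shuffled, targets) - baseline)
    return importances

rng = random.Random(1)
rows = [[rng.random() for _ in range(3)] for _ in range(300)]
targets = [4.0 * r[0] + 0.5 * r[1] + rng.gauss(0, 0.1) for r in rows]

importances = permutation_importance(black_box, rows, targets)
for name, score in zip(["x0", "x1", "x2"], importances):
    # Crude text bar chart: one '#' per 0.05 of error increase.
    print(f"{name}: {'#' * int(score * 20)} {score:.3f}")
```

A feature the model relies on heavily produces a large error increase when shuffled, while a feature the model never uses produces none, which is the information a feature importance chart conveys at a glance.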

Continuum Analytics is at the forefront of this visualization wave. Not only do they have dazzling visualization displays (especially their fine-tuned geographic visualization), they also support interacting with data directly in the visualization interface. As we can select data directly from the interface for further processing, our analytical modeling process becomes markedly more effective. Unlike automatic modeling, this approach does not seek to remove human input from the process; on the contrary, it incorporates human input in order to achieve improvements in efficiency and quality. Of course, these two philosophies are not inherently at odds with each other. One of them seeks to push the machine's abilities to their limits, while the other seeks to do the same with humans. Combining the two brings us closer to the ideal of analytical modeling.