Glimpses & Impressions: Strata Silicon Valley AI + ML Review – Part One
Read some impressions from a visit to Strata Silicon Valley in March. The focus is on integration of data science and machine learning tools, as well as the simplification of related processes.
1.3 Simplifying Model Deployment
Current open-source data science tools are typically weak at model deployment. Older commercial software usually relies on PMML for model export and deployment, but this format has proven less than ideal in practice. Many companies deploy models with code written by their in-house developers rather than with PMML-supporting commercial solutions, and under those circumstances going through PMML can be counterproductive. The platforms we saw during our visit, on the other hand, tend to take a simpler and more direct approach: they generate code for the trained model directly in Java, Python, C/C++, and other languages. The user can either copy the code from the program's interface or export it as a file. This makes model deployment much easier.
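To make the idea of exporting a model as code concrete, here is a minimal sketch in Python. It is not taken from any of the platforms mentioned; the tree structure, feature names, thresholds, and the `export_tree_to_python` helper are all hypothetical, chosen only to show how a trained model can be turned into a standalone function that developers can paste into their own application.

```python
# Hypothetical trained model: a tiny decision tree stored as nested dicts.
# Feature names, thresholds, and labels are made up for illustration.
TREE = {
    "feature": "income",
    "threshold": 50000.0,
    "left": {"label": "reject"},
    "right": {
        "feature": "age",
        "threshold": 30.0,
        "left": {"label": "review"},
        "right": {"label": "approve"},
    },
}


def export_tree_to_python(tree, func_name="predict"):
    """Return Python source for a standalone function that evaluates the tree."""
    lines = [f"def {func_name}(row):"]

    def emit(node, depth):
        indent = "    " * depth
        if "label" in node:
            # Leaf node: return the stored label.
            lines.append(f"{indent}return {node['label']!r}")
            return
        # Internal node: go left when the feature is <= the threshold.
        lines.append(
            f"{indent}if row[{node['feature']!r}] <= {node['threshold']!r}:"
        )
        emit(node["left"], depth + 1)
        lines.append(f"{indent}else:")
        emit(node["right"], depth + 1)

    emit(tree, 1)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the generated source; it has no dependency on the training tool.
    print(export_tree_to_python(TREE))
```

The generated text is plain Python with no dependency on the tool that trained the model, which is the property that makes this style of export attractive compared with an intermediate format.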
Aside from direct code export, Domino Data Lab also offers a simpler and more efficient method of deployment. With one click, we can deploy a trained model as a RESTful API for others to call. This is valuable not only for its simplicity, but also because it extends the boundaries of data science from analytical modeling to model deployment and application, seamlessly linking modeling with prediction. Domino Data Lab runs its models on Docker, which helps models deploy and run quickly.
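For readers who have not seen a model served this way, the following sketch shows roughly what a prediction endpoint looks like, using only the Python standard library. It is not Domino Data Lab's API: the `/predict` path, the `predict` stand-in model, its coefficients, and the `run` helper are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a trained model: a hypothetical linear score."""
    weights = {"income": 0.00001, "age": 0.01}  # made-up coefficients
    score = sum(weights[name] * float(features.get(name, 0.0)) for name in weights)
    return {"score": round(score, 4)}


def handle_predict(body):
    """Turn a raw JSON request body into (status_code, response_dict)."""
    try:
        features = json.loads(body.decode("utf-8"))
        if not isinstance(features, dict):
            raise ValueError("expected a JSON object of feature values")
        return 200, predict(features)
    except (ValueError, TypeError) as exc:
        # Malformed JSON or non-numeric feature values end up here.
        return 400, {"error": str(exc)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def run(port=8000):
    """Serve the model on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `run()` starts the server, after which any client that can send an HTTP POST with a JSON body to `/predict` can use the model without knowing how it was trained. A one-click platform feature presumably automates this wiring along with packaging and hosting.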
1.4 Model and Experiment Results Management
Every data science project produces many different models and sets of experiment results. These data science platforms all have very good management systems for trained models and test results, with an auto-save function, so data scientists can access these models and results at any time. When we use open-source tools, we often have to manage this data manually, which is not only inefficient but also error-prone. We can regard the auto-save function as a way of accumulating knowledge and experience: it is much easier to solve a new problem if we have similar old ones for reference.
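As a rough illustration of what such auto-saving involves when done by hand, the sketch below writes every run to its own JSON file and can look up the best run later. The `RunStore` class, its methods, and the record fields are hypothetical and do not correspond to any of the platforms discussed here.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


class RunStore:
    """Saves every experiment run as a JSON file under one directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, model_name, params, metrics):
        """Persist one run and return its generated id."""
        run_id = uuid.uuid4().hex[:8]
        record = {
            "run_id": run_id,
            "saved_at": time.time(),
            "model": model_name,
            "params": params,
            "metrics": metrics,
        }
        (self.root / f"{run_id}.json").write_text(json.dumps(record, indent=2))
        return run_id

    def load_all(self):
        """Return all saved runs, oldest first."""
        runs = [json.loads(p.read_text()) for p in self.root.glob("*.json")]
        return sorted(runs, key=lambda r: r["saved_at"])

    def best(self, metric, higher_is_better=True):
        """Return the saved run with the best value for the given metric."""
        pick = max if higher_is_better else min
        return pick(self.load_all(), key=lambda r: r["metrics"][metric])


if __name__ == "__main__":
    # Illustrative values only; these are not real experiment results.
    with tempfile.TemporaryDirectory() as folder:
        store = RunStore(folder)
        store.save("tree", {"depth": 3}, {"auc": 0.81})
        store.save("tree", {"depth": 5}, {"auc": 0.84})
        print(store.best("auc")["params"])
```

Even this small amount of bookkeeping has to be called consistently after every run to be useful, which is the manual effort the platforms remove.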
Skytree has also made it easier to compare different sets of models and experiment results. Much like the car model comparison feature on Autohome, Skytree lets users compare multiple models side by side in a single interface. This makes it easier for data scientists to systematically weigh the advantages and drawbacks of different models.
Here I must single out Domino Data Lab as an industry leader in this respect. Aside from managing models and experiment results, Domino Data Lab also stores each experiment's context, including the hardware, software, data used, parameter settings, and the experiments themselves. This function, too, is achieved through Docker.
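To give a sense of what an experiment's context contains, here is a small sketch that records the hardware, software, a fingerprint of the data, and the parameter settings for one run. It does not use Docker and is not how Domino Data Lab implements the feature; the `capture_context` function and its field names are assumptions for illustration only.

```python
import hashlib
import json
import platform


def capture_context(data_bytes, params):
    """Collect a JSON-serializable snapshot of one experiment's context."""
    return {
        "hardware": {
            "machine": platform.machine(),
            "processor": platform.processor(),
        },
        "software": {
            "python": platform.python_version(),
            "os": platform.platform(),
        },
        # A hash identifies the exact data used without storing a copy of it.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params": params,
    }


if __name__ == "__main__":
    # Hypothetical data and parameters, used only to show the output shape.
    sample = b"income,age\n60000,45\n"
    print(json.dumps(capture_context(sample, {"depth": 5}), indent=2))
```

A record like this only describes the environment. A container image goes further by packaging the environment itself so that the experiment can be rerun.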
1.5 Cooperative Mechanism
Another feature that sets Domino Data Lab apart is that it enables data scientists to work together. Comments are enabled at every step of a project, from authorization management to each dataset, model, and set of results. This makes it convenient for data scientists to cooperate on data science projects and to share both knowledge and experience, which helps make data science teams more efficient overall.
Aside from these clear trends in data science platform development, Continuum Analytics deserves special mention, because its data science platform solution is very different from the rest of the pack. While others are working on integrating as many diverse tools as possible, Continuum is committed to working exclusively with Python. Its two founders, Travis Oliphant and Peter Wang, are both experienced Python experts, and many Python contributors work at the company. We were lucky to have Peter explain things to us personally and to attend his demo, where his passion for Python as a platform was obvious. Continuum Analytics offers Anaconda, a Python distribution that includes more than 300 Python packages and simplifies package management and deployment for Python data science. Furthermore, to overcome Python's weak support for big data, the company has added parallel processing support to many Python packages related to data science. At the Strata conference, Continuum had a fully functional 12-device Raspberry Pi 2 cluster running Anaconda. They also left a deep impression with their aforementioned data visualization capabilities.
Aside from these data science platforms' functions and technologies, I was also very interested in the makeup of their current clientele. During the Strata conference, my last two questions for all of these companies were "Who are your main clients at this moment?" and "Do you have clients in China?" Almost all of them answered "finance companies" to the first question and "no" to the second. We see that in the U.S., aside from the Internet companies (which rarely shop for commercial software and platforms, as they have the best in-house engineers on the planet), the booming financial industry has the highest demand for data scientists. We at TalkingData have felt that China's finance companies have a similar need for data science. However, these data science platform companies have yet to establish a presence in China. Perhaps this represents an entrepreneurial opportunity for the Chinese.
Bio: Zhang Xiatian has long been engaged in data mining and machine learning research, and has published dozens of research papers and holds a number of patents. Xiatian is now fully responsible for mobile big data mining and for ML algorithm research and implementation at TalkingData. He previously worked at IBM China Research Institute, the Tencent data platform, and Huawei Noah's Ark Lab.