Cloud Machine Learning Wars: Amazon vs IBM Watson vs Microsoft Azure

Amazon recently announced Amazon Machine Learning, a cloud machine learning solution for Amazon Web Services. Able to pull data effortlessly from RDS, S3 and Redshift, the product could pose a significant threat to Microsoft Azure ML and IBM Watson Analytics.

In two previous posts, I covered the emerging industry of cloud-based machine learning solutions. First, I covered Microsoft's Azure Machine Learning and IBM's Watson Analytics. Microsoft's Azure ML provides a graphical drag-and-drop interface for connecting preprogrammed components of a data science pipeline. The service is similar to KNIME and seems targeted at users who know just enough to know what to do, but not so much that they would want to code up fresh algorithms. One selling point of Microsoft's product is its smooth integration for companies that already store their data in Microsoft's Azure cloud.

In contrast, IBM's Watson Analytics was less clear about the scope of its services. Machine learning functionality was absent apart from a simple regression feature, which did not yet work on large datasets when we tried it. In general, the service was limited to very small datasets loaded from .csv files. The product shone most at data visualization. While it offered a natural language interface for asking questions about datasets, the only questions reliably answered were of the form "how does X vary with Y", where X and Y are any two attributes in the dataset.

In the next post, I described MetaMind, a new startup based on the research of Richard Socher with backing from Khosla Ventures. This company takes a radically different approach, targeting software developers with products that offer machine learning services via an API. Focused on the capabilities of deep learning systems, MetaMind provides sentiment analysis of text and annotates images with the objects they contain.

Amazon Enters the Fray

Adding to the list of companies competing in this space, Amazon recently announced the launch of Amazon Machine Learning. In contrast to offerings from Microsoft and IBM, Amazon's product has a much more focused mission. While Microsoft is betting that users will drag and drop boxes to perform each step of a data pipeline, and IBM offers an open-ended service for interrogating data, Amazon is focused squarely on a fully automatic tool for supervised machine learning.

Supervised machine learning refers to the setting where each datapoint is associated with some target variable. When the target variable is a binary quantity, the problem is called binary classification. When the target is categorical (with more than 2 categories), the problem is called multi-class classification. When the target is real-valued (a floating point number), the problem is called regression. These are the three services offered by Amazon Machine Learning. While many algorithms exist for each of these three tasks, Amazon places minimal responsibility for the choice of algorithm in the user's hands, instead offering a nearly fully automated solution for supervised learning problems.
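To make the three problem types concrete, here is a minimal Python sketch that labels a supervised learning task by looking at its target values. The function name infer_task and the rule it applies (all-float targets mean regression, otherwise count the distinct values) are my own illustrative assumptions, not Amazon's actual logic.

```python
def infer_task(targets):
    """Guess the supervised learning task from a list of target values.

    Hypothetical helper for illustration only; this is not how
    Amazon Machine Learning decides.
    """
    distinct = set(targets)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct target values")
    # Real-valued targets (Python floats) are treated as regression.
    if all(isinstance(t, float) for t in targets):
        return "regression"
    # Otherwise the target is treated as categorical.
    if len(distinct) == 2:
        return "binary classification"
    return "multi-class classification"


if __name__ == "__main__":
    print(infer_task([0, 1, 1, 0]))
    print(infer_task(["cat", "dog", "bird"]))
    print(infer_task([1.5, 2.0, 3.7]))
```

A real system would need a more careful rule (for example, integer-coded targets with many distinct values are often better treated as regression), but the sketch captures the distinction drawn above.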

Data Acquisition
Likely the product's killer feature is that it can load your data from anywhere it might live in Amazon's vast network of web services. This includes relational data stored in RDS, CSV files stored in S3, and data in Amazon's Redshift data warehouse. Given Amazon's primacy in virtualized web services, this is likely to appeal to internet companies, many of which already have their data in Amazon's ecosystem. For those who want to take the software for a spin but do not have any datasets in Amazon's cloud, Amazon provides a sample dataset (bank.csv) that contains dummy data for bank customers.

One nice feature of Amazon's service is that it automatically combs through the data, identifying which fields are numerical and which are categorical. Further preprocessing (whitening and dimensionality reduction) is presumably performed automatically, since the service never troubles the user to select preprocessing options. It seems that Amazon has astutely surmised that a user who farms out machine learning tasks to a service provider is unlikely to have strong preferences about data preprocessing methodology.
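Amazon does not document here how it tells numerical fields from categorical ones, but the idea can be sketched in a few lines of Python. The function infer_field_types, the rule it uses (a column is numerical only if every value parses as a number), and the sample columns are all assumptions made for illustration; they are not taken from Amazon's service or from the real bank.csv.

```python
import csv
import io


def _is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_field_types(csv_text):
    """Label each CSV column as 'numerical' or 'categorical'.

    Hypothetical heuristic: a column is numerical only when every
    one of its values parses as a number.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    types = {}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        if all(_is_number(v) for v in values):
            types[name] = "numerical"
        else:
            types[name] = "categorical"
    return types


if __name__ == "__main__":
    # Made-up rows, loosely in the spirit of a bank customer table.
    sample = "age,job,balance\n30,admin,1200.50\n41,technician,-35\n"
    print(infer_field_types(sample))
```

A production service would also have to handle missing values and numeric codes that are really categories (such as zip codes), which this sketch ignores.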