IBM Watson Analytics vs. Microsoft Azure Machine Learning (Part 1)
The IBM Watson Analytics prototype seeks to abstract away data science, taking ordinary natural language queries and answering them based on the content of uploaded datasets. Microsoft Azure Machine Learning goes the opposite route, streamlining existing data mining methodology for fast results and integration with Microsoft's other cloud services.
Last week, IBM released a public beta of Watson Analytics, a platform for data exploration, visualization and predictive analytics. This product follows on Microsoft's Azure Machine Learning service, which provides cloud-based machine learning solutions.
Curious to see how the offerings compare, I set up accounts with both services and set out to explore several datasets. For fairness, I should note that IBM's Watson Analytics is in public beta, while Microsoft's product is a significantly more mature offering.
Besides relative maturity, the most striking difference between the products is the fundamentally different use cases they address. Microsoft's Azure Machine Learning software is a tool that automates some tasks in the machine learning pipeline, assuming familiarity with basic data science techniques. It presents a user-friendly, graphical drag-and-drop interface to route data through various preprocessing steps and ultimately into a machine learning algorithm. In some ways, this product resembles Pure Data, Miller Puckette's visual programming environment for signal processing and electronic music synthesis, which provides a similar interface for routing audio signals through various compute modules.
Watson, on the other hand, takes after the Jeopardy-playing bot for which it's named. IBM Watson Analytics offers an interface through which data is deposited and plain English questions are asked. A query to Watson's data exploration tool might be "How does income depend on age?" In the title of her recent post on KDnuggets, Ran Bi asks of Watson, "Will it Replace Data Scientists?" In a sense, Watson manifests this ambition more than its Microsoft counterpart. Fortunately for data scientists, that vision is a long way from fruition. It is important to note that neither system develops novel algorithms for a task. While such systems may lower the skill level required to perform out-of-the-box machine learning, it seems unlikely in the foreseeable future that such a service would be useful for tasks where prepackaged solutions are insufficient.
Getting Started with Watson Analytics
Getting through the signup page for IBM's Watson Analytics was buggy, but as the product is in beta this is to be expected. After login, the service presents a well-produced video describing the service. IBM notes that datasets should be uploaded in CSV or MS Excel (.xls) format, are currently restricted to a maximum of 12MB, and can contain no more than 50 columns. These limits would clearly be unacceptable for machine learning at scale, but again, this is a prototype. The restriction to CSV and .xls formats would also seem to preclude working efficiently with sparse data. Superficial limitations aside, it seems that in this iteration Watson is targeted more towards putting basic analytics tools in the hands of everyone than towards providing large-scale cloud-based machine learning for enterprises. To that end, the sleek interface and data visualizations are aesthetically stunning. It is not hard to imagine non-technical business people routinely uploading Excel sheets to perform simple data analysis and create high-quality presentation materials.
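As a rough illustration of the upload limits described above, a local pre-check might look like the sketch below. It covers CSV files only. The function name and the reading of "12MB" as 12 * 1024 * 1024 bytes are assumptions made for illustration, and nothing here reflects how Watson Analytics itself validates uploads.

```python
import csv
import os

# Limits as stated in the article; the exact byte count behind "12MB"
# is an assumption.
MAX_BYTES = 12 * 1024 * 1024
MAX_COLUMNS = 50


def check_upload_limits(path):
    """Return a list of problems found; an empty list means the CSV fits."""
    problems = []

    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_BYTES}-byte limit")

    # Only the first row is read, to count the columns.
    with open(path, newline="") as f:
        first_row = next(csv.reader(f), [])
    if len(first_row) > MAX_COLUMNS:
        problems.append(
            f"{len(first_row)} columns, over the {MAX_COLUMNS}-column limit"
        )

    return problems
```

Calling check_upload_limits on a file before uploading returns an empty list when both limits are met.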
The video introduces the three core components of the product (two of which are implemented): Explore, Predict, and Assemble. Until recently, the "Assemble" feature was called "Author", which is how it appears in the video.
Explore offers a tool with which to interrogate the data. A user can select one of many sample queries or type one in as free text. This would seem to be a very powerful tool, but for now the only queries we could ask were of the form "how does variable x depend on variable y?" If the variable y is categorical, this results in a simple bar graph with two bars. We tried to ask how one continuous variable related to another ("how does shell weight relate to viscera weight?"), but Watson only provided an answer when we conditioned on a categorical variable.
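For readers who want a concrete picture of what a "how does x depend on y?" query with a categorical y boils down to, here is a minimal sketch that averages a numeric column within each category, which is the information such a bar graph displays. The function name and the row format (a list of dictionaries) are assumptions for illustration. This is not a description of Watson's internals.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(rows, category, value):
    """Average the numeric column `value` within each level of `category`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[category]].append(float(row[value]))
    # One number per category: the height of each bar in the chart.
    return {level: mean(values) for level, values in groups.items()}
```

Each key in the returned dictionary corresponds to one bar, and each value to that bar's height.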
Predict is the tool that Watson provides to predict one or more target variables based on the others. This corresponds to classification or regression, depending on whether the target variable is categorical or continuous. Unfortunately, the service crashed whenever we tried to run an experiment, so we could not determine the efficacy of this tool or compare it to the results achieved by Microsoft's classifiers. We hope to rerun these experiments when the offering is more mature and produce such a comparison.
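The distinction between classification and regression can be sketched as a simple check on the target column. The heuristic below is an assumption made for illustration (Watson's actual rule is not documented here): it treats a column whose values all parse as numbers as continuous. It would misjudge categories coded as integers.

```python
def infer_task(target_values):
    """Guess 'regression' for all-numeric targets, else 'classification'."""

    def is_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    # Ignore missing entries before deciding.
    values = [v for v in target_values if v not in ("", None)]
    if values and all(is_number(v) for v in values):
        return "regression"
    return "classification"
```

Under this heuristic, a column of weights would be treated as a regression target and a column of income brackets as a classification target.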
Assemble provides an interface to create "authored workbooks". These contain presentation materials, data visualizations, and reports. As of the video's production, this tab is a placeholder and the feature has not yet been released.
Prior to using any of these features, one must upload a dataset. Watson provides a simple drag-and-drop interface, but we encountered some trouble uploading datasets. We uploaded a few standard UCI sets, including "abalone" and "Adult Census Income". Additionally, one must make sure that the first row of the CSV contains the attribute names, as we found no interface to annotate the columns after uploading. As a result, we had to add the annotations and re-upload.
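Adding the attribute names by hand is easy to script. The sketch below prepends a line of names to a header-less CSV such as the UCI abalone data file. The function name is hypothetical, and the column names are taken from the UCI dataset description rather than from anything Watson requires.

```python
import csv

# Attribute names from the UCI abalone dataset description.
ABALONE_COLUMNS = [
    "Sex", "Length", "Diameter", "Height", "Whole weight",
    "Shucked weight", "Viscera weight", "Shell weight", "Rings",
]


def add_header(src_path, dst_path, column_names):
    """Copy src_path to dst_path with column_names written as the first line."""
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(column_names)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            if len(row) != len(column_names):
                raise ValueError(
                    f"expected {len(column_names)} fields, got {len(row)}"
                )
            writer.writerow(row)
```

The length check is a deliberate choice: a mismatch between the names and the data is cheaper to catch locally than after an upload.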
Getting Started with MS Azure Machine Learning
Like IBM, Microsoft offers a free trial (links to price information for their machine learning service pointed to their general Azure pricing structure). Once logged in, one is presented with a simple blank screen with three tabs for "Experiments", "Web Services", and "Settings".
To begin, one clicks a button to start a new experiment. Azure then presents a blank canvas with a menu of modules along the left side of the screen. A module is represented as a rectangular block with some number (usually zero to two) of incoming ports and outgoing ports. A dataset has no incoming ports and a single outgoing port. Building an experiment starts with selecting a dataset and dragging it onto the canvas. Additional modules can then be dragged onto the canvas and connected to each other with directed edges (from an outgoing port of one module to an incoming port of another). For convenience, Microsoft has pre-populated each account's dataset repository with a large sampling of UCI datasets. Preprocessing modules include missing data scrubbing, feature selection via linear discriminant analysis, duplicate column detection, and more. Supported algorithms include multi-class neural networks, logistic regression, boosted decision trees, support vector machines, and locally deep support vector machines.
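The canvas described above is essentially a directed graph of modules joined at ports. The sketch below models that idea in a few lines. The class names, method names, and module labels are all hypothetical, chosen for illustration. It is not Azure's API. It requires Python 3.9 or later for graphlib.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Module:
    name: str
    inputs: int   # number of incoming ports
    outputs: int  # number of outgoing ports


@dataclass
class Experiment:
    # Each edge is (source module, out port, destination module, in port).
    edges: list = field(default_factory=list)

    def connect(self, src, out_port, dst, in_port):
        """Add a directed edge after checking that both ports exist."""
        if not 0 <= out_port < src.outputs:
            raise ValueError(f"{src.name} has no outgoing port {out_port}")
        if not 0 <= in_port < dst.inputs:
            raise ValueError(f"{dst.name} has no incoming port {in_port}")
        self.edges.append((src, out_port, dst, in_port))

    def run_order(self):
        """Return module names so each module follows the ones feeding it."""
        sorter = TopologicalSorter()
        for src, _, dst, _ in self.edges:
            sorter.add(dst.name, src.name)
        return list(sorter.static_order())


# A dataset has no incoming ports and a single outgoing port.
dataset = Module("Adult Census Income", inputs=0, outputs=1)
scrub = Module("Missing data scrubbing", inputs=1, outputs=1)
model = Module("Boosted decision tree", inputs=1, outputs=1)

experiment = Experiment()
experiment.connect(dataset, 0, scrub, 0)
experiment.connect(scrub, 0, model, 0)
```

Here experiment.run_order() lists the dataset first and the model last, and an attempt to wire anything into the dataset raises ValueError because a dataset has no incoming ports.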
Building and testing a model is fairly simple after perusing a few examples. Overall, Azure is a mature and impressive service. It requires knowledge of the characteristics of machine learning algorithms, and certainly will not develop new ones automatically. But it does provide an environment where machine learning could be used effectively without low-level implementation or algorithmic knowledge. For the savvy end-user data scientist, Azure could potentially eliminate many hours of repetitive work by automating data preprocessing and providing a convenient environment in which to quickly explore data and generate predictions. By integrating with Azure's other products, this product could also potentially ease the path to incorporate machine learning and predictive analytics into the workflow of many companies.
IBM's Watson Analytics and Microsoft's Azure Machine Learning are very different products, albeit with some overlapping features. While IBM seeks to make it possible for anyone at all to interrogate data, the product hasn't fully materialized yet. Azure, on the other hand, sets the more modest goal of wrapping a user-friendly interface around machine learning tasks and integrating machine learning into existing business workflows. The design is clever and impressive. We look forward to seeing how these services evolve and comparing their capabilities on more sophisticated tasks soon.
Zachary Chase Lipton is a PhD student in the Computer Science and Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs.
- Will IBM Watson Analytics Replace Data Scientists?
- IBM Watson’s Next Step: Partnership with Universities
- When Watson Meets Machine Learning
- Why Azure ML is the Next Big Thing for Machine Learning?
- Interview: Joseph Sirosh, Microsoft on How Azure ML is Simplifying Predictive Analytics