
Amazon Machine Learning: Nice and Easy or Overly Simple?


Amazon Machine Learning is a predictive analytics service with binary/multiclass classification and linear regression features. The service is quick to set up and offers a simple workflow, but it lacks model selection features and has slow execution times.

By Alex Perrier, @alexip, (originally published on the Open Data Science blog)

Machine Learning as a Service (MLaaS) promises to put data science within the reach of companies. In that context, Amazon Machine Learning is a predictive analytics service with binary/multiclass classification and linear regression features. The service offers a simple workflow but lacks model selection features and has slow execution times. Its predictive performance is satisfactory.


Data science is hot and sexy, but it is complex. Building and maintaining a data science infrastructure can be expensive. Experienced data scientists are scarce, and developing algorithms in-house, building predictive analytics applications, and creating production-ready APIs all require specific know-how and resources. Even though companies may anticipate the benefits of a data science service, they may not be ready to make the necessary investments without testing the waters first.

This is where Machine Learning-as-a-Service comes in with a promise to simplify and democratize Machine Learning: reap the benefits of Machine Learning within a short timeframe while keeping costs low.

Several key players have entered that field: Google Prediction API, Microsoft Azure Machine Learning, IBM Watson, BigML and many others. Some offer a simplified predictive analytics service while others offer a more specialized interface and data science services beyond prediction.

One relatively new entrant is AWS with its Amazon Machine Learning service. Launched in April 2015, less than a year ago, at the AWS 2015 summit, Amazon Machine Learning aims at simplifying predictive analytics by focusing on the data workflow and keeping the more involved and challenging technical details under the hood. By hiding a large part of the technical details from the user, Amazon Machine Learning brings data science to a much broader audience. It significantly lowers the barrier to entry for companies wishing to experiment with predictive analytics by making powerful Machine Learning tools available and operational in a very short timeframe.

A large portion of the Internet already runs on AWS's many services. AWS's move to add a Machine Learning offering to the mix will allow engineers to build predictive analytics capabilities into their existing applications.

Amazon Machine Learning enables companies to experiment with data science and assess its business value without committing significant resources and investments. In that regard, Amazon Machine Learning is Predictive Analytics 101 for companies wishing to board the data science train.

Pistons, Carburetors and Filters: What’s Under the Hood?

One important trait of Amazon Machine Learning is its simplified approach to Machine Learning. It “dumbs down machine learning for the rest of us [InfoWorld]”; it “puts Machine Learning In Reach Of Any Developer [TechCrunch].”

But predictive analytics is a complex field. Tasks such as data munging, feature engineering, parameter tuning, and model selection take time and follow a well-established set of protocols, methods and techniques. Can Amazon Machine Learning’s simplified service still deliver good performance while hiding so much of this complexity? Can you still reap the benefits of predictive analytics with a simplified Machine Learning pipeline?

1 Model, 1 Algorithm, 3 Different Tasks, Easy Pipeline Setup, Wizards, and Smart Defaults

According to the documentation, Amazon Machine Learning is based on linear models trained via Stochastic Gradient Descent (SGD for short). That’s it. No Random Forests or Boosted trees, no Kernel SVM, Bayes classifiers or Clustering. This may appear to be a drastic limitation. However, Stochastic Gradient Descent, popularized for large-scale learning by Léon Bottou, is a very stable and resilient algorithm. It has been around for a long time, with many improved versions over the years.

This simple predictive setup will most probably be sufficient to address a large portion of real-world business prediction problems. As we will see, it also delivers decent performance.


The Amazon Machine Learning platform gives you a choice of three supervised learning tasks, each with its associated model and loss function:

  • binary classification with logistic regression (logistic loss function + SGD)
  • multiclass classification with multinomial logistic regression (multinomial logistic loss + SGD)
  • and regression with linear regression (squared loss function + SGD)
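To make the three loss functions listed above concrete, here is a minimal Python sketch of each one, computed for a single example. The function names are made up for illustration and this is not Amazon's code. Some texts scale the squared loss by 1/2; the plain square is assumed here.

```python
import math


def logistic_loss(score, label):
    """Logistic loss for one example; score is w.x + b, label is 0 or 1."""
    # log(1 + exp(-z)), with z = score for label 1 and z = -score for label 0
    z = score if label == 1 else -score
    # Numerically stable form that avoids overflow for large |z|
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def multinomial_logistic_loss(scores, label):
    """Softmax cross-entropy for one example; label is an index into scores."""
    m = max(scores)  # subtract the max before exponentiating, for stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]


def squared_loss(prediction, target):
    """Squared error for one regression example."""
    return (prediction - target) ** 2
```

With two classes, the multinomial loss reduces to the binary logistic loss, which is why the first two tasks are so closely related.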

For binary classification, the evaluation metric is the AUC (area under the ROC curve); for multiclass classification, it is the macro-average F1-measure, which averages the F1-measure of each class; and for regression, the RMSE metric is used. Commonly used in information retrieval, the F1-measure is the harmonic mean of precision and recall. It is a robust classification measure that is somewhat insensitive to class imbalance.
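The macro-average F1-measure and the RMSE mentioned above are easy to compute by hand. The sketch below is plain Python written for illustration. The helper names are hypothetical, and a class with no true positives is assumed to score an F1 of 0.

```python
import math


def f1_for_class(y_true, y_pred, cls):
    """F1-measure of one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0  # assumption: no true positives means an F1 of 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted average of the per-class F1-measures."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def rmse(y_true, y_pred):
    """Root mean squared error for regression predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because every class carries the same weight in the average, a rare class that is badly predicted pulls the macro score down as much as a frequent one would.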

Feature Engineering with Recipes

The Amazon Machine Learning pipeline lets you transform your variables with Recipes. Several transformations are available through JSON-formatted instructions: replacing missing values, Cartesian products, binning numeric variables into categorical ones, or forming n-grams for text data.

For instance, here is one of the recipes that was automatically generated to transform numeric values into categorical ones when working on the iris dataset.

  {
    "groups" : {
      "NUMERIC_VARS_QB_50" : "group('sepal_width')",
      "NUMERIC_VARS_QB_20" : "group('petal_width')",
      "NUMERIC_VARS_QB_10" : "group('petal_length','sepal_length')"
    },
    "assignments" : { },
    "outputs" : [ "ALL_CATEGORICAL", "quantile_bin(NUMERIC_VARS_QB_10,10)" ]
  }
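
The quantile_bin(..., 10) instruction in the recipe above replaces each numeric value with the label of one of 10 bins that hold roughly the same number of samples. Here is a rough Python sketch of the idea. It is not Amazon's implementation: the function names, the way cut points are chosen and the "bin_N" labels are all assumptions made for illustration.

```python
from bisect import bisect_right


def quantile_bin_edges(values, n_bins):
    """Return n_bins - 1 cut points so that bins hold roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: a cut point is simply the value found at each i/n_bins rank
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def quantile_bin(value, edges):
    """Map a numeric value to a categorical bin label."""
    return "bin_{}".format(bisect_right(edges, value))


# Usage with made-up values: learn the cut points, then bin a few numbers
petal_lengths = [float(v) for v in range(1, 101)]
edges = quantile_bin_edges(petal_lengths, 10)
labels = [quantile_bin(v, edges) for v in (5.0, 11.0, 100.0)]
```

Once binned, the variable is treated as categorical, which lets a linear model pick up non-linear effects of the original numeric value.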

Training vs Validation Sets

By default, Amazon Machine Learning splits your dataset 70/30 into training and validation sets. Here again, Amazon Machine Learning simplifies rich techniques into very simple and limited choices. Splitting your data into training and validation sets could be done in myriad ways, which Amazon Machine Learning boils down to randomizing the samples or not. You can of course still split your data as you wish outside of Amazon Machine Learning, create a new datasource for a held-out set, and evaluate the performance of your model on this held-out dataset.
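
If you prefer to do the split yourself before uploading the data, the 70/30 logic with or without randomization fits in a few lines. This is a minimal Python sketch. The function name, its parameters and the fixed seed are hypothetical, not part of the service.

```python
import random


def split_train_validation(rows, train_percent=70, shuffle=True, seed=42):
    """Split rows into (training, validation) lists, 70/30 by default."""
    rows = list(rows)  # copy so the caller's data is left untouched
    if shuffle:
        # Randomize the order first; a fixed seed keeps the split reproducible
        random.Random(seed).shuffle(rows)
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


# Sequential split: the first 70% of the rows go to training
train, validation = split_train_validation(range(10), shuffle=False)
```

A sequential split is only safe when the file is not sorted in a meaningful order. If the rows are grouped by class, for example, shuffling first avoids a validation set made of a single class.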

SGD Parameter Tuning

Only a few parameters are available for tuning your model: the number of passes, the regularization type (None, L1, L2), and the regularization parameter. It is not possible to set the learning rate of the algorithm, and no information is given on how this important parameter is set.
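
To see what these parameters control, here is a bare-bones Python sketch of SGD for logistic regression. It exposes the number of passes, the regularization type and the regularization parameter. It illustrates the general technique and is not Amazon's implementation. The function names are hypothetical, and the fixed learning rate of 0.1 is an assumption, since the service does not say how it sets that value.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(rows, labels, passes=10, reg_type=None, reg_param=1e-6,
                 learning_rate=0.1, seed=0):
    """Fit logistic regression with plain SGD; reg_type is None, "L1" or "L2"."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(passes):  # one pass = one sweep over the shuffled data
        rng.shuffle(order)
        for i in order:
            x, y = rows[i], labels[i]
            score = sum(w * v for w, v in zip(weights, x)) + bias
            error = sigmoid(score) - y  # gradient of the logistic loss w.r.t. score
            for j in range(n_features):
                grad = error * x[j]
                if reg_type == "L2":
                    grad += reg_param * weights[j]
                elif reg_type == "L1" and weights[j] != 0.0:
                    # Subgradient of |w|: the sign of the weight
                    grad += reg_param * math.copysign(1.0, weights[j])
                weights[j] -= learning_rate * grad
            bias -= learning_rate * error  # the bias is left unregularized
    return weights, bias


# Toy, made-up data: one feature, classes separated around x = 1.5
rows = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
weights, bias = sgd_logistic(rows, labels, passes=200)
```

More passes let the weights move further from zero, while a larger regularization parameter pulls them back. That is essentially all the tuning the service offers.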