MLlib: Apache Spark component for machine learning

MLlib, the machine learning component of Apache Spark, has developed into a tool that supports many common machine learning algorithms and now comes with more mature documentation and a stable API.



By Daniel D. Gutierrez July 2014

The Hadoop Summit 2014 in San Jose (June 3-5) brought many innovations to the Hadoop ecosystem, but the one I was most eager to hear about was what was happening with the MLlib component of Apache Spark. Spark 1.0.0 was released just before the conference on May 30 (and a new 1.0.1 release found its way out on July 11).

In case you’re not familiar with Spark and MLlib, let me get you quickly up to speed. Spark is a distributed in-memory computation framework, and the project is almost a year old. Apache Spark provides primitives for in-memory cluster computing which is well suited for large-scale machine learning purposes. MLlib is a Spark component focusing on machine learning, with many developers now creating practical machine learning pipelines with MLlib. It became a standard component of Spark in version 0.8 (Sep 2013). The initial contribution for the Spark subproject was from UC Berkeley AMPLab. Due to the rapid adoption of Spark, MLlib has received more and more attention and contributions from the open source machine learning community. At this time, 50+ developers from the open source community have contributed to its codebase. MLlib has features for classification, regression, collaborative filtering, clustering, and decomposition (SVD and PCA).

MLlib, Apache Spark component for machine learning

What’s New in MLlib

With the release of Spark 1.0, there are some exciting new features in MLlib. Here is a run-down of all the available machine learning functionality in Spark v0.8 and what’s new in v1.0. Note that the new decision trees feature includes support for both continuous and categorical features variables.

Classification – v0.8: logistic regression, linear support vector machines (SVM), and new in v1.0: naïve Bayes, decision trees.
Regression – v0.8: linear regression, and new in v1.0: regression trees.
Collaborative Filtering – v0.8: alternating least squares (ALS).
Clustering – v0.8: k-means for unsupervised machine learning applications.
Optimization – v0.8: stochastic gradient descent (SGD), and new in 1.0: limited-memory BFGS (L-BFGS). This is a limited memory version of an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. It is a popular algorithm for parameter estimation in machine learning.
Dimensionality Reduction – new in v1.0: singular value decomposition (SVD), and principal component analysis (PCA).

Here’s what else is new with MLlib for Spark:

New user guide and code samples – The new release has improved organization, and the code examples are useful as templates for standalone applications. The \examples folder has examples for various algorithms along with sample data sets.
API stability – Following Spark core, MLlib is guaranteeing binary compatibility for all 1.x releases on stable API. For changes in experimental and developer APIs, a migration guide between releases will be provided. There is now availability of unified API docs for various Spark components.
Sparse data support - MLlib now includes full support for sparse data in Scala, Java, and Python (previous versions only supported it in specific algorithms like alternating least squares). It takes advantage of sparse features in both storage and computation in methods including SVM, logistic regression, Lasso, naive Bayes, k-means, and summary statistics.
Distributed matrices – A distributed matrix has long-typed row and column indices and double-typed values, stored in a distributive manner in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Spark v1.0 implements three types of distributed matrices in this release and will add more types in the future – RowMatrix, IndexedRowMatrix, and CoordinateMatrix.

The Future with MLlib v1.1

Spark has a 3-month release cycle with a cut-off date of July 25 for new features. Here’s what is in the works for the next release:
  • Standardized interfaces
  • Train multiple models in parallel
  • Statistical analysis
  • Learning algorithms
  • Optimization algorithms


Conclusion

Most everyone who has taken a look at MLlib expects it to continue to evolve quickly. There was much banter thrown around at the Hadoop Summit regarding what might be beyond MLlib v1.1. Here are some areas that may receive attention moving forward: scalable implementations of well-known and well-understood machine learning algorithms, user friendly documentation and consistent APIs, better integration with Spark SQL, Streaming, and GraphX, addressing practical machine learning pipelines. If only a fraction of these areas come to fruition, the future of MLlib is destined to be bright.

Daniel D. Gutierrez is a Los Angeles–based data scientist working for a broad range of clients through his consultancy AMULET Analytics. He is also a well-recognized Big Data journalist.

Related: