MADlib: Big Data Machine Learning in SQL for Data Scientists

MADlib is an open-source library released under a commercially usable BSD license. It supports Postgres and the Pivotal Greenplum DBMS, and provides classification, regression, clustering, topic modeling, and other analytics for Big Data.




MADlib's approach is to draw on commercial practice, academic research, and open-source development to build a product that addresses the analytic challenges of modern business.

Key MADlib architecture principles are:

  • Operate on the data locally, in the database. Do not move it between multiple runtime environments unnecessarily.
  • Use best-of-breed database engines, but separate the machine learning logic from database-specific implementation details.
  • Leverage MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
  • Keep the implementation open, maintaining active ties to ongoing academic research.
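
The first principle, keeping the computation next to the data, can be illustrated with a small sketch. It is not MADlib code and does not use MADlib's API: it assumes Python's built-in sqlite3 module as a stand-in database, and the table `houses`, its columns, and the helper `fit_line_in_database` are hypothetical names chosen for illustration. The database computes five aggregates in a single pass, and only those five numbers leave it, which is enough to fit a simple linear regression.

```python
import sqlite3


def fit_line_in_database(conn, table, x_col, y_col):
    """Fit y = intercept + slope * x using only SQL aggregates.

    Hypothetical helper for illustration. Table and column names are
    interpolated into the query, so pass trusted identifiers only.
    """
    query = f"""
        SELECT COUNT(*),
               SUM({x_col}),
               SUM({y_col}),
               SUM({x_col} * {x_col}),
               SUM({x_col} * {y_col})
        FROM {table}
    """
    # Only these five numbers are moved out of the database.
    n, sx, sy, sxx, sxy = conn.execute(query).fetchone()

    # Closed-form least-squares solution for one predictor.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Toy data following price = 2 * size + 3 (made up for the example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (size REAL, price REAL)")
conn.executemany(
    "INSERT INTO houses VALUES (?, ?)",
    [(1.0, 5.0), (2.0, 7.0), (3.0, 9.0), (4.0, 11.0)],
)

slope, intercept = fit_line_in_database(conn, "houses", "size", "price")
print(slope, intercept)
```

Because sums can be computed independently on each data partition and then added together, the same pattern fits a shared-nothing MPP engine: each segment aggregates its own rows and only the partial sums are combined.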

MADlib functionality includes:

  • Classification
  • Regression
  • Clustering
  • Topic Modeling: like clustering, it groups documents that are similar to each other, but it is specialized for text and also tries to identify the main themes of those documents.
  • Association Rule Mining, also called market basket analysis or frequent itemset mining
  • Descriptive statistics
  • Validation
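
As a rough illustration of the association rule mining item in the list above, the sketch below computes the two standard measures for a single rule, support and confidence, with plain SQL. It is not MADlib's implementation or API: it assumes Python's built-in sqlite3 module as a stand-in database, and the `baskets` table, its contents, and the `rule_metrics` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (basket_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [
        (1, "bread"), (1, "butter"),
        (2, "bread"), (2, "butter"), (2, "milk"),
        (3, "bread"),
        (4, "milk"),
    ],
)


def rule_metrics(conn, lhs, rhs):
    """Return (support, confidence) for the rule lhs -> rhs."""
    total = conn.execute(
        "SELECT COUNT(DISTINCT basket_id) FROM baskets"
    ).fetchone()[0]

    # Baskets that contain the left-hand item.
    lhs_count = conn.execute(
        "SELECT COUNT(DISTINCT basket_id) FROM baskets WHERE item = ?",
        (lhs,),
    ).fetchone()[0]

    # Baskets that contain both items, found with a self-join.
    both = conn.execute(
        """
        SELECT COUNT(DISTINCT a.basket_id)
        FROM baskets a
        JOIN baskets b ON a.basket_id = b.basket_id
        WHERE a.item = ? AND b.item = ?
        """,
        (lhs, rhs),
    ).fetchone()[0]

    support = both / total          # share of all baskets with both items
    confidence = both / lhs_count   # share of lhs baskets that also have rhs
    return support, confidence


support, confidence = rule_metrics(conn, "bread", "butter")
print(support, confidence)
```

A full frequent itemset miner would enumerate candidate itemsets rather than score one rule at a time; this sketch only shows how the underlying counts map onto SQL.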

The MADlib software project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal). Today it also includes researchers from Stanford and the University of Florida.

Learn more and download at

madlib.net/

Mayur Rustagi on LinkedIn also suggested a related project, MLlib, a machine learning library developed on top of Apache Spark. It leverages Spark's in-memory capabilities for the iterative processing often required in machine learning and graph processing.