MADlib: Big Data Machine Learning in SQL for Data Scientists
MADlib is an open-source library released under a commercially usable BSD license. It supports Postgres and the Pivotal Greenplum DBMS, and provides classification, regression, clustering, topic modeling, and other analytics for Big Data.
The MADlib approach is to combine the efforts of commercial practice, academic research, and open-source development in a product that addresses the analytic challenges of modern business.
Key MADlib architecture principles are:
- Operate on the data locally, in the database. Do not move it between multiple runtime environments unnecessarily.
- Use best-of-breed database engines, but separate the machine learning logic from database-specific implementation details.
- Leverage MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
- Keep the implementation open and maintain active ties to ongoing academic research.
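The first principle, operating on the data inside the database, can be illustrated with a minimal sketch. It does not use MADlib or its API. It assumes an in-memory SQLite database as a stand-in for Postgres or Greenplum, and a hypothetical table `points` with columns `x` and `y`. The database computes the aggregates needed for a simple least-squares line, so only a single row of summary statistics leaves the engine instead of the full data set.

```python
import sqlite3


def fit_line_in_db(conn):
    """Fit y = slope * x + intercept using aggregates computed by the database."""
    # Only one row of sufficient statistics is transferred to the client.
    n, sx, sy, sxx, sxy = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM points"
    ).fetchone()
    # Closed-form solution of simple linear regression from the sums.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# In-memory SQLite stands in for a real database server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
# Hypothetical data lying exactly on the line y = 2x + 1.
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(float(i), 2.0 * i + 1.0) for i in range(10)],
)

slope, intercept = fit_line_in_db(conn)
print(slope, intercept)
```

Because the sums are ordinary SQL aggregates, an MPP engine can compute partial sums on each segment and combine them, which is the kind of parallelism the third principle refers to.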
MADlib functionality includes:
- Classification
- Regression
- Clustering
- Topic Modeling: like clustering, it attempts to identify groups of documents that are similar to each other, but it is specialized for the text domain, where it also tries to identify the main themes of those documents.
- Association Rule Mining, also called market basket analysis or frequent itemset mining
- Descriptive statistics
- Validation
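To make one item in the list above concrete, here is a minimal sketch of the two measures that association rule mining is built on, support and confidence, computed for item pairs. It is plain Python and is not MADlib's implementation. The baskets and the function name `pair_rules` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets):
    """Return {(lhs, rhs): (support, confidence)} for every co-occurring item pair."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        # Sort so that each unordered pair is always counted under the same key.
        pair_counts.update(combinations(sorted(basket), 2))

    n = len(baskets)
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        # Confidence of lhs -> rhs: share of baskets with lhs that also hold rhs.
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rules = pair_rules(baskets)
for (lhs, rhs), (support, confidence) in sorted(rules.items()):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

A full frequent itemset miner would also consider itemsets larger than pairs and prune candidates below a minimum support threshold; this sketch omits both for brevity.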
The MADlib software project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal). Today it also includes researchers from Stanford and the University of Florida.
Learn more and download at
Mayur Rustagi on LinkedIn also suggested a related project: MLlib, a machine learning library built on top of Apache Spark. It leverages Spark's in-memory capabilities for the iterative processing often required in machine learning and graph processing.