MADlib: Big Data Machine Learning in SQL for Data Scientists
MADlib is open source under a commercially usable BSD license. It supports Postgres and the Pivotal Greenplum DBMS, and provides classification, regression, clustering, topic modeling, and other analytics for Big Data.
The MADlib approach is to leverage commercial practice, academic research, and open-source development to build a product that addresses the analytic challenges of modern business.
Key MADlib architecture principles are:
- Operate on the data locally, in the database. Do not move it between multiple runtime environments unnecessarily.
- Utilize best-of-breed database engines, but separate the machine learning logic from database-specific implementation details.
- Leverage MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
- Keep the implementation open and maintain active ties to ongoing academic research.
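The first principle, operating on the data inside the database, can be illustrated with a minimal sketch. It does not use MADlib itself: it assumes SQLite (via Python's standard sqlite3 module) as a stand-in database, and the function name `fit_line_in_database`, the table `houses`, and its columns are hypothetical. The idea shown is that a simple linear regression can be fitted from a handful of SQL aggregates, so only five numbers leave the database rather than every row.

```python
import sqlite3


def fit_line_in_database(conn, table, x_col, y_col):
    """Fit y = slope * x + intercept using only SQL aggregates.

    The database computes the sums; only five aggregate values are
    returned to Python. Table and column names are treated as trusted
    identifiers in this sketch (they are interpolated into the query).
    """
    query = f"""
        SELECT COUNT(*),
               SUM({x_col}),
               SUM({y_col}),
               SUM({x_col} * {x_col}),
               SUM({x_col} * {y_col})
        FROM {table}
    """
    n, sx, sy, sxx, sxy = conn.execute(query).fetchone()
    if n < 2:
        raise ValueError("need at least two rows to fit a line")
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # Closed-form ordinary least squares for one predictor.
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical example data that lies exactly on price = 2 * size + 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (size REAL, price REAL)")
conn.executemany(
    "INSERT INTO houses VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
)
slope, intercept = fit_line_in_database(conn, "houses", "size", "price")
print(slope, intercept)
```

In a shared-nothing MPP database, aggregates like these can be computed per segment and then combined, which is one reason this style of computation parallelizes well.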
MADlib functionality includes:
- Topic Modeling: like clustering, identifies groups of documents that are similar to each other, but is specialized for text, where it also tries to identify the main themes of those documents.
- Association Rule Mining, also called market basket analysis or frequent itemset mining
- Descriptive statistics
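To make the association rule mining item concrete, here is a minimal sketch of the two measures such methods are built on, support and confidence, restricted to rules between pairs of items. It is a brute-force illustration in plain Python, not MADlib's implementation or API; the function name `pair_rules` and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support, min_confidence):
    """Return rules (lhs, rhs, support, confidence) between item pairs.

    support    = fraction of baskets containing both items
    confidence = baskets containing both / baskets containing lhs
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
rules = pair_rules(baskets, min_support=0.5, min_confidence=0.6)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets, the rule butter -> bread has support 0.5 and confidence 1.0, because every basket containing butter also contains bread. Production systems extend the same idea to larger itemsets and prune the search rather than counting every pair.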
The MADlib software project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal). Today it also includes researchers from Stanford and the University of Florida.
Learn more and download at
Mayur Rustagi on LinkedIn also suggested a related project: MLlib, a machine learning library developed on top of Apache Spark. It leverages Spark's in-memory capabilities for the iterative processing often required in machine learning and graph processing.