Berkeley AMPLab: Algorithms, Machines, and People works on Big Data and Analytics Solutions

Berkeley AMPLab provides a start-up environment within Berkeley for exciting Big Data and Analytics projects, including BLB: Bootstrapping Big Data, CrowdDB - Answering Queries with Crowdsourcing, MLbase: A User-friendly System for Distributed Machine Learning, and Shark: SQL and Rich Analytics at Scale.

Berkeley AMPLab (Algorithms, Machines, and People) is a five-year collaborative effort at UC Berkeley, involving students, researchers, and faculty from a wide swath of computer science and data-intensive application domains to address the Big Data analytics problem.

Along with traditional research funding agencies, AMPLab is sponsored by, and works with, many of the world's leading technology companies and innovative start-ups.

AMPLab has developed an impressive number of projects for Big Data, Analytics, and Data Science, including:

BLB: Bootstrapping Big Data

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving very large datasets, the computation of bootstrap-based quantities can be extremely computationally demanding. As an alternative, we introduce the Bag of Little Bootstraps (BLB), a new procedure which combines features of both the bootstrap and subsampling to obtain a more computationally efficient, though still robust, means of quantifying the quality of estimators.
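The procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the AMPLab implementation: the function names, the subset size of n^0.6, and the default counts of subsets and resamples are assumptions chosen for the example. Each small subset of size b is resampled up to the full size n by drawing counts over its b points, so the estimator only ever touches b distinct values.

```python
import random
import statistics
from collections import Counter


def weighted_mean(values, weights):
    """Mean of `values`, where weights[i] is how often values[i] was drawn."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def blb_standard_error(data, estimator=weighted_mean, num_subsets=5,
                       subset_exponent=0.6, num_resamples=50, seed=0):
    """Estimate the standard error of `estimator` on `data` (a list) with BLB.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** subset_exponent))  # size of each "little" subset
    per_subset_errors = []
    for _ in range(num_subsets):
        # Step 1: draw a small subset without replacement.
        subset = rng.sample(data, b)
        estimates = []
        for _ in range(num_resamples):
            # Step 2: resample n points from the b subset points. Only the
            # count per point is kept. (A production version would draw the
            # multinomial counts directly instead of n individual picks.)
            counts = Counter(rng.choices(range(b), k=n))
            weights = [counts[i] for i in range(b)]
            estimates.append(estimator(subset, weights))
        # Step 3: quality measure for this subset (here, the standard error).
        per_subset_errors.append(statistics.stdev(estimates))
    # Step 4: average the quality measure across subsets.
    return statistics.fmean(per_subset_errors)
```

For a sample mean over n points with unit variance, the returned value should land near 1/sqrt(n), which gives a quick sanity check on the sketch.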

CrowdDB - Answering Queries with Crowdsourcing

CrowdDB uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. It uses SQL both as a language for posing complex queries and as a way to model data. While CrowdDB leverages many aspects of traditional database systems, there are also important differences. Conceptually, a major change is that the traditional closed-world assumption for query processing does not hold for human input.
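To make the closed-world point concrete, here is a toy Python sketch. It is not CrowdDB's interface: the function `query_with_crowd`, the `ask_crowd` callback, and the row layout are all hypothetical. In a closed-world system a missing value is simply absent from the answer; in this sketch a missing value is treated as "not yet known" and is handed to people to fill in.

```python
def query_with_crowd(rows, column, ask_crowd):
    """Return the values of `column`, asking the crowd to fill in gaps.

    `rows` is a list of dicts; `ask_crowd(row, column)` is a hypothetical
    callback standing in for a crowdsourcing task posted to human workers.
    """
    answers = []
    for row in rows:
        if row.get(column) is None:
            # Open world: missing means "unknown so far", so ask a human
            # and store the answer for later queries.
            row[column] = ask_crowd(row, column)
        answers.append(row[column])
    return answers
```

Storing the answer back into the row means later queries do not pay for the same human task twice, which is one reason cost and latency matter far more here than in an ordinary query plan.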

MLbase: A User-friendly System for Distributed Machine Learning

MLbase is a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.
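The first two components, a declarative task plus an optimizer that picks the algorithm, can be illustrated with a deliberately small Python sketch. Nothing here is MLbase's actual API: `do_regression`, the two candidate learners, and the hold-out selection rule are assumptions made for the example. The caller states only what is wanted (a regression model) and the function chooses how.

```python
import statistics


def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate 2: one-dimensional least-squares line."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Hypothetical search space the "optimizer" chooses from.
CANDIDATES = {"mean": fit_mean, "line": fit_line}


def do_regression(xs, ys, holdout_fraction=0.25):
    """Pick the candidate with the lowest hold-out squared error.

    The last `holdout_fraction` of the data is held out without shuffling,
    which keeps the sketch deterministic.
    """
    split = int(len(xs) * (1 - holdout_fraction))
    train_x, train_y = xs[:split], ys[:split]
    test_x, test_y = xs[split:], ys[split:]
    best_name, best_error = None, float("inf")
    for name, fit in CANDIDATES.items():
        model = fit(train_x, train_y)
        error = statistics.fmean(
            (model(x) - y) ** 2 for x, y in zip(test_x, test_y))
        if error < best_error:
            best_name, best_error = name, error
    # Refit the winning algorithm on all of the data.
    return best_name, CANDIDATES[best_name](xs, ys)
```

A real optimizer would search a much larger space of algorithms and parameters and would adapt its choice as more data arrives; the sketch only shows the separation between stating the task and selecting the method.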

Shark: SQL and Rich Analytics at Scale

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query.

This allows Shark to run SQL queries up to 100x faster than Apache Hive, and machine learning programs up to 100x faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.
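One of the extensions mentioned above, column-oriented in-memory storage, is easy to picture with a small Python sketch. This is an illustration of the general idea only, not Shark's storage format; the helper names `to_columns` and `sum_where` are hypothetical. Storing each column as its own array lets an aggregate query scan just the columns it needs instead of every field of every row.

```python
def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into column lists."""
    names = list(rows[0]) if rows else []
    return {name: [row[name] for row in rows] for name in names}


def sum_where(columns, value_column, filter_column, predicate):
    """Roughly: SELECT SUM(value_column) WHERE predicate(filter_column).

    Only the two referenced columns are read; all others are never touched.
    """
    values = columns[value_column]
    filters = columns[filter_column]
    return sum(v for v, f in zip(values, filters) if predicate(f))


# Example data (made up for illustration).
rows = [
    {"city": "Berkeley", "year": 2012, "sales": 10},
    {"city": "Oakland", "year": 2011, "sales": 20},
    {"city": "Berkeley", "year": 2012, "sales": 30},
]
columns = to_columns(rows)
total_2012 = sum_where(columns, "sales", "year", lambda y: y == 2012)
```

In a real engine the per-column arrays would also be typed and compressed, which is where much of the memory and scan-speed benefit comes from.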