Meetup/Webcast Apr 23: Shark Data Analytics Stack on a Hadoop Cluster

Data Science Meetup: "Shark Data Analytics Stack on a Hadoop Cluster", April 23, 2013, 6 pm MT in Denver, CO - Free and open to all. Live webcast for those unable to attend in person.

We look forward to meeting you at this must-attend Big Data Week event on Spark, Shark and Hadoop. Big Data Week is a global platform offering a series of interconnected activities and conversations around the world, covering not only the technology but also the commercial use cases for Big Data. Free and open to all.

Register Now at

If you are unable to attend in person, register and we will email you a live webcast link 2 hours before the start.

University of Colorado Denver - Tuesday, April 23, 2013, 6:00 pm MT. Large auditorium (170-person capacity) with a 20' screen.

Location: CU Denver - North Classroom #1539 - 1200 Larimer Street Denver, CO 80217-3364 - Map:

Data scientists need to be able to access and analyze data quickly and easily. The difference between high-value data science and good data science is increasingly about the ability to analyze larger amounts of data at faster speeds. Speed is decisive in data science: the ability to provide valuable, actionable insights to the client in a timely fashion can mean the difference between a competitive advantage and little or no added value.

One flaw of Hadoop MapReduce is high latency. Given the growing volume, variety and velocity of data, organizations and data scientists require faster analytical platforms. Put simply, speed matters, and Spark gains speed by caching data in memory and optimizing master/node communication.

The Berkeley Data Analytics Stack (BDAS) is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab whose current components include Spark, Shark and Mesos.

Spark is an open source cluster computing system that makes data analytics fast. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce.

Spark is a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100 times thanks to its ability to perform computations in memory. It is a computation engine built on top of the Hadoop Distributed File System (HDFS) that efficiently supports iterative processing (e.g., ML algorithms) and interactive queries.
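
As a rough illustration of why loading data into memory once and querying it repeatedly helps, here is a toy sketch in plain Python. It does not use Spark or its API: the `Dataset` class and the `run_queries` and `demo` functions are hypothetical names made up for this example, and a single in-process list stands in for a cluster's memory. The sketch only counts how often the data has to be re-read from disk.

```python
import os
import tempfile


class Dataset:
    """Toy stand-in (not Spark) for a dataset that can be kept in memory."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0   # how many times the file was read from disk
        self._cached = None   # in-memory copy, filled by cache()

    def _load(self):
        # Read and parse the file from disk (the slow path).
        self.disk_reads += 1
        with open(self.path) as f:
            return [int(line) for line in f if line.strip()]

    def cache(self):
        # Load once and keep the parsed records in memory.
        if self._cached is None:
            self._cached = self._load()
        return self

    def records(self):
        # Serve from memory when cached, otherwise go back to disk.
        return self._cached if self._cached is not None else self._load()

    def query(self, predicate):
        return [r for r in self.records() if predicate(r)]


def run_queries(dataset):
    # Three queries over the same data, as in interactive or iterative work.
    evens = dataset.query(lambda r: r % 2 == 0)
    large = dataset.query(lambda r: r > 90)
    total = sum(dataset.query(lambda r: True))
    return len(evens), len(large), total


def demo():
    # Write a small sample file (the numbers 1..100) to a temporary location.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(str(i) for i in range(1, 101)))
        path = f.name
    try:
        uncached = Dataset(path)
        cached = Dataset(path).cache()
        results_uncached = run_queries(uncached)
        results_cached = run_queries(cached)
        return (results_uncached, results_cached,
                uncached.disk_reads, cached.disk_reads)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

In this sketch both datasets return the same answers, but the uncached one goes back to disk for every query while the cached one reads the file only once. The real speedups quoted above depend on the workload and on whether the data fits in memory.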

Shark is a large-scale data warehouse system that runs on top of Spark. It is a port of Apache Hive onto Spark and is backward-compatible with Hive, so users can run unmodified HiveQL queries on existing Hive warehouses. Shark can answer these queries up to 100 times faster than Hive when the data fits in memory and up to 5-10 times faster when the data is stored on disk, without modification to the data or queries. Shark is also open source as part of BDAS.

Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.

This presentation covers the nuts and bolts of the Spark, Shark and Mesos data analytics stack on a Hadoop cluster. We will demonstrate its capabilities with a data science use case.

For more information and to register, visit