SFBayACM Jan 25: Analytics at Petabyte Scale

Speakers from Cloudera and Facebook present two talks: "Hadoop: Distributed Data Processing" and "Facebook's Petabyte Scale Data Warehouse Using Hive and Hadoop"

Jan 25, 2010: SFBayACM.org, Data Mining SIG. "Analytics at Petabyte Scale: Cloudera and Facebook on Hadoop and Hive," held at LinkedIn in Mountain View, CA. Presented by Amr Awadallah, CTO of Cloudera, and Ashish Thusoo of Facebook (Hive project leader at Facebook and Apache).

Cost: Free and open to all who wish to attend. Membership is only $20/year. Anyone may join our mailing list at no charge and receive announcements of upcoming events.

Speakers: Amr Awadallah, Cloudera, and Ashish Thusoo, Facebook

TITLE 1: "Hadoop: Distributed Data Processing"

Hadoop is an open-source distributed platform designed to economically store and process data using clustered commodity hardware. Hadoop is Apache's implementation of the MapReduce/GFS frameworks popularized by Google. In this talk we will demystify this powerful platform, and describe how it enables you to consolidate many different data storage and processing needs in an economically scalable cloud resource.

TITLE 2: "Facebook's Petabyte Scale Data Warehouse Using Hive and Hadoop"

Hive is an open-source, petabyte-scale data warehousing framework built on top of Hadoop that enables scalable analytics on large data sets using SQL and some language extensions. Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. This talk will highlight how Hive and Hadoop allow us at Facebook to offer a cheap, scalable, and flexible infrastructure for different kinds of analysis. We will talk about the architecture, applications, and capabilities of this infrastructure, which handles close to 8,000 jobs a day and stores nearly 2.5 PB of compressed data.