Big Data BootCamp Santa Clara: Highlights of talks on Days 1-2

Highlights from the presentations by big data technology practitioners from Caspida, Datastax, ElephantScale, Hortonworks, MapR and Qubole at Big Data Bootcamp 2014 in Santa Clara.

Big Data Bootcamp 2014 (Apr 23-25, 2014) was organized by Global Big Data Conference at the Santa Clara Convention Center in Santa Clara, CA. This three-day, fast-paced, vendor-agnostic bootcamp gave attendees a comprehensive technical overview of the Big Data landscape. It was targeted at both technical and non-technical people who want to understand the emerging world of Big Data, with a specific focus on Hadoop, NoSQL & Machine Learning. It brought together data science experts from industry for three days of insightful presentations, hands-on learning and networking, covering a wide range of topics including Hadoop, MapReduce, Amazon EC2, Cassandra, YARN, Pig, various use cases and much more.

Despite the great quality of the content and speakers, it is hard to absorb all the information during the bootcamp itself. KDnuggets helps by summarizing the key insights from all the sessions. These concise, takeaway-oriented summaries are designed both for people who attended the bootcamp but would like to revisit the key sessions for a deeper understanding, and for those who could not attend. If you find a session interesting, check KDnuggets, where we will soon publish exclusive interviews with some of these speakers.

Here are highlights from selected talks on day 1 (Wed Apr 23):

Karthik Kannan, Founder & CMO, Caspida, gave an in-depth explanation of the Hadoop ecosystem, describing its various technologies and platforms. He noted that algorithms (machine learning or statistical), databases (NoSQL, columnar, in-memory) and packages & programming languages (R, Java, etc.) are equally important, and he dissected the "Hadoop stack" into its layers and components, describing each in detail. The top fields where Hadoop runs in a virtualized infrastructure are security, advertising, eCommerce and customer experience management. Machine learning is being used aggressively in industry to predict user behavior and serve advertisements and recommendations accordingly.

Sridhar Reddy, Director, MapR Technologies, gave a talk introducing Apache HBase and MapR Tables. He started by describing the evolution from RDBMS to NoSQL to column-family databases, discussing the key differences between these technologies and the motivation behind the evolution. Introducing HBase as a distributed, column-oriented database built on top of HDFS, he explained the HBase data model: row keys, columns and cells. After briefly explaining data storage and cell versioning, he noted that an HBase table is essentially a sorted map of rows. He also presented an in-depth view of the HBase architecture, covering its components one by one.
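The "sorted map of rows" idea can be made concrete with a toy model. This is a hedged sketch in plain Python, not MapR or HBase client code; the class and method names are illustrative:

```python
# Toy model of the HBase data model described above: a table is a sorted
# map of row keys, where each row maps column-family:qualifier pairs to
# timestamped cell versions.
from bisect import insort

class TinyHBaseTable:
    """Rows -> {column -> [(timestamp, value), ...]} (illustrative only)."""
    def __init__(self):
        self.rows = {}  # row_key -> {column -> sorted list of versions}

    def put(self, row_key, column, value, ts):
        versions = self.rows.setdefault(row_key, {}).setdefault(column, [])
        insort(versions, (ts, value))      # keep cell versions ordered by time

    def get(self, row_key, column):
        """Return the most recent version of a cell."""
        versions = self.rows.get(row_key, {}).get(column, [])
        return versions[-1][1] if versions else None

    def scan(self, start_row, stop_row):
        """Rows come back in sorted row-key order, as in an HBase scan."""
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, self.rows[key]

t = TinyHBaseTable()
t.put("user#001", "info:name", "Ada", ts=1)
t.put("user#001", "info:name", "Ada L.", ts=2)   # newer cell version
t.put("user#002", "info:name", "Alan", ts=1)
print(t.get("user#001", "info:name"))             # latest version wins
print([k for k, _ in t.scan("user#000", "user#999")])
```

Sorted row keys are what make range scans cheap, which is why row-key design dominates HBase schema discussions.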

Quickly covering schema design, he moved on to some interesting use cases. One of them was Facebook messaging, which is backed by HBase and combines communication from email, SMS, Facebook Chat and the Inbox, with over 2 PB of data stored in HBase. The key features for which Facebook chose HBase are horizontal scalability, automatic failover and a simpler consistency model. At the end, he encouraged attendees to get started with HBase using the MapR Sandbox, describing how easy it is to use.

Here are highlights from selected talks on day 2 (Thurs Apr 24):

Albert Tobey, Senior Systems Administrator, Datastax, started Day 2 with a workshop on Cassandra. He began by showing how traditional solutions are no longer a good fit for today's scale and availability requirements, which is how Cassandra emerged as a massively scalable open-source NoSQL database. In Cassandra, all nodes participate in a cluster, and any node can be added or removed as needed. Cassandra delivers continuous availability, linear scalability and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times. Cassandra also provides built-in, customizable replication, which stores redundant copies of data across the nodes that participate in a Cassandra ring. This means that if any node in a cluster goes down, one or more copies of that node's data are available on other machines in the cluster. Replication can be configured to work across one data center, many data centers, or multiple cloud availability zones.
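The ring replication described above can be sketched in a few lines. This is an illustrative model under simplifying assumptions (a tiny token space, MD5 as a stand-in partitioner), not the real Cassandra partitioner or driver:

```python
# Sketch of Cassandra-style ring replication: each node owns a token, a
# row's partition key hashes to a ring position, and the next RF distinct
# nodes clockwise hold the replicas.
import hashlib

def token(key: str) -> int:
    # stand-in for Cassandra's partitioner, mapped onto a 0-99 ring
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100

def replicas(key, ring, rf):
    """ring: sorted list of (token, node). Returns the rf replica nodes."""
    t = token(key)
    # find the first node whose token >= t, wrapping around the ring
    start = next((i for i, (tok, _) in enumerate(ring) if tok >= t), 0)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

ring = sorted([(10, "node-A"), (40, "node-B"), (70, "node-C"), (95, "node-D")])
print(replicas("user:42", ring, rf=3))
```

With a replication factor of 3 on this four-node ring, any single node can go down and two copies of its data remain reachable, which is the availability property the talk emphasized.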

Vinod Kumar from Hortonworks gave a talk on YARN and Apache Hadoop 2.0, describing the migration from 1.0 to 2.0. Presenting Apache Hadoop 2 as the next-generation Big Data platform, he explained YARN in detail, as it is one of the crucial developments in the newer version. He described the key benefits YARN brings, including scalability, agility and improved cluster utilization. Hadoop 2.0 is a multi-purpose platform supporting a diverse range of applications: batch, interactive, online, streaming, and more.

Migrating to the latest version of Hadoop offers a big ROI, as throughput roughly doubles on the same hardware. He then quickly went through the steps administrators need to migrate their clusters to Hadoop 2.x, and shared details on how users can migrate their applications. The latest version, Apache Hadoop 2.3, was released on February 24, 2014, introducing a number of alpha features in YARN along with bug fixes and enhancements. Finally, he mentioned that Apache Hadoop 2.4 would be released very soon and highlighted its new features.

Sujee Maniyam, Founder and Principal at ElephantScale, talked about breaking into Big Data. He started by claiming that in 2014 Hadoop has begun playing a large role in the enterprise, with many tools and features driven by enterprise demands under development. He emphasized that mastering Big Data takes strong skills and that experience counts a lot. The skill set required of a data scientist includes R, Python, statistics, math, domain knowledge and big data tools. He suggested the audience start with self-learning, as ample good resources are available online; the main challenge is that learning alone takes a lot of time. Networking and familiarity with open-source solutions are very important and significantly increase the chances of finding a good job.

Ashish Dubey, Solution Architect at Qubole, delivered a workshop on Hive. Introducing Hive as SQL on Hadoop, he explained it as a system for managing and querying unstructured data as if it were structured. Hive uses MapReduce for execution and HDFS for storage, and its key design principles are extensibility, interoperability and performance. With exponentially increasing data and the widening acceptance of Hadoop, Hive emerged as a solution that provides simplicity and makes MapReduce easy to program.
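To see why compiling SQL onto MapReduce is such a simplification, consider what a query like `SELECT word, COUNT(*) FROM docs GROUP BY word` turns into: a map phase emitting `(word, 1)` pairs and a reduce phase summing per key. A minimal sketch of that translation (function names here are illustrative, not Hive internals):

```python
# GROUP BY as map/reduce: the pattern Hive generates for the programmer.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # map: emit a (key, 1) pair per word, like SELECT word ... GROUP BY word
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # the shuffle/sort between phases groups equal keys together, as Hadoop does
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, sum(v for _, v in group)   # reduce: COUNT(*) per key

docs = ["big data", "big hadoop", "data"]
print(dict(reduce_phase(map_phase(docs))))   # {'big': 2, 'data': 2, 'hadoop': 1}
```

Writing the equivalent Java MapReduce job takes an order of magnitude more code, which is exactly the gap Hive was built to close.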

Developed at Facebook, Hive also has tables analogous to those in relational databases. Ashish then explained various features Hive offers, such as Sort By, custom mappers/reducers, Cluster By and dynamic partitioning. He also highlighted Hive's current limitations: it is a batch processing model, it is not great for real-time transactional queries, and it has no support for updates (though partitions help). At the end, he talked about two in-memory solutions: Presto and Impala.
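The reason partitions help is that Hive lays each partition out as its own directory, so a query filtering on the partition column only reads the matching directories instead of the full table. A sketch of that layout under assumed paths (the table root and column names here are hypothetical, not a real warehouse):

```python
# Model of Hive dynamic partitioning: each distinct value of the partition
# column gets its own directory, e.g. /warehouse/logs/dt=2014-04-23/.
def partition_path(table_root, row, partition_col):
    return f"{table_root}/{partition_col}={row[partition_col]}"

def write_dynamic_partitions(rows, table_root, partition_col):
    """Route each row to its partition directory, as dynamic partitioning does."""
    layout = {}
    for row in rows:
        path = partition_path(table_root, row, partition_col)
        layout.setdefault(path, []).append(row)
    return layout

rows = [
    {"dt": "2014-04-23", "event": "login"},
    {"dt": "2014-04-24", "event": "click"},
    {"dt": "2014-04-23", "event": "logout"},
]
layout = write_dynamic_partitions(rows, "/warehouse/logs", "dt")
# a query with WHERE dt = '2014-04-23' would scan only that one directory
print(sorted(layout))
```

Partition pruning is also how Hive sidesteps its lack of row-level updates: rewriting a single day's partition is far cheaper than rewriting the whole table.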

Next part: Highlights of talks on Day 3