Big Data Bootcamp, Austin: Day 3 Highlights
Highlights from the presentations by Big Data and Analytics leaders/consultants on day 3 of Big Data Bootcamp in Austin.
Big Data Bootcamp was held in Austin from Apr 10 to 12, 2015. It provided a platform for leading experts to share interesting insights into the innovations that are driving success in the world's most successful organizations. Budding data scientists as well as decision makers from a number of companies came together to get a technical overview of the Big Data landscape. The camp was targeted at both technical and non-technical people who want to understand the emerging world of Big Data, with a specific focus on Hadoop, NoSQL and Machine Learning. Through practice on interesting examples, attendees got hands-on experience with popular technologies in the Big Data space such as Hadoop, Spark, HBase and MapReduce.
Highlights from day 3:
Aaron Benz, Data Scientist, Accenture, talked about fast time-series analytics with HBase and R. Aaron started by explaining that HBase is needed when one needs random read/write access to Big Data. HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. At its core, HBase is a map of key-value pairs. It provides persistent, consistent, distributed and sorted storage. The map is multi-dimensional, i.e. each row key can hold many different key-value pairs.
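The "sorted, multi-dimensional, versioned map" idea can be illustrated with a toy in-memory model. This is only a sketch for intuition and not the HBase API: the class name `TinyTable`, its methods, the `sensor#timestamp` row-key layout and the `d:temp` column name are all assumptions made up for this example.

```python
from collections import defaultdict


class TinyTable:
    """Toy in-memory model of an HBase-style table (hypothetical, not the real API)."""

    def __init__(self):
        # row key -> column ("family:qualifier") -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        """Store one version of a cell; older versions are kept."""
        self._rows[row][column][ts] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it does not exist."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start, stop):
        """Yield (row, {column: newest value}) for start <= row < stop, in key order."""
        for row in sorted(self._rows):
            if start <= row < stop:
                yield row, {c: v[max(v)] for c, v in self._rows[row].items()}


# Hypothetical time-series layout: row key = "<sensor id>#<timestamp>"
table = TinyTable()
table.put("sensor1#2015-04-12T10:00", "d:temp", 71.5, ts=1)
table.put("sensor1#2015-04-12T10:00", "d:temp", 72.0, ts=2)  # newer version
table.put("sensor1#2015-04-12T10:01", "d:temp", 72.4, ts=1)
table.put("sensor2#2015-04-12T10:00", "d:temp", 65.0, ts=1)

# A range scan over one sensor's rows relies on the keys being sorted.
sensor1_rows = list(table.scan("sensor1#", "sensor1#~"))
```

Because row keys are kept in sorted order, putting the sensor id first and the timestamp second lets a single range scan return one sensor's readings in time order, which is one reason this kind of store suits time-series lookups.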
The role of HBase within the Hadoop ecosystem involves the coordination and scaling of data storage and processing. He shared certain use cases where HBase comes to the rescue and works like a charm. HBase fulfills typical storage requirements such as fast reads and writes on the fly, the ability to access the data store directly from R, and cost-effective scalability. The Thrift API is needed to connect R to HBase.
Discussing other solutions that can be considered in the Hadoop ecosystem, he mentioned that Hive is the relational data warehouse of the Hadoop framework and does not fulfill the requirements mentioned above. Using the Thrift API to connect to HBase, one can work with HBase in R. Using a code snippet, Aaron showed how easy it is to load R objects into HBase and retrieve data from HBase with rhbase. Additional optimizations can be done in R to get results faster.
Eddie Satterly, CTO, Infochimps, started his talk with a brief discussion on Big Data. He mentioned that machine-generated data is one of the fastest growing, most complex and most valuable segments of big data. The talk focused on various types of Big Data solutions, from open-source to commercial, and the specific selection criteria and profiles of each. As in all technology areas, each solution has its own sweet spots and challenges, whether in CAP theorem trade-offs, ACID compliance, performance or scalability.
He discussed the problems with RDBMS and how sharding helps. Most NoSQL solutions are auto-sharded. Briefly talking about Hadoop and its components, he shed light on what makes it so different and widely usable. At the core of Hadoop lies HDFS, a self-healing, high-bandwidth clustered storage system.
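As a rough illustration of what sharding means, the sketch below spreads keys across several independent stores by hashing each key. It is a minimal example under simple assumptions: the names `shard_for` and `ShardedStore` are hypothetical, each shard is just a Python dict standing in for a separate database node, and real NoSQL systems typically use more elaborate schemes (for example consistent hashing or range partitioning) so that adding a node does not remap most keys.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index with a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per process.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Hypothetical key-value store split across several in-memory shards."""

    def __init__(self, num_shards):
        # Each dict stands in for a separate database node.
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for user_id in ["alice", "bob", "carol", "dave", "erin"]:
    store.put(user_id, {"name": user_id})
```

Each read or write touches only the one shard that owns the key, so load and storage are divided among nodes instead of piling up on a single database server.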
Sqoop is a tool that transfers data between Hadoop and external databases or an EDW; it can import from and export to any JDBC-supported database. Flume is a log file collector. Storm is used for real-time streaming and is made up of topologies of spouts (which accept a stream) and bolts (which do in-stream processing). Similarly, he described Pig, MapReduce and YARN. Regarding NoSQL, he discussed Apache Cassandra and MongoDB.
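The spout-and-bolt structure of a Storm topology can be mimicked in a few lines with Python generators. This is not Storm's API and it runs in a single process rather than across a cluster; the function names and the word-count task are assumptions chosen only to show how a source feeds a chain of in-stream processing steps.

```python
from collections import Counter


def sentence_spout(lines):
    """Spout: the source that emits a stream of tuples (here, text lines)."""
    for line in lines:
        yield line


def split_bolt(stream):
    """Bolt: splits each incoming line into lowercase words."""
    for line in stream:
        for word in line.lower().split():
            yield word


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after every update."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(lines):
    """Wire spout -> split bolt -> count bolt and collect the latest counts."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout(lines))):
        latest[word] = count
    return latest


result = run_topology(["error disk full", "Error network"])
```

Each item flows through the whole chain as soon as it is emitted, rather than waiting for the full input to be collected first, which is the main contrast with a batch job such as MapReduce.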
Ameet Paranjape, Solution Architect, Hortonworks, talked about how to define a strategy to implement Hadoop, build a use case and maximize return. He started his talk by sharing some big data trends and predictions. Traditional systems are under too much pressure as they have constrained schemas, are costly to scale and have difficulty managing new data. Hadoop not only makes handling data economically feasible but also works well with structured, semi-structured and unstructured data.
Ameet recommended reading the Forrester Wave report on Big Data Hadoop Solutions. He shared the following tips for comparing Hadoop vendors:
- Is the solution open or closed source?
- If code is open, who owns the IP?
- What’s available for free and what do you pay for?
- Is the solution substrate agnostic?
- OS support options?
- Partnerships?
- What’s the pricing model?
- Local resources to help?
He shared a blueprint for enterprise Hadoop. Data access from Hadoop can be done through various channels, which can be categorized as batch, interactive and real-time.