Big Data Bootcamp, Austin: Day 1 Highlights

Highlights from the presentations by Big Data and Analytics leaders/consultants on day 1 of Big Data Bootcamp 2015 in Austin.

Big Data Bootcamp was held in Austin during April 10-12, 2015. It provided a platform for leading experts to share insights into the innovations driving success at the world's most successful organizations. Budding data scientists as well as decision makers from a number of companies came together to get a technical overview of the Big Data landscape. The camp was targeted at both technical and non-technical people who want to understand the emerging world of Big Data, with a specific focus on Hadoop, NoSQL, and Machine Learning. By working through practical examples, attendees got hands-on experience with popular technologies in the Big Data space such as Hadoop, Spark, HBase, and MapReduce.

Highlights from day 1:

Srini Penchikala, Lead Editor for NoSQL, kicked off the first day of the conference with a talk on distributed data computing platforms. On data storage, he noted that, per the CAP theorem, a distributed data store cannot simultaneously guarantee all three of the following properties: Consistency, Availability, and Partition Tolerance.
Relational databases provide consistency as well as availability. NoSQL databases, however, provide either consistency and partition tolerance (e.g., MongoDB, HBase, Redis) or availability and partition tolerance (e.g., CouchDB, Cassandra, DynamoDB, Riak). He also shared some interesting use cases of data management techniques.
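The classification above can be summarized as a toy lookup table. This is only an illustrative sketch of the groupings mentioned in the talk (the helper `favors` and the entries are my own, including PostgreSQL as a stand-in for relational databases), not a rigorous CAP analysis:

```python
# Toy summary of the CAP groupings described above.
# Values are the two CAP properties each system is commonly
# said to favor (C = consistency, A = availability,
# P = partition tolerance).
CAP_CLASSIFICATION = {
    "MongoDB": ("C", "P"),
    "HBase": ("C", "P"),
    "Redis": ("C", "P"),
    "CouchDB": ("A", "P"),
    "Cassandra": ("A", "P"),
    "DynamoDB": ("A", "P"),
    "Riak": ("A", "P"),
    # Relational databases favor consistency and availability.
    "PostgreSQL": ("C", "A"),
}

def favors(db, prop):
    """Return True if `db` is classified here as favoring CAP property `prop`."""
    return prop in CAP_CLASSIFICATION.get(db, ())

print(favors("Cassandra", "A"))  # True
print(favors("HBase", "A"))      # False
```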

After giving a brief overview of some popular NoSQL databases on the market, he talked about in-memory data grids, which store keys and values as objects and do not enforce the constraints of rigid SQL schemas. Use cases for in-memory data grids include trading systems, online gaming, and session data caching. In-memory data grids currently available on the market include Hazelcast, MemSQL, SAP HANA, Terracotta's BigMemory, and Oracle Coherence.
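The key-object model behind these grids can be sketched in a few lines. This is a minimal, single-node toy (class name, API, and TTL behavior are my own invention for illustration); real grids like Hazelcast partition and replicate the data across a cluster:

```python
import time

class InMemoryGrid:
    """Toy single-node key-value store: keys map to arbitrary
    Python objects with no rigid schema, plus an optional
    time-to-live — the pattern used for session data caching."""

    def __init__(self):
        self._store = {}

    def put(self, key, value, ttl=None):
        # Store the object with an optional absolute expiry time.
        expires = time.time() + ttl if ttl else None
        self._store[key] = (value, expires)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.time() > expires:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

grid = InMemoryGrid()
grid.put("session:42", {"user": "alice", "cart": ["book"]})
print(grid.get("session:42"))  # {'user': 'alice', 'cart': ['book']}
```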

He explained important concepts and technologies for large-scale processing such as MapReduce, Hadoop, Resilient Distributed Datasets (RDDs), and Spark. While explaining the Spark ecosystem, he stated that Apache Spark unifies batch, streaming, and interactive computing. By supporting iterative, graph-parallel algorithms, Spark makes it easy to build sophisticated applications. DataFrames, newly released in Spark v1.3, are distributed collections of data organized into named columns, equivalent to a table in an RDBMS or a data frame in R/Python. He also talked about Tachyon, a memory-centric distributed file system, and how it provides reliable file sharing at memory speed across cluster frameworks.
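The MapReduce model he covered can be illustrated with the classic word-count example. The sketch below runs the map, shuffle, and reduce phases in a single Python process purely to show the data flow; Hadoop and Spark distribute exactly these phases across a cluster:

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data bootcamp", "big data austin"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'bootcamp': 1, 'austin': 1}
```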
Regarding streaming data, he mentioned that it is becoming a first-class citizen for data-driven companies: such data is continuously processed and transformed to derive new data feeds. He briefly described how Spark Streaming and Apache Kafka work, and concluded by talking about the resource management tools YARN and Mesos. Srini also gave a hands-on tutorial on Spark Core, Scala, Spark Streaming, and Spark GraphX.
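Spark Streaming's core idea — discretizing a continuous stream into micro-batches and maintaining state across them — can be mimicked in plain Python. This is only a single-process sketch of the concept (the function names and the running-count example are my own), not the Spark Streaming API:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an (in principle unbounded) iterator into fixed-size
    micro-batches, mimicking how Spark Streaming discretizes a
    live stream into small RDDs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# A stateful transformation: a running count of event types,
# updated once per micro-batch.
events = ["click", "view", "click", "click", "view", "click"]
totals = {}
for batch in micro_batches(events, 2):
    for event in batch:
        totals[event] = totals.get(event, 0) + 1
print(totals)  # {'click': 4, 'view': 2}
```

In a real deployment the events would arrive from a source such as a Kafka topic rather than an in-memory list.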

Kimberly Wilkins, Senior DBA and Database Denizen at ObjectRocket, delivered a talk titled "A Primer on NoSQL Scaling: Emphasis on MongoDB". She started by defining the term "scalability" and the ways to achieve it: scaling vertically (up) versus scaling horizontally (out), noting that scaling up costs far more than scaling out.
The growth of NoSQL database technology accounts for a significant chunk of the recent data explosion. NoSQL is growing rapidly because it is faster, allows more flexible development, and involves low software and deployment costs. She mentioned that MongoDB is the fastest-growing database and ranked #4 overall. It has built-in sharding for scaling out, with each shard typically deployed as a replica set. She described sharded clusters and how sharding provides horizontal scaling, then walked through the sharding process step by step: selecting the best shard key, creating the required sharding index, enabling sharding at the database level, and sharding the collection. She also shared some DBA tips for managing at scale.
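The role of the shard key she emphasized can be sketched with a toy router. The snippet below hashes a shard key value to pick a shard — the same idea behind MongoDB's hashed shard keys — but the shard names, routing function, and modulo placement are my own simplification; MongoDB actually assigns hash ranges to chunks and balances them across shards:

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]

def route(shard_key_value, shards=SHARDS):
    """Deterministically pick a shard for a document by hashing
    its shard key value (toy version of hashed sharding)."""
    digest = hashlib.md5(str(shard_key_value).encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Documents with the same shard key value always land on the
# same shard, so a good key spreads values evenly.
docs = [{"_id": i, "user_id": f"user{i}"} for i in range(6)]
placement = {d["_id"]: route(d["user_id"]) for d in docs}
print(placement)
```

This also shows why shard key choice matters: a low-cardinality or skewed key would pile most documents onto one shard and defeat horizontal scaling.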
Highlights from day 2