Big Data Bootcamp, Austin: Day 2 Highlights

Highlights from the presentations by Big Data and Analytics leaders/consultants on day 2 of Big Data Bootcamp in Austin.

Big Data Bootcamp was held in Austin on April 10-12, 2015. It provided a platform for leading experts to share insights into the innovations that are driving success in the world's most successful organizations. Budding data scientists and decision makers from a number of companies came together to get a technical overview of the Big Data landscape. The camp was aimed at both technical and non-technical people who want to understand the emerging world of Big Data, with a specific focus on Hadoop, NoSQL, and Machine Learning. Through practice on worked examples, attendees got hands-on experience with popular Big Data technologies such as Hadoop, Spark, HBase, and MapReduce.

Highlights from day 1

Highlights from day 2:

Ajay Bhargava, CEO, Analytics Advisory Group, delivered a talk titled "Career in Analytics & Big Data - An Analytical Approach to 'tasting before eating'". He started by sharing definitions of analytics from different sources and summarized it as the method of logical analysis. He described Big Data as a volume of data so large that it becomes challenging to capture, store, search, analyze, and visualize using traditional data management tools. He briefly explained the 5 Vs of Big Data: Volume, Value, Variety, Velocity, and Veracity. He discussed how harnessing data differs from harvesting value along dimensions such as data, process, and technology. He mentioned that the analytical approach is cyclical, iterative, and outcome-driven, and that it involves continuous improvement. He shared a very interesting slide (given below) that describes a process for checking whether a career in analytics is right for you.

Jonathan Ellis, CTO, DataStax and Project Chair, Cassandra, delivered a keynote on Cassandra architecture and roadmap. He explained that Cassandra is fully distributed and has no single point of failure. It also provides rapid read protection. He likened the HDFS architecture to a container truck, with Cassandra as the engine. Cassandra provides tunable consistency: it favors availability by default, and applications opt into linearizability (ACID) as needed.
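To make "tunable consistency" a little more concrete, here is a minimal sketch of the replica-overlap rule that is commonly used to reason about it: a read is guaranteed to touch at least one replica that holds the latest acknowledged write when the read and write replica counts together exceed the replication factor. The function names are hypothetical and only three consistency levels are modeled. This is an illustration, not Cassandra's implementation, and it does not cover the linearizable (lightweight transaction) path mentioned in the keynote.

```python
# Illustrative sketch only: models the replica-overlap rule (R + W > N).
# Function names are hypothetical, not part of any Cassandra API.

def replicas_required(level: str, replication_factor: int) -> int:
    """Return how many replicas must acknowledge an operation at this level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed replication factor for the demo
    for write_level, read_level in [("ONE", "ONE"),
                                    ("QUORUM", "QUORUM"),
                                    ("ONE", "ALL")]:
        ok = reads_see_latest_write(write_level, read_level, rf)
        print(f"write={write_level:<6} read={read_level:<6} overlap={ok}")
```

With a replication factor of 3, writing and reading at ONE trades the overlap guarantee for availability and latency, while QUORUM on both sides restores it. That is the kind of per-operation choice the talk described.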

The core values in Cassandra are massive scalability, high performance, and reliability/availability. It delivers performance at scale, resulting in low latency. Using examples, he explained how Cassandra also supports JSON, user-defined functions (UDFs) and UDF aggregation, and local as well as global indexes. Cassandra Query Language (CQL) offers a model very close to SQL in the sense that data is put in tables containing rows of columns.

Andy Terrel, Chief Scientist, Continuum Analytics, delivered an overview of the Python data analytics stack. He started his talk with an introduction to Continuum Analytics and a walkthrough of the key features of the Wakari platform, which facilitates collaborative data analytics on servers for their clients. He mentioned that among the large number of programming languages, Python is unique because it serves a wide spectrum of users and programmers (novice programmers, seasoned developers, quants, scientists, etc.). Python is one of the most often used tools for data analytics, behind only SQL and R. Recently, articles have been appearing about Python displacing R as the programming language for Data Science. Next, he explained the SciPy ecosystem and the NumPy stack. Lastly, he provided an overview of a few libraries from the PyVis ecosystem to show how extensively Python is being used for data visualization.

Srini Penchikala, Lead Editor for NoSQL, gave an interesting workshop on graph databases and Neo4j. He clarified at the outset that graph databases do not store graphs in the sense of charts; graphs and charts are very different things. A graph database is basically an online DBMS with CRUD methods that expose a graph data model.

Two important components of graph databases are a native graph storage engine and native graph processing. Graph databases have no rigid schema; they are organized collections of nodes and relationships. They are heavily used for fraud detection, recommendations, geo-routing, and social data analysis. Cypher is a graph query language that is human-readable and very expressive. At the end, Srini shared some best practices for data modeling, storage, entity relationships and joins, transactions, and concurrency control.
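The "nodes and relationships" model and the recommendation use case above can be sketched in a few lines of plain Python. Everything here is hypothetical and for illustration only: the `TinyGraph` class, the `KNOWS` relationship type, and the sample people are assumptions, and this in-memory structure is not how Neo4j stores or processes graphs. A graph database would express the same friends-of-friends pattern declaratively in a query language such as Cypher.

```python
from collections import defaultdict


class TinyGraph:
    """Hypothetical in-memory graph: nodes with properties plus typed,
    directed relationships. Illustrative only, not a real graph database."""

    def __init__(self):
        self.nodes = {}                      # node_id -> properties dict
        self.out_edges = defaultdict(list)   # node_id -> [(rel_type, target_id)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_relationship(self, source, rel_type, target):
        # Relationships may only connect nodes that already exist.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.out_edges[source].append((rel_type, target))

    def neighbors(self, node_id, rel_type):
        """Follow outgoing relationships of one type from a node."""
        return [t for (r, t) in self.out_edges[node_id] if r == rel_type]


def recommend_friends(graph, person):
    """Friends-of-friends the person does not already know, sorted by id."""
    direct = set(graph.neighbors(person, "KNOWS"))
    candidates = set()
    for friend in direct:
        candidates.update(graph.neighbors(friend, "KNOWS"))
    return sorted(candidates - direct - {person})


def build_example():
    """Build a small made-up social graph for the demo."""
    g = TinyGraph()
    for name in ["alice", "bob", "carol", "dave", "erin"]:
        g.add_node(name, label="Person")
    g.add_relationship("alice", "KNOWS", "bob")
    g.add_relationship("alice", "KNOWS", "dave")
    g.add_relationship("bob", "KNOWS", "alice")
    g.add_relationship("bob", "KNOWS", "carol")
    g.add_relationship("dave", "KNOWS", "carol")
    g.add_relationship("dave", "KNOWS", "erin")
    return g


if __name__ == "__main__":
    print(recommend_friends(build_example(), "alice"))
```

The traversal starts at one node and only touches its neighborhood. This locality is the property that makes the graph model a natural fit for recommendations and social data analysis.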

Highlights from day 3