Big Data Bootcamp: Highlights of talks on Day 3

Highlights from the presentations by big data technology practitioners from Hortonworks, Intel, Rackspace, SciSpike, and Yahoo at Big Data Bootcamp 2014 in Santa Clara.

Big Data Bootcamp 2014 (Apr 23-25, 2014) was organized by Global Big Data Conference at the Santa Clara Convention Center in Santa Clara, CA. This 3-day, fast-paced and vendor-agnostic bootcamp provided attendees with a comprehensive technical overview of the Big Data landscape. It was targeted at both technical and non-technical people who want to understand the emerging world of Big Data, with a specific focus on Hadoop, NoSQL & Machine Learning. It brought together data science experts from industry for three days of insightful presentations, hands-on learning and networking, covering a wide range of topics including Hadoop, MapReduce, Amazon EC2, Cassandra, YARN, Pig, various use cases and much more.

Despite the great quality of the content and speakers, it is hard to grasp all the information during the bootcamp itself. KDnuggets helps by summarizing the key insights from all the sessions at the bootcamp. These concise, takeaway-oriented summaries are designed both for people who attended the bootcamp but would like to revisit key sessions for a deeper understanding, and for people who could not attend. For any session you find interesting, check KDnuggets, as we will soon publish exclusive interviews with some of these speakers.

In case you missed: Highlights of talks on Days 1-2 (Apr 23-24)

Here are highlights from selected talks on day 3 (Fri Apr 25):

Bikas Saha from Hortonworks gave a talk on Apache Tez. He introduced Tez as a distributed execution framework built on top of YARN and targeted at data-processing applications. By supporting more general data-processing applications, Tez benefits the entire Hadoop ecosystem. Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks.

He explained how Tez design themes empower end users and gain a performance advantage over MapReduce through optimal resource management. On the current status of Tez, he said it is an Apache Incubator project under rapid development. For now, the focus is on stability and support for a vast topology of DAGs. Next on its roadmap are richer DAG support, performance optimizations and usability.
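To make the DAG idea concrete, here is a minimal sketch (in Python, not Tez's actual Java API, with made-up task names) of how a job can be expressed as a graph of dependent tasks and run in dependency order, something a fixed two-stage map-then-reduce pipeline cannot express directly:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical task graph: each task maps to the set of tasks it depends on,
# analogous to vertices and edges in a Tez DAG.
dag = {
    "map_users": set(),
    "map_events": set(),
    "join": {"map_users", "map_events"},
    "aggregate": {"join"},
}

# A scheduler runs tasks in topological (dependency-respecting) order.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The two map tasks can run in parallel, while `join` waits for both, which is exactly the kind of multi-input stage that motivates moving from linear MapReduce chains to a general DAG.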

Nanda Jayakumar, Senior Principal Architect, Yahoo! delivered a talk on the big data ecosystem at Yahoo! and the role played by Apache Spark in that ecosystem. In enterprise big data there is a clear movement towards BI tools and higher-level interfaces such as SQL and (J/O)DBC. Another trend is migration away from batch towards interactive and other modes. Today it is more about resource management (YARN/Mesos) and less about Hadoop MapReduce proper.

He mentioned some legacy architecture pain points such as increasing data volumes and scale, high report arrival latency, and the lack of interactive SQL. Spark addresses the need for interactive data processing at the REPL and SQL levels, so developers are no longer restricted to the Hadoop MapReduce paradigm. Spark involves less code (up to 2-5x) and integrates deeply into the Hadoop ecosystem. Shark, built on Spark, is an open source distributed SQL query engine for Hadoop data; common HiveQL provides seamless federation between Hive and Shark. He explained the Spark on YARN model, and ended the talk discussing some upcoming contributions by Yahoo: fixing scheduling performance issues, handling large numbers of small splits, handling master fault tolerance, and many others.
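The "less code" point is easiest to see with word count, the canonical example: a Spark job chains flatMap, map and reduceByKey in a handful of lines. The sketch below emulates that pipeline in plain Python so it runs without a Spark cluster; the input lines are purely illustrative.

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to do is to be"]  # stand-in for an RDD of lines

words = chain.from_iterable(line.split() for line in lines)  # flatMap
pairs = ((word, 1) for word in words)                        # map
counts = Counter()
for word, n in pairs:                                        # reduceByKey
    counts[word] += n

print(counts["to"], counts["be"])  # 4 3
```

The equivalent hand-written Hadoop MapReduce job needs separate mapper and reducer classes plus driver boilerplate, which is where the 2-5x code reduction comes from.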

Arpit Gupta, Senior Leader, Data and Analytics Science, Rackspace talked about “Relationships in Big Data, Cloud & Predictive Analytics”. He started by emphasizing the huge amounts of data generated every day and explaining different applications of big data analytics. Among the most requested uses of Big Data are log analytics & storage, smarter utilities, RFID tracking & analytics, and fraud/risk management modeling.

He claimed that the basic techniques for large-scale simulation and computing are ready. However, large and time-consuming computing tasks need steering, and most of them have a large number of parameters that need to be tuned. Smart data processing algorithms are also ready, but most data mining algorithms have high computational complexity, i.e. polynomial rather than logarithmic or linear. In the end, expertise is more important than the tool, and data is definitely not a replacement for intuition and intelligence.

Vladimir Bacvanski from SciSpike delivered a speech on "How to Succeed with Polyglot Persistence in the Enterprise". The term “polyglot persistence” was coined to convey that big data programmers should use several database systems, each being the best choice in its application area. With the help of a graphic he explained that relational databases, though still feasible for big data, become expensive and slow at scale. NoSQL, on the other hand, enables unlimited volume growth while stabilizing cost and increasing performance. Typical NoSQL systems are non-relational, distributed, horizontally scalable and have no need for a fixed schema. He then categorized NoSQL stores as column-family, document-oriented, key-value and graph databases; one should choose the store that best matches one’s application.
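To illustrate those four categories, the same record takes a different shape in each store family. The sketch below uses plain Python dicts as stand-ins for real stores, and the user record is entirely made up:

```python
import json

# Key-value: the store sees only an opaque value under a single key.
kv_store = {"user:42": json.dumps({"name": "Ada", "city": "London"})}

# Document-oriented: the store understands nested structure and can index fields.
doc_store = {"users": {42: {"name": "Ada", "city": "London"}}}

# Column-family: each row is a sparse map of family:qualifier columns to values.
cf_store = {"users": {42: {"info:name": "Ada", "info:city": "London"}}}

# Graph: nodes plus first-class relationships between them.
graph_store = {"nodes": {42: {"name": "Ada"}},
               "edges": [(42, "LIVES_IN", "London")]}

print(json.loads(kv_store["user:42"])["name"])
```

The point of polyglot persistence is visible even in this toy: a key-value store cannot query by city, a document store can, and only the graph store makes the relationship itself queryable.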

Next, he put forward some typical use cases for each NoSQL store category. He mentioned some challenges of NoSQL: no standards, no typical schema, and systems targeting specific areas. He then explained the MapReduce programming model and best practices for creating MapReduce jobs. Talking of Flume, he described it as a distributed streaming tool for collecting, aggregating and moving large amounts of log data; it is horizontally scalable and centrally managed, with tunable data reliability. At the end, he restated the important recommendations: use the most suitable technology for the task, scale out to crunch big data, and integrate with conventional technologies.
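The MapReduce programming model he covered boils down to three phases: map emits key-value pairs, the framework shuffles (groups) them by key, and reduce folds each group into an output. Here is a minimal single-machine sketch of those phases (illustrative Python, not Hadoop's Java API, with made-up input records):

```python
from collections import defaultdict

docs = ["big data big ideas", "data wins"]

# Map phase: emit a (key, value) pair for every word in every input record.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group values by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: fold each key's list of values into a single output value.
reduced = {key: sum(values) for key, values in groups.items()}
print(reduced)  # {'big': 2, 'data': 2, 'ideas': 1, 'wins': 1}
```

In a real job, map tasks run in parallel over input splits and reduce tasks in parallel over key partitions; only the shuffle forces data movement across the cluster, which is why a key best practice is minimizing what the mappers emit.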

Sanjit Dang, Investment Director at Intel, started with a quick fact:
90% of the data in the world today was created within the last two years and it is likely to reach 40 trillion gigabytes by 2020.
His talk focused on the market trends in Big Data and showcased the ecosystem growth that has taken place. He highlighted startups and the value they're bringing to moving Big Data forward. Next, he shared his thoughts on the top 10 trends in Big Data Analytics:
  1. Time to return on investment in Analytics will be a top priority
  2. Shift from generic solutions to industry-specific vertical solutions
  3. Analytics everywhere and for everyone
  4. Sales & Marketing as a key payer for Big Data
  5. Real-time Big Data will become ubiquitous
  6. IOT will drive mainstream adoption of Analytics
  7. Significant increase in the automation in ETL, Analytics and BI
  8. High growth in the volume of data creators and BI users as new participants join
  9. IT shops already at a crossroads of balancing cost, security and reliability
  10. Be ready for a world of Big Data Startups as the barriers to entry diminish