Big Data BootCamp: Highlights of talks on Day 3
Highlights from the presentations by big data technology practitioners from Hortonworks, Intel, Rackspace, SciSpike, and Yahoo at Big Data Bootcamp 2014 in Santa Clara.

Despite the great quality of the content and the speakers, it is hard to absorb all the information during the bootcamp itself. KDnuggets helps by summarizing the key insights from all the sessions. These concise, takeaway-oriented summaries are designed both for people who attended the bootcamp but would like to revisit the key sessions for a deeper understanding and for people who could not attend. As you go through them, if you find a session interesting, check KDnuggets, as we will soon publish exclusive interviews with some of these speakers.
In case you missed it: Highlights of talks on Days 1-2 (Apr 23-24)
Here are highlights from selected talks on day 3 (Fri Apr 25):

He explained how Tez's design themes empower end users and deliver a performance advantage over MapReduce through optimal resource management. Talking about the current status of Tez, he said it is an Apache Incubator project under rapid development. For now, the focus is on stability and on support for a wide range of DAG topologies. The next focus areas on its roadmap are richer DAG support, performance optimizations and usability.

He mentioned some of the pain points of the legacy architecture, such as increasing data volumes and scale, high report arrival latency and the lack of interactive SQL. Spark addresses the need for interactive data processing at the REPL and SQL levels. Developers are no longer restricted by the Hadoop MapReduce paradigm. Spark requires less code (2-5x less) and integrates deeply into the Hadoop ecosystem. Shark is an open source distributed SQL query engine for Hadoop data. Common HiveQL provides seamless federation between Hive and Shark. He explained the Spark on YARN model. He ended the talk by discussing some of the upcoming contributions by Yahoo: fixing scheduling performance issues, handling a large number of small splits, handling master fault tolerance and many others.

He claimed that the basic techniques for large-scale simulation and computing are ready. However, large and time-consuming computing tasks need steering, and most of them have a large number of parameters that need to be tuned. Smart data processing algorithms are also ready. However, most data mining algorithms have high computational complexity, i.e. polynomial rather than logarithmic or linear. In the end, expertise is more important than the tool, and data is definitely not a replacement for intuition and intelligence.

Next, he put forward some typical use cases for each NoSQL store category. He mentioned that some of the challenges of NoSQL are: no standards, no typical schema and systems targeting specific areas. Next, he explained the MapReduce programming model and best practices for creating MapReduce jobs. Turning to Flume, he described it as a distributed streaming tool for collecting, aggregating and moving large amounts of log data. It is horizontally scalable and centrally managed, with tunable data reliability. At the end, he restated the important recommendations: use the most suitable technology for one’s task, scale out to crunch big data and integrate with conventional technologies.
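For readers unfamiliar with the MapReduce programming model mentioned above, here is a minimal single-process sketch in Python. It is not taken from the talk: the word-count task and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen for illustration, and a real Hadoop job would distribute these phases across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence (hypothetical local driver)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big bootcamp"]))
```

Because the map and reduce functions only see one line or one key at a time, a framework can run many copies of them in parallel, which is what lets the model scale out.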

90% of the data in the world today was created within the last two years, and it is likely to reach 40 trillion gigabytes by 2020. His talk focused on the market trends in Big Data and showcased the ecosystem growth that has taken place. He highlighted startups and the value they are bringing in moving Big Data forward. Next, he shared his thoughts on the top 10 trends in Big Data Analytics:
- Time to return on investment in Analytics will be a top priority
- Shift from generic solutions to industry-specific vertical solutions
- Analytics everywhere and for everyone
- Sales & Marketing as a key payer for Big Data
- Real-time Big Data will become ubiquitous
- IoT will drive mainstream adoption of Analytics
- Significant increase in the automation of ETL, Analytics and BI
- High growth in the volume of data creators and BI users as new participants join
- IT shops already at a crossroad of balancing cost, security and reliability
- Be ready for a world of Big Data Startups as the barriers to entry diminish