Strata + Hadoop World 2015 San Jose – Day 1 Highlights

Here are the quick takeaways and valuable insights from selected talks at one of the most respected conferences in Big Data: Strata + Hadoop World 2015, San Jose.

One of the biggest challenges at big conferences such as Strata + Hadoop World 2015 is that there is so much happening quickly and simultaneously that it is almost impossible to catch all the action.
We help you by summarizing the key insights from some of the best and most popular sessions at the conference. These concise, takeaway-oriented summaries are designed both for people who attended the conference but would like to revisit the key sessions for a deeper understanding and for people who could not attend.

Quick Takeaways from Keynotes:

Amr Awadallah, CTO, Cloudera shared his vision of the Modern Information Architecture. He said that the growth of Big Data technology should focus primarily on flexibility, scalability and balancing the economics of how we store data. He added that Hadoop isn't just Hadoop anymore: the Hadoop stack is constantly evolving and growing.

Eric Frenkiel, CEO, MemSQL shared a distinctive view of a real-time database for both transactions and analytics. He announced a connector between MemSQL and Spark to enable real-time big data analytics.

Lisa Hammit, Salesforce talked about how wearables are contributing to Big Data and how the resulting insights are already delivering significant gains in key areas such as health, fashion/retail and sensory enhancement.

Anil Garde, MapR emphasized that it is essential to take a data-centric approach to infrastructure in order to provide flexible, real-time data access, collapse data silos and automate data-to-action for immediate operational benefits.

Adam Kocoloski, IBM talked about the partnership between IBM and Twitter, which combines advances in analytics, cloud and cognitive computing in a way that has the potential to transform how institutions understand customers, markets and trends.

US President Barack Obama talked about the importance of Big Data and Data Science, and introduced Dr. DJ Patil as the first ever Chief Data Scientist and Deputy Chief Technology Officer for Data Policy. DJ talked about the mission and responsibilities that lie ahead for Data Science professionals.

DJ outlined the top priorities as:
  • Precision Medicine
  • Data Products
  • Responsible Data Science

He mentioned that Data Science is a team sport and asked each and every data scientist to join him to make a difference.

Solomon Hsiang, UC Berkeley discussed how data and statistical inference are informing how we manage the global climate rationally, a defining policy challenge for our generation.

Poppy Crum, Dolby Laboratories talked about how understanding sensory interactions and being able to define them perceptually and algorithmically allows technological developments that can facilitate sensory enhancement and optimization.

Valuable Insights from Selected Talks:

Dr. Joe Hellerstein and Adam Silberstein from Trifacta talked about the technical challenges of making data profiling agile in the Big Data era. They shared research results and practices used by analysts in the field, including methods for sampling, summarizing and sketching data, and the pros and cons of each approach for different profiling needs in a Big Data context. They discussed the considerations for using Hadoop technologies for data profiling, and some of the pitfalls in the contexts of both massive Internet services and end-user profiling tools. They emphasized "ABP: Always Be Profiling": injecting lightweight "sidecar" jobs automatically and exposing their outputs visually for effortless human inspection. They also briefly discussed ways to make wise tradeoffs between latency and accuracy: approximation, heuristics, reasonable assumptions, and an understanding of the performance of the underlying system.
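To make the sampling idea concrete, here is a minimal sketch of profiling a column from a fixed-size uniform sample rather than a full scan. It uses reservoir sampling as one possible sampling method; this is an illustration, not necessarily the technique the speakers presented. The function names, the summary fields and the example column are all assumptions made for this sketch.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable
    of unknown length, using O(k) memory (classic reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def profile(values):
    """Summarize a list of values: row count, null fraction,
    distinct count and the most common non-null values."""
    n = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "rows": n,
        "null_fraction": (n - len(non_null)) / n if n else 0.0,
        "distinct": len(counts),
        "top": counts.most_common(3),
    }


if __name__ == "__main__":
    # Hypothetical column: one million codes with an occasional null.
    column = (None if i % 50 == 0 else i % 7 for i in range(1_000_000))
    print(profile(reservoir_sample(column, k=1000, seed=42)))
```

The sample gives an approximate profile in a single pass with constant memory, which is the kind of latency-for-accuracy tradeoff described above; exact distinct counts or rare values would still need a full scan or a dedicated sketch structure.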

Sheetal Dolas, Hortonworks talked about design patterns for streaming. He defined a design pattern as a general, reusable solution to a commonly occurring problem within a given context in software design. Such patterns are needed because streaming use cases have distinct characteristics (such as unpredictable incoming data patterns), and high-scale continuous streams pose the challenge of peaks and valleys. He categorized streaming patterns as follows:
  • Architectural (e.g.: real-time streaming)
  • Functional (e.g.: Stream Joins)
  • Data Management (e.g.: External Lookup)
  • Data Security (e.g.: Message Encryption)

He discussed external lookup, responsive shuffling and out-of-sequence events under Data Management Patterns. He described external lookup as referencing frequently changing data in an external system for event enrichment, filtering or validation, while minimizing event processing latency and system bottlenecks and maintaining high throughput. The pattern has several benefits: for example, only the required data is cached, and each bolt caches a partition of the reference data. On the other hand, its challenges include increased latency due to frequent external system calls, and scalability and performance issues with large reference data sets.
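The caching side of the external lookup pattern can be sketched roughly as follows. This is a plain Python illustration, not Storm code and not the speaker's implementation: `CachedLookup`, `enrich`, the `user_id` and `region` fields, and the time-to-live policy are all hypothetical choices made for this example.

```python
import time


class CachedLookup:
    """Cache results of an external lookup for a limited time, so that
    frequently changing reference data is refreshed periodically while
    most events avoid a call to the external system."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch        # callable that queries the external system
        self._ttl = ttl_seconds    # how long a cached value stays valid
        self._clock = clock        # injectable clock, useful for testing
        self._cache = {}           # key -> (value, expiry time)
        self.external_calls = 0    # counter to observe cache effectiveness

    def get(self, key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]          # fresh cached value, no external call
        # Miss or expired entry: go to the external system and re-cache.
        self.external_calls += 1
        value = self._fetch(key)
        self._cache[key] = (value, now + self._ttl)
        return value


def enrich(events, lookup):
    """Attach reference data to each event (hypothetical 'region' field)."""
    for event in events:
        yield {**event, "region": lookup.get(event["user_id"])}


if __name__ == "__main__":
    # Stand-in for a slow external system.
    reference = {1: "us-west", 2: "eu-central"}
    lookup = CachedLookup(reference.get, ttl_seconds=30.0)
    stream = [{"user_id": 1}, {"user_id": 1}, {"user_id": 2}]
    print(list(enrich(stream, lookup)))
    print("external calls:", lookup.external_calls)
```

Only keys that actually appear in the stream are cached, and a shorter time-to-live trades more external calls for fresher reference data. In a partitioned topology, each worker would hold such a cache for its own share of the keys.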

He described responsive shuffling as shuffling that adjusts automatically for better performance and throughput during peaks and under varying data skew in streams. Responsive shuffling is challenging when the stream is unpredictable and skewed. Its benefit is that the topology responds to changes in data patterns and adapts accordingly. He also discussed the good and bad aspects of out-of-sequence events.

Sumeet Singh and Thiruvel Thirumoolan from Yahoo shared approaches to managing data location, schema knowledge and evolution, fine-grained access control based on business rules, and audit and compliance needs, all of which are becoming critical as the scale of operations increases. They explained how to register existing HDFS files, provide broader but controlled access to data through a data discovery tool with schema browse and search functionality, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications.

They presented the growth of HDFS over the last 10 years in an infographic and shared that at present they run 42,300 servers and store about 600 PB of data. They briefly explained how they process and analyze data with Hadoop, and shared the distribution of Hadoop jobs on their platform as of January. The top three were Pig (47%), Oozie Launcher (35%) and Java MapReduce (7%).

Managing data on multi-tenant platforms poses various challenges such as:
  • Data shared across tools such as MR, Pig and Hive
  • Schema and semantic knowledge across the company
  • Fine-grained access controls (row/column) vs. all or nothing

They talked about Apache HCatalog, a table and storage management layer for Hadoop that enables users of different data processing tools (Apache Pig, Apache MapReduce and Apache Hive) to more easily read and write data on the grid. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS), so that users need not worry about where or in what format their data is stored.

Ted Dunning, MapR and Ellen Friedman presented real-world use cases showing how people who understand the new approaches built on Hadoop and NoSQL can employ them well in production, in conjunction with traditional approaches and existing applications.

Highlights from Day 2