Strata + Hadoop World 2015 San Jose – report and highlights

Highlights of Strata + Hadoop World San Jose, including Apache Spark vs Storm vs Samza for streaming data, Kafka as a universal message bus, what Netflix puts in front of HDFS, Parquet as a basis for ETL and analytics, DJ Patil, Internet of Things, and more.

By Jeffrey Sukharev and Ilya Gluhovsky.

(GP: Jeff Sukharev won the KDnuggets raffle for a free pass to Strata + Hadoop World San Jose 2015, and here is a report from him and his colleague about their experience at the conference.)

The first wow was that Spark seems to have captured everyone's imagination. Matei Zaharia of Databricks gave a detailed talk on Spark Streaming.

It surely sounds like Spark Streaming is production ready and, by my read of the consensus, for the most part better than Apache Storm or Samza. Biggest limitation? It processes data in mini-batches, which results in seconds of delay. I imagine that is a non-issue for most real-time analytics, since no user is explicitly waiting on the service. Features include throughput of 100K+ records per node per second (much better than Storm), exactly-once semantics, and reuse of both code and developer proficiency from your batch processing.
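To make the mini-batch trade-off concrete, here is a minimal pure-Python sketch of the idea. It is not Spark Streaming's API: the `micro_batches` helper, the one-second interval, and the sample events are all assumptions made up for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into fixed-width time windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event in the same window lands in the same batch.
        batch_id = int(timestamp // batch_interval)
        batches[batch_id].append(value)
    return dict(sorted(batches.items()))


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.7, "d"), (2.9, "e")]

batches = micro_batches(events, batch_interval=1.0)
counts = {batch_id: len(values) for batch_id, values in batches.items()}
print(counts)  # {0: 2, 1: 1, 2: 2}
```

A batch can only be processed once its window closes, so an event that arrives at the start of a window waits roughly one full interval before anything happens to it. That wait is where the "seconds of delay" comes from, and the same grouping is what lets ordinary batch code be reused.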

Spark currently has the largest community of contributors of any Apache project. One example of Spark outperforming Hadoop I found illuminating: an on-disk sort of 100TB took Hadoop 72 min on 2K machines, while it took Spark only 23 min on 200 machines.
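A quick back-of-the-envelope calculation shows why that comparison is striking. It uses the rounded figures quoted above (2K and 200 machines), so treat the results as approximate; "machine-minutes" is simply my shorthand for machines multiplied by elapsed minutes.

```python
# Rounded figures as quoted in the text above.
hadoop_minutes, hadoop_machines = 72, 2000
spark_minutes, spark_machines = 23, 200

# Total cluster time consumed by each run.
hadoop_machine_minutes = hadoop_minutes * hadoop_machines  # 144000
spark_machine_minutes = spark_minutes * spark_machines     # 4600

wall_clock_speedup = hadoop_minutes / spark_minutes
resource_ratio = hadoop_machine_minutes / spark_machine_minutes

print(f"wall-clock speedup: {wall_clock_speedup:.1f}x")    # 3.1x
print(f"machine-minutes ratio: {resource_ratio:.1f}x")     # 31.3x
```

So Spark was about 3x faster on the clock while using a tenth of the machines, or roughly 30x less total cluster time for the same sort.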

Second, all the big guys seem to be standardizing on Kafka as a universal message bus. Real-time analytics? Kafka + Spark Streaming, especially with the new API in Spark 1.3 (or Samza if you are a real LinkedIn fan).

Search, sending data to DBs, populating anything from news feeds to monitoring dashboards, processing and re-processing data (yes, dataflow graphs can have cycles!): everything goes through Kafka.
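The reason one bus can feed so many consumers is easier to see in a toy model. The sketch below is not Kafka's API. It is a hypothetical `ToyLog` class showing the general idea of an append-only log where each consumer group keeps its own read offset, so consumers do not interfere with each other and any of them can rewind to re-process data.

```python
class ToyLog:
    """Append-only log; each consumer group tracks its own read offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer group name -> next offset to read

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def poll(self, group, max_records=10):
        start = self._offsets.get(group, 0)
        chunk = self._records[start:start + max_records]
        self._offsets[group] = start + len(chunk)
        return chunk

    def rewind(self, group, offset=0):
        # Re-processing is just moving a group's offset backwards.
        self._offsets[group] = offset


log = ToyLog()
for event in ["click:1", "click:2", "view:3"]:
    log.append(event)

dashboard = log.poll("dashboard")                         # reads all three
search_index = log.poll("search-indexer", max_records=2)  # reads at its own pace
log.rewind("dashboard")
replayed = log.poll("dashboard")                          # same three again
```

Reading never removes anything from the log, which is why a news feed, a dashboard, and a re-processing job can all consume the same stream.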

Third, Netflix uses an S3 bucket in front of HDFS, as they do not believe in being able to reliably pipe event data into HDFS directly. This also allows them to spin clusters up and down on demand or on failure using Genie.

Fourth, I really enjoyed the talk "How to use Parquet as a basis for ETL and analytics". If your data dictionary (a list of data fields) is long and most of your queries operate on a small number of fields, as they often do, it pays to use a columnar data format and get a field or two for all users rather than de-serializing and parsing through loads of unneeded data.
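Here is a small pure-Python sketch of the row-versus-column argument. It does not read or write Parquet. The records, field names, and the `to_columns` helper are assumptions invented for illustration, and "values touched" is only a rough stand-in for the de-serialization work a real reader would do.

```python
# Hypothetical user records with several fields each.
rows = [
    {"user_id": 1, "country": "US", "plan": "free", "minutes_watched": 42},
    {"user_id": 2, "country": "DE", "plan": "paid", "minutes_watched": 7},
    {"user_id": 3, "country": "US", "plan": "paid", "minutes_watched": 130},
]


def to_columns(rows):
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: answering a one-field query still walks every full record.
total_from_rows = sum(row["minutes_watched"] for row in rows)
values_touched_rows = sum(len(row) for row in rows)       # 12

# Column layout: the same query touches only the one column it needs.
total_from_columns = sum(columns["minutes_watched"])
values_touched_columns = len(columns["minutes_watched"])  # 3

print(total_from_rows, total_from_columns)  # 179 179
```

With four fields the saving is 4x. With a data dictionary of hundreds of fields and a query that needs one or two, the gap grows accordingly.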

Fifth, some like their data like good sushi: fresh, raw, and ready to eat. Disclaimer: this talk was right before lunch, but no matter. Should you pre-cook your data using complex ETLs and data models or do work at query time? The latter sounds simpler, that's for sure. And if you sort, partition, and compress/serialize (yes, Parquet!) your data to make common computations fast, you just might get the best of both worlds. Perhaps it's an acquired taste.
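The partitioning part of that recipe can be sketched in a few lines of plain Python. This is only an illustration of the general idea, not anything shown in the talk: the events, the choice of `day` as the partition key, and both helper functions are hypothetical.

```python
from collections import defaultdict


def partition_by(records, key):
    """Split raw records into groups keyed by one field's value."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return partitions


def bytes_for_day(partitions, day):
    """Answer a per-day query by scanning only that day's partition."""
    scanned = partitions.get(day, [])
    return sum(record["bytes"] for record in scanned), len(scanned)


# Hypothetical raw events, stored as they arrived.
events = [
    {"day": "2015-02-18", "user": "a", "bytes": 10},
    {"day": "2015-02-18", "user": "b", "bytes": 20},
    {"day": "2015-02-19", "user": "a", "bytes": 5},
    {"day": "2015-02-20", "user": "c", "bytes": 7},
]

partitions = partition_by(events, "day")
total, records_scanned = bytes_for_day(partitions, "2015-02-18")
print(total, records_scanned)  # 30 2
```

The data stays raw, with no pre-aggregation or modeling, yet the common per-day query reads two records instead of four because the layout already matches the question being asked.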

Last, Apache Drill is your data play tool. If you have bits and pieces of your data in spreadsheets, DBs, HDFS and god knows where else, you can still join all of this good stuff in an ad-hoc query right off the bat. That thing will even guess the data schemas on its own. Natural intelligence?

The photo below shows the demand for Data Scientists: the jobs board was overflowing with ads.

Photo: Strata + Hadoop 2015 San Jose, jobs board

Other impressions:
Most keynote presentations concentrated on the opportunities in the Big Data space. The age of the "internet of things" is upon us. Personalized medicine, the revolution in wearable devices, and interesting applications of knowledge graphs are just some of the topics that were briefly touched on by several speakers, including Lisa Hamill of Salesforce, Anil Gadre of MapR, and Adam Kocoloski of IBM.

DJ Patil, the US government's first chief data scientist, talked about establishing a data-driven culture and described his vision for the role of CDO (Chief Data Officer) in organizations. He urged attendees to explore datasets released by the federal government and build data-driven products using this data.

Prof. Poppy Crum from Stanford talked about her research at Dolby Labs in the context of Big Data. Her work is focused on sensory ambiguity resolution and has applications in immersive gaming.

Jeffrey Heer, professor at the University of Washington and co-founder of Trifacta, gave an interesting talk on visualizing different variations of data instead of focusing on one particular design decision. He gave an interesting example from a medical application of how the right visualization can help lead to important discoveries. He also talked about the need to migrate from specialized designer tools to tools that enable decision makers.

Jeffrey Sukharev is a Sr Data Scientist who has been working on Search Platform and other data-driven projects since July 2013. Previously, he was a software engineer for more than a decade, working on software development tools at Rational Software, IBM and on Content Management Server at Interwoven Inc. Sukharev is currently a PhD Candidate in CS at UC Davis, and his interests include data visualization and machine learning.

Ilya Gluhovsky is an entrepreneur, an executive, and a data scientist, working on next generation healthcare analytics. He is a co-founder of GetGoing, a profitable Y Combinator travel technology company, the only travel company to make Time's top 50 websites in 2013. Ilya received his Ph.D. in statistics at Stanford. He is the author of 17 publications in top professional journals and an inventor on 21 patents.