Spark Summit 2015 San Francisco – Day 2 Keynote Highlights

Highlights from the keynote speeches delivered by big data technology leaders from industry and academia at the Spark Summit 2015 conference in San Francisco.

Twitter Handle: @hey_anmol

Apache Spark is currently one of the hottest technologies in the data science space. Spark Summit 2015, a top-quality conference focused on Apache Spark and held in San Francisco, brought together the Apache Spark community from around the world. The conference ran for three days (June 15-17, 2015).

Leading production users of Spark, SparkSQL, Spark Streaming and related technologies discussed project development and the use of the Spark stack in a variety of verticals and applications. On each of days 1 and 2, there were two keynote sessions followed by three tracks that ran in parallel: Developer, Data Science and Applications. On day 3, Databricks hosted three parallel day-long Spark training sessions: Introductory Apache Spark Training, Advanced DevOps Spark Training and Data Science with Spark.

Highlights from keynotes on day 1

Here are highlights from keynotes on day 2:

Reynold Xin, Cofounder, Databricks explained why Databricks is focusing on compute rather than only on I/O. He talked about Project Tungsten, which aims to make Spark faster, and presented its roadmap. Experimental Project Tungsten features can already be enabled through configuration in Apache Spark 1.4. The key idea is to spend less time creating objects and collecting garbage. Xin reviewed key hardware trends and the project’s impressive performance gains to date. Key quote: “It is not Spark VS Python/R but Spark AND Python/R”

Mike Olson, Chief Strategy Officer, Cloudera opened his talk by remarking on the large number of attendees and taking a selfie with the audience in the background. He showcased Spark success stories at a financial company and a consumer company, and explained where Spark fits in the Hadoop ecosystem. In his view, Spark will be the dominant general-purpose processing engine in Hadoop, extending the ecosystem with new analytic and processing capabilities. He said, “Apache Spark would be the processing engine for Big Data”.

Brian Kursar, Director of Data Science – R&D, Toyota Motor Sales, USA described Toyota's big data journey through its Customer 360 project, which is currently in production. Toyota now uses Spark because it combines compute, streaming and machine learning in a single framework. By recoding a batch job on Spark, the team reduced its runtime from 160 hours to just 4 hours. Toyota monitors social media, analyzes tweets and has an interesting social ML pipeline in place. Spark's speed is enabling kaizen (continuous improvement).

Matt Wood, GM, Product Strategy, AWS said that the 3Vs (volume, velocity and variety) are no longer constraints, but that productivity is still a challenge in big data processes. He showcased a few customers, such as The Washington Post and GumGum, that use Spark in production with Amazon EMR. He also announced the availability of a new Spark on EMR service from AWS.

Doug Wolfe, Chief Information Officer, Central Intelligence Agency delivered an impressive talk giving an overview of the CIA’s key IT requirements and its approach to satisfying them. He said that too much red tape slows down innovation in typical large organizations, and that one shouldn't build a product without understanding the market for it. The CIA recently adopted the C2S cloud (AWS), a major change to the way it does business. C2S provides a compute fabric on demand. The CIA's challenge requires a marketplace of continuous ideas and innovation derived from all sources.

James Peng, Principal Architect, Baidu first introduced Baidu as China’s leading search engine, with about 73% of the search market. In a brief review of Baidu’s big data infrastructure, he mentioned that Baidu's Hadoop cluster currently has more than 13,000 nodes. Baidu started working with Spark 0.8, then built a 200-node cluster on Spark 1.0. Baidu has since put SparkSQL 1.2 in production and is now running more than 1,000 nodes. Baidu’s Interactive Query Engine runs over Spark/Tachyon, cutting query run time to 4 seconds, compared to 30 seconds with Hive. Compared to MapReduce, Baidu observed a performance improvement of more than 50X.

Michael A. Greene, VP, Software & Services Group, Intel announced a performance portal for Spark and briefly talked about the speedups Spark has delivered for a number of companies. In addition, he announced the release of Streaming SQL for Apache Spark, part of a project to develop a complete open source framework for streaming analytics and to make these capabilities pervasive.