Spark Summit 2015 San Francisco – Day 1 Keynote Highlights

Highlights from the keynote speeches delivered by big data technology leaders from industry and academia at the Spark Summit 2015 conference in San Francisco.

Twitter Handle: @hey_anmol

Apache Spark is currently one of the hottest technologies in the data science space. Spark Summit 2015, a top-quality conference focused on Apache Spark, is being held in San Francisco and is bringing together the Apache Spark community from various parts of the world. The three-day event (June 15-17, 2015) is still on.

Leading production users of Spark, Spark SQL, Spark Streaming and other related technologies are discussing project development and the use of the Spark stack in a variety of verticals and applications. On days 1 and 2, there were two keynote sessions followed by three tracks that ran in parallel: Developer, Data Science and Applications. On day 3, Databricks is hosting three parallel day-long Spark training sessions: Introductory Apache Spark Training, Advanced DevOps Spark Training and Data Science with Spark.

Here are highlights from keynotes on day 1:

Matei Zaharia, creator of Apache Spark, opened the summit with some impressive statistics on the current state of the Apache Spark project. Some important points he mentioned:

  • In the past year, the number of contributors increased from 255 to 730, and committed lines of code increased from 175k to about 400k.
  • The largest Spark cluster, with 8,000 nodes, is at Tencent in China, which has 1 billion users, and the largest single job processed 1 petabyte of data at Alibaba.

Patrick Wendell, founding committer and PMC member of Apache Spark, recapped the features of Apache Spark 1.4. Apache Spark 1.5 is expected to add the following features: Project Tungsten, expansion of SparkR to include ML APIs, and more streaming features.

Ion Stoica, CEO, Databricks, announced that Databricks is now generally available (GA). New features available in Databricks deployments are:
  • Spark 1.4 Support
  • Spark Streaming in notebooks
  • Improved commenting

Ali Ghodsi, VP, Engineering & Product Management, Databricks, gave a quick demonstration of Databricks Cloud and described its features.

Anil Gadre, Senior VP, Product Management, MapR, shared a few good use cases and success stories from several companies.

Beth Smith, General Manager, Analytics Platform, IBM, made the following major announcements:
  • IBM will build Spark into the core of its analytics and commerce products.
  • IBM Watson Health Cloud will leverage Spark.
  • IBM will offer Spark as a cloud service on the Bluemix platform.
  • IBM will commit more than 3,500 researchers and developers to work on Spark-related projects and open a Spark Technology Center in San Francisco.
  • IBM will educate more than 1 million data scientists and data engineers on Spark.

Arun C Murthy, Founder & Architect, Hortonworks, asserted that Spark and Hadoop work perfectly together. Hortonworks has made Spark enterprise-ready by enabling Spark on YARN and applying enterprise governance, security and operations services to Spark applications. They have integrated Spark as part of the HDP 2.2 release. He also talked about Apache Zeppelin, a web-based notebook that enables interactive data analytics. With it, one can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

Gloria Lau, VP of Data, Timeful (acquired by Google), emphasized that companies should not hire data scientists and put them to work as “SQL monkeys”. Instead, data scientists should implement self-service BI for business users.

Chris Mattmann, Chief Architect, NASA/JPL, talked about complex, challenging data problems and how they’re using Spark to solve them.

Tim O’Reilly, Founder, O’Reilly Media, delivered a keynote on “Software Above the Level of a Single Device: The Implications”. He emphasized that we should all work on things that matter.

Highlights from keynotes on day 2 will be published soon.