BigData TechCon San Francisco Report: Focus on Spark

BigData TechCon SF 2014 covered a number of data technologies from the open source ecosystem through tutorials and classes. Spark and its libraries were a significant focus of the talks.

By Arun Swami, Nov 2014.

Big Data TechCon Oct 27-29, 2014, San Francisco

The conference was well attended and packed with interesting tutorials and classes. There were often multiple tutorials/classes at the same time that all looked very interesting, but one had to make a choice!

Apache Spark

The major takeaway from the conference was the rapid momentum of Apache Spark in the Hadoop ecosystem. A number of speakers said that Spark Core was reasonably mature and could handle processing on up to 100 nodes and 1 TB of data without any problems. The newer modules in the Spark family (Spark Streaming, Spark SQL, MLlib, ...) are usable and well along, but may become easier to use over the next 6 months.

On the first day, Sameer Farooqui (Databricks) led a superb tutorial/hands-on lab on Hadoop Fundamentals. He covered a number of technologies in the Hadoop ecosystem. The lab is available under a Creative Commons license. He shared that Databricks had chartered Paco Nathan to be developer evangelist for the Spark ecosystem. Their goal is to reach and train 100K people on Spark technologies over the next year!

On the second day, Dean Wampler's session on Big Data Programming in Spark illustrated how applications can be developed fairly easily in Spark using the Scala programming language. Spark is written in Scala, and while it offers Scala, Java, and Python APIs, Scala is the language of choice for working with Spark.
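Wampler's point is easiest to see through Spark's signature example, word count. The following is a plain-Python sketch (not actual Spark code; the function name word_count is chosen here for illustration) of the flatMap/map/reduceByKey pipeline that Spark's RDD API expresses in just a few lines:

```python
from collections import defaultdict

def word_count(lines):
    """Plain-Python analogue of the classic Spark word count:
    flatMap(split) -> map(word -> (word, 1)) -> reduceByKey(add)."""
    # flatMap: split each line into individual words
    words = [w for line in lines for w in line.split()]
    # map: pair each word with a count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark, each of these steps would be a transformation on a distributed RDD rather than an in-memory list, which is what makes the same pattern scale across a cluster.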

Gloria Lau (Timeful) gave a well-attended keynote on "How do we build Data Products in the Right Order". She outlined a number of rules of thumb to approach building data products well. She suggested two questions to ask:
  • What is the one metric that your data product will move?
  • When collecting data, ask: if each user uses your product a minute a day, how would you use it to get data?

She offered Donald Knuth's quotation
"Premature optimization is the root of all evil"
as a guideline for deciding what to work on next when building data products.

The talks on "Spark Streaming" and using "GraphX for graph analysis on top of Spark" were thought provoking but indicated that the technologies were still in flux.

Krishna Sankar's two-part class on "Machine Learning in Python using Spark" was well prepared and delivered. The class conveyed a lot of information that allowed people to see how they could use Spark MLlib to perform common machine learning tasks, including data wrangling. Here are his slides.
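To give a flavor of the kind of task MLlib automates at scale, here is a minimal plain-Python sketch of an ordinary least-squares line fit (this is not MLlib code; fit_line is a hypothetical name chosen for illustration). MLlib's regression routines compute fits of this kind over data distributed across a cluster:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on small in-memory data.
    Returns the slope a and intercept b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    # Intercept: the fitted line passes through the mean point
    b = mean_y - a * mean_x
    return a, b

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
# (2.0, 1.0)
```

The in-memory arithmetic is trivial; the value MLlib adds is running the equivalent computation (and much richer models) on data too large for a single machine.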

There were exhibits from a number of vendors including Splice Machine, Actuate, Aerospike, Data Torrent, and Voltage Security.

Arun Swami

Arun Swami is a Bay Area entrepreneur and tech leader who has created innovative systems using text mining, ranking algorithms, heuristic approaches, data mining, personalization technology, database algorithms, and optimization algorithms. Arun was a key member of the team that started IBM's research in data mining and has published seminal work in this area. His classic data mining paper with Rakesh Agrawal, "Mining Association Rules", is ranked among the most cited CS papers.