Apache Spark, the hot new trend in Big Data

Spark solves problems similar to those Hadoop MapReduce addresses, but with a fast in-memory approach and a clean functional-style API. Leveraging Hadoop YARN, Alpine has made it very simple to get started with Spark.

By Joel Horwitz, Alpine Data Labs, Apr 16, 2014.

Two years ago I was having coffee with my friend, and now colleague, Dr. Will Ford in a cafe in San Mateo. We were talking about data science and analytics when he leaned in real close to say, “Have you heard of Spark? This is going to change everything, again.” I had not heard of Spark and started researching the technology the moment I got back to my desk. I quickly realized what all of the fuss was about when I landed on the Berkeley AMPLab site.

Spark is a new technology that sits on top of the Hadoop Distributed File System (HDFS) and is characterized as “a fast and general engine for large-scale data processing.” Spark has three key features that make it the most interesting up-and-coming technology to rock the big data world since Apache Hadoop in 2005.
  1. For iterative analysis like logistic regression, Random Forests, or other advanced algorithms, Spark has demonstrated a 100X increase in speed that scales to hundreds of millions of rows.
  2. Spark has native support for the latest and greatest programming languages: Java, Scala, and of course Python.
  3. Spark is general and compatible in both directions: it integrates nicely with SQL engines (Shark), machine learning (MLlib), and streaming (Spark Streaming), and, using Hadoop’s new YARN cluster manager, it requires no new software to be installed on the cluster.

At Alpine, we have made it dead simple to get started with Spark by including the technology in our latest build out of the box. We require no additional software or hardware to leverage our extensive list of operators for data transformation, exploration, and building advanced analytic models. We leverage Hadoop YARN (Hadoop NextGen) to launch Spark jobs without any pre-installation of Spark or modification of the cluster configuration. This gives our customers seamless integration between our Spark implementation and their Hadoop stack. For example, at last month's GigaOM conference we analyzed 50 million rows of account data in 50 seconds on a 20-node cluster.

The screenshot below shows how Spark does a quick in-memory iteration. It uses a standard way to do the gradient aggregation, as implemented by Databricks, a company that commercializes the Apache Spark framework.

[Screenshot: Spark In-memory Iteration]

Also, see a demo at http://video.alpinenow.com/medias/f1nq8m48eu
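For readers who want a feel for what "in-memory iteration with gradient aggregation" means, here is a minimal sketch in plain Python. It is not Alpine's or Databricks' code and it does not use the Spark API: a generator expression and functools.reduce stand in for the map and reduce steps a cluster would run, and every name in it (point_gradient, add_vectors, train, the toy data) is hypothetical. The idea it illustrates is that the data set is loaded once and reused on every pass, while each pass computes one gradient per record and sums them.

```python
import math
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def point_gradient(weights, features, label):
    """Gradient of the logistic loss for a single (features, label) record."""
    prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
    return [(prediction - label) * x for x in features]


def add_vectors(a, b):
    """Element-wise sum; this is the 'aggregation' step."""
    return [x + y for x, y in zip(a, b)]


def train(points, dims, iterations=100, step=0.5):
    # `points` is held in memory and reused on every iteration,
    # instead of being re-read from storage for each pass.
    weights = [0.0] * dims
    for _ in range(iterations):
        # "map": one gradient per record; "reduce": sum them into one vector.
        gradient = reduce(
            add_vectors,
            (point_gradient(weights, x, y) for x, y in points),
        )
        weights = [w - step * g / len(points) for w, g in zip(weights, gradient)]
    return weights


# Toy data (made up for illustration): the first feature is a constant bias term.
points = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
weights = train(points, dims=2)
predictions = [
    round(sigmoid(sum(w * x for w, x in zip(weights, features))))
    for features, _ in points
]
```

On a real cluster the records would be split across machines and the per-record gradients summed across them, but the shape of the loop is the same: keep the data resident, then repeat a map and an aggregate until the weights settle.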

Interested in learning more about Alpine Chorus and Spark? Head over to http://start.alpinenow.com to get started.

Joel Horwitz is a passionate data guru at Alpine Data Labs.