
Alpine Data expects faster, easier Data Science with Spark


Alpine Data Labs becomes one of the first companies to be certified on Apache Spark, which is reported to run up to 100x faster than Hadoop. Alpine answers three questions from KDnuggets.



By Gregory Piatetsky, Mar 18, 2014.

Alpine Data Labs announced today that it is one of the first Enterprise Advanced Analytics Platforms to be certified by Databricks on Apache Spark.

Databricks was founded by the creators of Spark and recently announced Spark Certification to encourage new development.

Alpine software is a collaborative, scalable and visual solution for Advanced Analytics on Big Data and Hadoop, which allows both data scientists and business analysts to work with large data sets, develop and collaborate on models without having to ever use code, download software or move data.

Alpine was recently included among the niche players in the Gartner Magic Quadrant for Advanced Analytics Platforms.

The figure below, provided by Alpine, shows the relative advantage of Spark + Alpine over Hadoop.

Relative performance of Hadoop, Spark, and Alpine
Hadoop iterative algorithms scan through the data on every iteration, taking 921 seconds to go through 150M rows. With Spark, data is cached in memory after the first iteration. Alpine's quasi-Newton method gives a further speedup and allows Alpine to process the same data in 97 seconds.
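To make the caching point concrete, here is a minimal toy sketch in plain Python. It is not Spark or Alpine code, and it says nothing about the timings quoted above. The `CountingSource` class and both `fit_mean_*` functions are hypothetical names invented for this illustration. The sketch only counts how many full passes over the data an iterative algorithm makes when it re-reads the source every iteration, compared with reading it once and keeping it in memory.

```python
# Toy illustration only: plain Python, not Spark or Alpine code.
# All names here are hypothetical and exist only for this sketch.


class CountingSource:
    """Stand-in for data on disk that counts how often it is fully scanned."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.scans = 0

    def scan(self):
        # Each call represents one full pass over the stored data.
        self.scans += 1
        return list(self._rows)


def fit_mean_without_cache(source, iterations, lr=0.5):
    """Gradient descent toward the mean, re-reading the source every iteration."""
    estimate = 0.0
    for _ in range(iterations):
        rows = source.scan()  # full scan on every iteration
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= lr * gradient
    return estimate


def fit_mean_with_cache(source, iterations, lr=0.5):
    """Same algorithm, but the data is read once and kept in memory."""
    rows = source.scan()  # single scan; later iterations reuse `rows`
    estimate = 0.0
    for _ in range(iterations):
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= lr * gradient
    return estimate


data = [1.0, 2.0, 3.0, 4.0]
uncached = CountingSource(data)
cached = CountingSource(data)

result_uncached = fit_mean_without_cache(uncached, iterations=20)
result_cached = fit_mean_with_cache(cached, iterations=20)

# Both variants compute the same estimate; only the number of scans differs.
print(uncached.scans, cached.scans)  # prints: 20 1
```

Both variants arrive at the same estimate, so the only difference is the number of full scans: one per iteration without caching, one in total with it. When each scan is expensive, that difference dominates the running time of an iterative algorithm.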

I asked Steven Hillion, Chief Product Officer at Alpine Data Labs, about the latest certification on Spark.

Gregory Piatetsky: 1. What in your opinion is the biggest impact of Spark? Doing queries faster, enabling more complex analytics, something else?

Steve Hillion: Certainly the thing that Spark is most famous for is increasing the speed of Hadoop, especially on iterative operations where caching the data into memory can speed things up by one or two orders of magnitude. But it comes with a number of goodies that are very appealing to the data scientist. The addition of a machine learning library with MLlib provides the potential for a general framework for advanced analytics on big data; Scala is a very natural basis for doing data science development; and there are natural abstractions for handling datasets and so on that will make it feel like a natural environment for doing investigative analytics.

GP: 2. What is a good example of what you can do with Spark that cannot be done with Hadoop?

SH: At Alpine, we place a strong emphasis on being able to iterate quickly. So it's very important for us to let the user get their hands on the data easily, then build experimental analytics workflows, and then iterate on those without having to reload data or recompute things that haven't changed. Spark makes this much, much easier and faster than Hadoop.

GP: 3. How does Spark affect the collaborative aspect of Alpine Chorus?

SH: Because Spark allows us to return results to scientists, business users and executives at 100x the speed, the collaborative and iterative nature of analytics work done in Alpine is even more visible. That's why we chose to talk about this announcement as "Hadoop at the Speed of Business". In a way, the combination of an agile process enabled by Alpine's collaborative capabilities and Spark's lightning-fast performance makes the two a match made in heaven.

You can try Alpine at start.alpinenow.com.
