Dataiku Data Science Studio now also runs on Apache Spark

Dataiku Data Science Studio version 2.1 has many useful features for Data Scientists, including integration with Apache Spark.

Like most software vendors in the advanced analytics market, we've been keeping a close eye on the growth of Apache Spark. The cluster computing platform has transformed large-scale data processing: Apache Spark enables programs to run up to 100x faster than Hadoop MapReduce in memory and 10x faster on disk. These benchmark speeds have empowered organizations to run iterative analytics on massive datasets, effectively giving them access to a new world of predictive analytics. That's why we've decided to pair the capabilities of Apache Spark with the advanced analytics features of DSS, creating significant opportunities for those looking to leverage very large datasets. Indeed, today we're happy to announce integrated functionality between DSS 2.1 and Spark!

(1) Spark for coders, but also for clickers

For clickers:
Visual Recipes, which are a core component of DSS, can now be executed on the Apache Spark framework, leveraging the Spark SQL module as the data processing engine. DSS data wrangling recipes such as Prepare, Group, Join, and Stack can be executed in Spark, helping DSS users perform tasks such as joins and aggregations dozens, if not hundreds, of times faster than what could be accomplished on Hadoop using Apache Hive.

For coders:
DSS offers an integrated development environment in which developers can rapidly build ad hoc queries that are then processed against selected datasets, creating visual representations of the relationships found in the data. With Apache Spark integration, DSS can now work with SparkR, Spark SQL, and PySpark, which bring R, SQL, and Python programming to the Spark environment. Much like the other components of Spark, PySpark and SparkR extend and speed up the native capabilities found in DSS and make Spark a viable alternative to the traditional Hadoop/Hive stack.



(2) It's not just about Volume, it's also about Collaboration

When using local R or Python stacks for interactive analysis with advanced algorithms, the volume of data is essentially limited to a few gigabytes. With Apache Spark, however, that limit increases to hundreds of gigabytes. But volume is just one element of the equation: speed, both in-memory and on-disk, is essential. After all, faster speeds mean that more data can be computed, which in turn means that models and forecasts are more accurate. Thankfully, speed is something Apache Spark has plenty of.

DSS Spark integration promotes a collaborative environment via its PySpark and SparkR frameworks. Team members using PySpark or SparkR can share cluster resources, effectively distributing computing power without compromising performance. Furthermore, it allows the team to work together in the languages and technologies they know best (R, Python, and more) and to share data engineering recipes while limiting the need to recode or redevelop algorithms.