Deep Learning With Apache Spark: Part 2

In this article I’ll continue the discussion on Deep Learning with Apache Spark. I will focus entirely on the DL pipelines library and how to use it from scratch.

By my sister

Hi everyone and welcome back to learning :). In this article I’ll continue the discussion on Deep Learning with Apache Spark. You can see the first part here.

In this part I will focus entirely on the DL pipelines library and how to use it from scratch.


Apache Spark Timeline

The continuous improvements on Apache Spark lead us to this discussion on how to do Deep Learning with it. I created a detailed timeline of the development of Apache Spark until now to see how we got here.

Soon I’ll create an article with descriptions for this timeline but if you think there’s something missing please let me know :)


Deep Learning Pipelines



Deep Learning Pipelines is an open source library created by Databricks that provides high-level APIs for scalable deep learning in Python with Apache Spark.

spark-deep-learning — Deep Learning Pipelines for Apache

It is an awesome effort and it won’t be long until is merged into the official API, so is worth taking a look of it.

Some of the advantages of this library compared to the ones that joins Spark with DL are:

  • In the spirit of Spark and Spark MLlib, it provides easy-to-use APIs that enable deep learning in very few lines of code.
  • It focuses on ease of use and integration, without sacrificing performace.
  • It’s build by the creators of Apache Spark (which are also the main contributors) so it’s more likely for it to be merged as an official API than others.
  • It is written in Python, so it will integrate with all of its famous libraries, and right now it uses the power of TensorFlow and Keras, the two main libraries of the moment to do DL.

Deep Learning Pipelines builds on Apache Spark’s ML Pipelines for training, and with Spark DataFrames and SQL for deploying models. It includes high-level APIs for common aspects of deep learning so they can be done efficiently in a few lines of code:

  • Image loading
  • Applying pre-trained models as transformers in a Spark ML pipeline
  • Transfer learning
  • Applying Deep Learning models at scale
  • Distributed hyperparameter tuning (next part)
  • Deploying models in DataFrames and SQL

I will describe each of these features in detail with examples. These examples comes from the official notebook by Databricks.


Apache Spark on Deep Cognition

To run and test the codes in this article you will need to create an account in Deep Cognition.

Is very easy and then you can access all of their features. When you log in this is what you should be seeing:

Now just click on the left part, the Notebook button:

And you will be on the Jupyter Notebook with all the installed packages :). Oh! A note here: The Spark Notebook (DLS SPARK) is an upcoming feature which will be released to public sometime next month and tell that it is still in private beta (just for this post).

You can download the full Notebook here to see all the code:

Image Loading

The first step to applying deep learning on images is the ability to load the images. Deep Learning Pipelines includes utility functions that can load millions of images into a DataFrame and decode them automatically in a distributed fashion, allowing manipulation at scale. The new version of spark (2.3.0) has this ability too but we will be using the sparkdl library.

We will be using the archive of creative-commons licensed flower photos curated by TensorFlow to test this out. To get the set of flower photos, run these commands from the notebook (we will also create a sample folder):


Let’s copy some photos from the tulips and daisy folders to create a small sample of the photos.


To take a look at these images on the notebook you can run this:


You should be seeing this

Now let’s use Spark to load this images as a DataFrame. The method spark.readImage lets you read images in common formats (jpg, png, etc.) from HDFS storage into DataFrame. Each image is stored as a row in the imageSchema format. The recursive option allows you to read images from subfolders, for example for positive and negative labeled samples. The sampleRatio parameter allows you to experiment with a smaller sample of images before training a model with full data.


If we take a look at this dataframe we see that it spark created one column, called “image”.

|               image|

The image column contains a string column contains an image struct with schema == ImageSchema.

Transfer learning


Deep Learning Pipelines provides utilities to perform transfer learning on images, which is one of the fastest (code and run-time -wise) ways to start using deep learning. Using Deep Learning Pipelines, it can be done in just several lines of code.

Deep Learning Pipelines enables fast transfer learning with the concept of a Featurizer. The following example combines the InceptionV3 model and logistic regression in Spark to adapt InceptionV3 to our specific domain. The DeepImageFeaturizer automatically peels off the last layer of a pre-trained neural network and uses the output from all the previous layers as features for the logistic regression algorithm. Since logistic regression is a simple and fast algorithm, this transfer learning training can converge quickly using far fewer images than are typically required to train a deep learning model from ground-up.

Firstly, we need to create training & test DataFrames for transfer learning.


And now let’s train the model


Let’s see how well the model does:


Test set accuracy = 0.9753086419753086

Not so bad for an example and with no tunning at all!

We can take look at where we are making mistakes: