Deep Learning With Apache Spark: Part 2

In this article I’ll continue the discussion on Deep Learning with Apache Spark. I will focus entirely on the DL pipelines library and how to use it from scratch.

Applying Deep Learning models at scale

Deep Learning Pipelines supports running pre-trained models in a distributed manner with Spark, available in both batch and streaming data processing.

It houses some of the most popular models, enabling users to start using deep learning without the costly step of training a model. The predictions of the model, of course, is done in parallel with all the benefits that come with Spark.

In addition to using the built-in models, users can plug in Keras models and TensorFlow Graphs in a Spark prediction pipeline. This turns any single-node models on single-node tools into one that can be applied in a distributed fashion, on a large amount of data.

The following code creates a Spark prediction pipeline using InceptionV3, a state-of-the-art convolutional neural network (CNN) model for image classification, and predicts what objects are in the images that we just loaded.


Let’s take a look to the predictions dataframe:"predicted_labels").show(truncate=False,n=3)"predicted_labels").show(truncate=False,n=3)
|predicted_labels|                                                                                                                                                                                                                                                                                                                                            |                |
|[[n03930313, picket_fence, 0.1424783], [n11939491, daisy, 0.10951301], [n03991062, pot, 0.04505], [n02206856, bee, 0.03734662], [n02280649, cabbage_butterfly, 0.019011213], [n13133613, ear, 0.017185668], [n02219486, ant, 0.014198389], [n02281406, sulphur_butterfly, 0.013113698], [n12620546, hip, 0.012272579], [n03457902, greenhouse, 0.011370744]]            |
|[[n11939491, daisy, 0.9532104], [n02219486, ant, 6.175268E-4], [n02206856, bee, 5.1203516E-4], [n02190166, fly, 4.0093894E-4], [n02165456, ladybug, 3.70687E-4], [n02281406, sulphur_butterfly, 3.0587992E-4], [n02112018, Pomeranian, 2.9011074E-4], [n01795545, black_grouse, 2.5667972E-4], [n02177972, weevil, 2.4875381E-4], [n07745940, strawberry, 2.3729511E-4]]|
|[[n11939491, daisy, 0.89181453], [n02219486, ant, 0.0012404523], [n02206856, bee, 8.13047E-4], [n02190166, fly, 6.03804E-4], [n02165456, ladybug, 6.005444E-4], [n02281406, sulphur_butterfly, 5.32096E-4], [n04599235, wool, 4.6653638E-4], [n02112018, Pomeranian, 4.625338E-4], [n07930864, cup, 4.400617E-4], [n02177972, weevil, 4.2434104E-4]]                    |
only showing top 3 rows

Notice that the predicted_labels column shows "daisy" as a high probability class for all of sample flowers using this base model, for some reason the tulip was closer to a picket fence than to a flower (maybe because of the background of the photo).

However, as can be seen from the differences in the probability values, the neural network has the information to discern the two flower types. Hence our transfer learning example above was able to properly learn the differences between daisies and tulips starting from the base model.

Let’s see how well our model discern the type of the flower:


For Keras users

For applying Keras models in a distributed manner using Spark, KerasImageFileTransformer works on TensorFlow-backed Keras models. It

  • Internally creates a DataFrame containing a column of images by applying the user-specified image loading and processing function to the input DataFrame containing a column of image URIs
  • Loads a Keras model from the given model file path
  • Applies the model to the image DataFrame

To use the transformer, we first need to have a Keras model stored as a file. For this notebook we’ll just save the Keras built-in InceptionV3 model instead of training one.


Now we will create a Keras transformer but first we will preprocess the images to work with it


We will read now the images and load them into a Spark Dataframe and them use our transformer to apply the model into the images:


If we take a look of this dataframe with predictions we see a lot of informations, and that’s just the probability of each class in the InceptionV3 model.

Working with general tensors

Deep Learning Pipelines also provides ways to apply models with tensor inputs (up to 2 dimensions), written in popular deep learning libraries:

  • TensorFlow graphs
  • Keras models

In this article we will focus only in the Keras models. The KerasTransformerapplies a TensorFlow-backed Keras model to tensor inputs of up to 2 dimensions. It loads a Keras model from a given model file path and applies the model to a column of arrays (where an array corresponds to a Tensor), outputting a column of arrays.
|  predictions|            features|
| [0.86104786]|[-0.76344526, 0.2...|
| [0.21693115]|[0.41084298, 0.93...|
|[0.057743043]|[0.062970825, 0.3...|
| [0.43409333]|[-0.43408343, -1....|
| [0.43690935]|[-0.89413625, 0.8...|
| [0.49984664]|[-0.82052463, -0....|
|  [0.6204273]|[-0.5075533, 0.54...|
|  [0.2285336]|[0.016106872, -0....|
| [0.37478408]|[-1.6756374, 0.84...|
|  [0.2997861]|[-0.34952268, 1.2...|
|  [0.3885377]|[0.1639214, -0.22...|
|  [0.5006814]|[0.91551965, -0.3...|
| [0.20518135]|[-1.2620118, -0.4...|
| [0.18882117]|[-0.14812712, 0.8...|
| [0.49993372]|[1.4617485, -0.33...|
| [0.42390883]|[-0.877813, 0.603...|
|  [0.5232896]|[-0.031451378, -1...|
| [0.45858437]|[0.9310042, -1.77...|
| [0.49794272]|[-0.37061003, -1....|
|  [0.2543479]|[0.41954428, 1.88...|
only showing top 20 rows

Deploying Models in SQL

One way to productionize a model is to deploy it as a Spark SQL User Defined Function, which allows anyone who knows SQL to use it. Deep Learning Pipelines provides mechanisms to take a deep learning model and register a Spark SQL User Defined Function (UDF). In particular, Deep Learning Pipelines 0.2.0 adds support for creating SQL UDFs from Keras models that work on image data.

The resulting UDF takes a column (formatted as a image struct “SpImage”) and produces the output of the given Keras model; e.g. for Inception V3, it produces a real valued score vector over the ImageNet object categories.


In Keras workflows dealing with images, it’s common to have preprocessing steps before the model is applied to the image. If our workflow requires preprocessing, we can optionally provide a preprocessing function to UDF registration. The preprocessor should take in a filepath and return an image array; below is a simple example.


Once a UDF has been registered, it can be used in a SQL query:


This is very powerful. Once a data scientist builds the desired model, Deep Learning Pipelines makes it simple to expose it as a function in SQL, so anyone in their organization can use it — data engineers, data scientists, business analysts, anybody.

sparkdl.registerKerasUDF("awesome_dl_model", "/mymodels/businessmodel.h5")

Next, any user in the organization can apply prediction in SQL:

SELECT image, awesome_dl_model(image) label FROM images 
WHERE contains(label, “Product”)

In the next part I’ll discuss Distributed Hyperparameter Tuning with Spark, and will try new models and examples :).

If you want to contact me make sure to follow me on twitter:

Favio Vázquez (@FavioVaz) | Twitter
The latest Tweets from Favio Vázquez (@FavioVaz). Data Scientist. Physicist and computational engineer. I have a…

and LinkedIn:

Favio Vázquez — Data Scientist / Tools Manager MX — BBVA Data & Analytics | LinkedIn
View Favio Vázquez’s profile on LinkedIn, the world’s largest professional community. Favio has 13 jobs jobs listed on…

Bio: Favio Vazquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. Right now he is working on data science, machine learning and big data as the Principal Data Scientist at Oxxo. Also, he is the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team and having interesting problems to solve. He is part of Apache Spark collaboration, helping in MLlib, Core and the Documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and automatic learning to help the world become a better place.

Original. Reposted with permission.