Deep Learning With Apache Spark: Part 2

In this article I’ll continue the discussion on Deep Learning with Apache Spark. I will focus entirely on the DL pipelines library and how to use it from scratch.



Applying Deep Learning models at scale

Deep Learning Pipelines supports running pre-trained models in a distributed manner with Spark, available in both batch and streaming data processing.

It houses some of the most popular models, enabling users to start using deep learning without the costly step of training a model. The predictions of the model, of course, is done in parallel with all the benefits that come with Spark.

In addition to using the built-in models, users can plug in Keras models and TensorFlow Graphs in a Spark prediction pipeline. This turns any single-node models on single-node tools into one that can be applied in a distributed fashion, on a large amount of data.

The following code creates a Spark prediction pipeline using InceptionV3, a state-of-the-art convolutional neural network (CNN) model for image classification, and predicts what objects are in the images that we just loaded.


https://gist.github.com/FavioVazquez/b6e4ab8787f4bd4a7186d858a86c3521

 

Let’s take a look to the predictions dataframe:

predictions_df.select("predicted_labels").show(truncate=False,n=3)


predictions_df.select("predicted_labels").show(truncate=False,n=3)
+----------------+
|predicted_labels|                                                                                                                                                                                                                                                                                                                                            |                |
+----------------+
|[[n03930313, picket_fence, 0.1424783], [n11939491, daisy, 0.10951301], [n03991062, pot, 0.04505], [n02206856, bee, 0.03734662], [n02280649, cabbage_butterfly, 0.019011213], [n13133613, ear, 0.017185668], [n02219486, ant, 0.014198389], [n02281406, sulphur_butterfly, 0.013113698], [n12620546, hip, 0.012272579], [n03457902, greenhouse, 0.011370744]]            |
|[[n11939491, daisy, 0.9532104], [n02219486, ant, 6.175268E-4], [n02206856, bee, 5.1203516E-4], [n02190166, fly, 4.0093894E-4], [n02165456, ladybug, 3.70687E-4], [n02281406, sulphur_butterfly, 3.0587992E-4], [n02112018, Pomeranian, 2.9011074E-4], [n01795545, black_grouse, 2.5667972E-4], [n02177972, weevil, 2.4875381E-4], [n07745940, strawberry, 2.3729511E-4]]|
|[[n11939491, daisy, 0.89181453], [n02219486, ant, 0.0012404523], [n02206856, bee, 8.13047E-4], [n02190166, fly, 6.03804E-4], [n02165456, ladybug, 6.005444E-4], [n02281406, sulphur_butterfly, 5.32096E-4], [n04599235, wool, 4.6653638E-4], [n02112018, Pomeranian, 4.625338E-4], [n07930864, cup, 4.400617E-4], [n02177972, weevil, 4.2434104E-4]]                    |
+----------------+
only showing top 3 rows


Notice that the predicted_labels column shows "daisy" as a high probability class for all of sample flowers using this base model, for some reason the tulip was closer to a picket fence than to a flower (maybe because of the background of the photo).

However, as can be seen from the differences in the probability values, the neural network has the information to discern the two flower types. Hence our transfer learning example above was able to properly learn the differences between daisies and tulips starting from the base model.

Let’s see how well our model discern the type of the flower:


https://gist.github.com/FavioVazquez/271c069453b5917d85aeec0001d54624

 

For Keras users

For applying Keras models in a distributed manner using Spark, KerasImageFileTransformer works on TensorFlow-backed Keras models. It

  • Internally creates a DataFrame containing a column of images by applying the user-specified image loading and processing function to the input DataFrame containing a column of image URIs
  • Loads a Keras model from the given model file path
  • Applies the model to the image DataFrame

To use the transformer, we first need to have a Keras model stored as a file. For this notebook we’ll just save the Keras built-in InceptionV3 model instead of training one.


https://gist.github.com/FavioVazquez/bc7d280cd98a7112cb96f13cded20259

 

Now we will create a Keras transformer but first we will preprocess the images to work with it


https://gist.github.com/FavioVazquez/b1a43d8611e1fd2db9a3c61742156e97

 

We will read now the images and load them into a Spark Dataframe and them use our transformer to apply the model into the images:


https://gist.github.com/FavioVazquez/531c2852f936e4a2cbbe2f4afbad47d5

 

If we take a look of this dataframe with predictions we see a lot of informations, and that’s just the probability of each class in the InceptionV3 model.

Working with general tensors

Deep Learning Pipelines also provides ways to apply models with tensor inputs (up to 2 dimensions), written in popular deep learning libraries:

  • TensorFlow graphs
  • Keras models

In this article we will focus only in the Keras models. The KerasTransformerapplies a TensorFlow-backed Keras model to tensor inputs of up to 2 dimensions. It loads a Keras model from a given model file path and applies the model to a column of arrays (where an array corresponds to a Tensor), outputting a column of arrays.


https://gist.github.com/FavioVazquez/bab4fbf9c39aade9b92dbbea95127cec

 

final_df.show()
+-------------+--------------------+
|  predictions|            features|
+-------------+--------------------+
| [0.86104786]|[-0.76344526, 0.2...|
| [0.21693115]|[0.41084298, 0.93...|
|[0.057743043]|[0.062970825, 0.3...|
| [0.43409333]|[-0.43408343, -1....|
| [0.43690935]|[-0.89413625, 0.8...|
| [0.49984664]|[-0.82052463, -0....|
|  [0.6204273]|[-0.5075533, 0.54...|
|  [0.2285336]|[0.016106872, -0....|
| [0.37478408]|[-1.6756374, 0.84...|
|  [0.2997861]|[-0.34952268, 1.2...|
|  [0.3885377]|[0.1639214, -0.22...|
|  [0.5006814]|[0.91551965, -0.3...|
| [0.20518135]|[-1.2620118, -0.4...|
| [0.18882117]|[-0.14812712, 0.8...|
| [0.49993372]|[1.4617485, -0.33...|
| [0.42390883]|[-0.877813, 0.603...|
|  [0.5232896]|[-0.031451378, -1...|
| [0.45858437]|[0.9310042, -1.77...|
| [0.49794272]|[-0.37061003, -1....|
|  [0.2543479]|[0.41954428, 1.88...|
+-------------+--------------------+
only showing top 20 rows


Deploying Models in SQL

One way to productionize a model is to deploy it as a Spark SQL User Defined Function, which allows anyone who knows SQL to use it. Deep Learning Pipelines provides mechanisms to take a deep learning model and register a Spark SQL User Defined Function (UDF). In particular, Deep Learning Pipelines 0.2.0 adds support for creating SQL UDFs from Keras models that work on image data.

The resulting UDF takes a column (formatted as a image struct “SpImage”) and produces the output of the given Keras model; e.g. for Inception V3, it produces a real valued score vector over the ImageNet object categories.


https://gist.github.com/FavioVazquez/3a36edf25a289f4ee31cff1bf3857467

 

In Keras workflows dealing with images, it’s common to have preprocessing steps before the model is applied to the image. If our workflow requires preprocessing, we can optionally provide a preprocessing function to UDF registration. The preprocessor should take in a filepath and return an image array; below is a simple example.


https://gist.github.com/FavioVazquez/a02094a5848ab1f7e42ce52820a09fbe

 

Once a UDF has been registered, it can be used in a SQL query:


https://gist.github.com/FavioVazquez/af566a98d19952eb0b61938c4752f7dc

 

This is very powerful. Once a data scientist builds the desired model, Deep Learning Pipelines makes it simple to expose it as a function in SQL, so anyone in their organization can use it — data engineers, data scientists, business analysts, anybody.

sparkdl.registerKerasUDF("awesome_dl_model", "/mymodels/businessmodel.h5")


Next, any user in the organization can apply prediction in SQL:

SELECT image, awesome_dl_model(image) label FROM images 
WHERE contains(label, “Product”)


In the next part I’ll discuss Distributed Hyperparameter Tuning with Spark, and will try new models and examples :).

If you want to contact me make sure to follow me on twitter:

Favio Vázquez (@FavioVaz) | Twitter
The latest Tweets from Favio Vázquez (@FavioVaz). Data Scientist. Physicist and computational engineer. I have a…twitter.com

and LinkedIn:

Favio Vázquez — Data Scientist / Tools Manager MX — BBVA Data & Analytics | LinkedIn
View Favio Vázquez’s profile on LinkedIn, the world’s largest professional community. Favio has 13 jobs jobs listed on…www.linkedin.com

 
Bio: Favio Vazquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. Right now he is working on data science, machine learning and big data as the Principal Data Scientist at Oxxo. Also, he is the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team and having interesting problems to solve. He is part of Apache Spark collaboration, helping in MLlib, Core and the Documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and automatic learning to help the world become a better place.

Original. Reposted with permission.

Related: