Spark NLP 101: LightPipeline
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. Now let’s see how this can be done in Spark NLP using Annotators and Transformers.
By Veysel Kocaman, Data Scientist & ML Researcher
This is the second article in a series in which we are going to write a separate article for each annotator in the Spark NLP library. You can find all the articles at this link.
This article is mainly built on top of Introduction to Spark NLP: Foundations and Basic Components (Part-I). Please read that first if you want to learn more about Spark NLP and its underlying concepts.
In machine learning, it is common to run a sequence of algorithms to process and learn from data. This sequence is usually called a Pipeline.
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. That is, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage. With the help of Pipelines, we can ensure that training and test data go through identical feature processing steps.
Now let’s see how this can be done in Spark NLP using Annotators and Transformers. Assume that we have the following steps that need to be applied one by one on a data frame.
- Split text into sentences
- Get word embeddings
And here is how we code this pipeline up in Spark NLP.
I am going to load a dataset and then feed it into this pipeline to let you see how it works.
After running the pipeline above, we get a trained pipeline model. Let's transform the entire DataFrame.
It took 501 ms to transform the first 20 rows. If we transformed the entire data frame, it would take about 11 seconds.
It looks good. But what if we want to save this pipeline to disk and then deploy it to get runtime transformations on a given line of text (one row)?
Transforming a single short line of text took 515 ms, nearly as long as transforming the first 20 rows. That's not good. Actually, this is what happens when you try to use distributed processing on small data. Distributed processing and cluster computing are mainly useful for processing large amounts of data (aka big data). Using Spark for small data would be like bringing an axe to a fistfight :-)
That said, due to its inner mechanism and optimized architecture, Spark can still be useful for average-sized data that could be handled on a single machine. But when it comes to processing just a few lines of text, it's not recommended, unless you use Spark NLP.
An analogy may help: Spark is like a locomotive racing a bicycle. The bike wins if the load is light, since it is quicker to accelerate and more agile. With a heavy load, the locomotive takes a while to get up to speed, but it will be faster in the end.
So, what are we going to do if we want to have a faster inference time? Here comes LightPipeline.
LightPipelines are Spark NLP specific pipelines, equivalent to Spark ML Pipelines but meant to deal with smaller amounts of data. They're useful when working with small datasets, debugging results, or running training or prediction from an API that serves one-off requests.
Spark NLP LightPipelines are Spark ML pipelines converted into a single-machine, multi-threaded task, becoming more than 10x faster for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a trained (fitted) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame to feed it into a pipeline that accepts a DataFrame as input in the first place. This feature is quite useful when getting a prediction for a few lines of text from a trained ML model.
Here are the available methods in LightPipeline. As you can see, we can also use a list of strings as input text.
LightPipelines are easy to create and also save you from dealing with Spark Datasets. They are also very fast and, while working only on the driver node, they execute parallel computation. Let’s see how it applies to our case described above:
Now it takes 28 ms! Nearly 20x faster than using Spark ML Pipeline.
As you can see, annotate() returns only the result attributes. Since the embedding array is stored under the embeddings attribute of the WordEmbeddingsModel annotator, we set parse_embeddings=True to parse the embedding array. Otherwise, we would only get the tokens attribute from embeddings in the output. For more information about these attributes, see here.
If we want to retrieve the full information of the annotations, we can also use fullAnnotate() to return a list of dictionaries with the entire annotation content.
fullAnnotate() returns the content and metadata in Annotation type. According to documentation, the Annotation type has the following attributes:
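The fields below are sketched here as a plain dataclass purely for illustration; the real class lives inside the sparknlp package, but its attributes carry the same names and meanings.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    annotator_type: str  # type of the output, e.g. "document", "token", "word_embeddings"
    begin: int           # character offset where the matched content begins
    end: int             # character offset where the matched content ends
    result: str          # the main output, e.g. the token or sentence text
    metadata: dict = field(default_factory=dict)    # extra key/value information
    embeddings: list = field(default_factory=list)  # the vector, when applicable
```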
So, if we want to get the beginning and end of any sentence, we can just write:
You can even get metadata for each token with respect to embeddings.
Unfortunately, we cannot get anything from non-Spark NLP annotators via LightPipeline. That is, when we use a Spark ML feature like word2vec inside a pipeline along with Spark NLP annotators, annotate() only returns the results from the Spark NLP annotations, as there is no result field coming out of Spark ML models. So, LightPipeline will not return anything from non-Spark NLP annotators, at least for now. We plan to write a Spark NLP wrapper for all the ML models in Spark ML soon. Then we will be able to use LightPipeline for machine learning use cases in which we train a model in Spark NLP and then deploy it to get faster runtime predictions.
Spark NLP LightPipelines are Spark ML pipelines converted into a single-machine, multi-threaded task, becoming more than 10x faster for smaller amounts of data. In this article, we talked about how to convert your Spark pipelines into Spark NLP LightPipelines to get a faster response for small data. This is one of the coolest features of Spark NLP: you get to enjoy the power of Spark while processing and training, and then get faster inference through LightPipelines, as if it all ran on a single machine.
We hope that you already read the previous articles on our official Medium page, and started to play with Spark NLP. Here are the links for the other articles. Don’t forget to follow our page and stay tuned!
** These articles are also being published on John Snow Labs' official blog page.
Bio: Veysel Kocaman is a Senior Data Scientist and ML Engineer with a decade of industry experience. He is currently working on his PhD in CS at Leiden University (NL) and holds an MS degree in Operations Research from Penn State University.
Original. Reposted with permission.
- The World’s Most Widely Used NLP Library In The Enterprise
- How to Create a Vocabulary for NLP Tasks in Python
- Text Encoding: A Review