Simple NLP Pipelines with HuggingFace Transformers

Transformers by HuggingFace is an all-encompassing library with state-of-the-art pre-trained models and easy-to-use tools.




 

Transformers by HuggingFace is an extensive package providing APIs and user-friendly tools to work with state-of-the-art pretrained models across language, vision, audio, and multi-modal domains. It includes more than 170 pretrained models and supports frameworks such as PyTorch, TensorFlow, and JAX, with the ability to interoperate among them in code. The library is also deployment-friendly, as it allows models to be converted to ONNX and TorchScript formats.

In this blog, we will particularly explore the pipelines functionality of transformers, which can be easily used for inference. Pipelines provide an abstraction over the complicated code and offer a simple API for several tasks, such as Text Summarization, Question Answering, Named Entity Recognition, Text Generation, and Text Classification, to name a few. The best thing about these APIs is that all the steps from preprocessing to model inference can be performed with just a few lines of code, without requiring heavy computational resources. 

Now, let’s dive right into it! 

The first step is to install the transformers package with the following command: 

!pip install transformers

 

Next, we will use the pipeline structure to implement different tasks. 

from transformers import pipeline

 

The pipeline allows you to specify multiple parameters, such as the task, model, device, batch size, and other task-specific parameters. 
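As a quick sketch of those parameters (using the sentiment model that appears later in this post; device=-1 forces CPU, while device=0 would select the first GPU):

```python
from transformers import pipeline

# Explicitly set the task, model, device, and batch size.
# device=-1 runs inference on CPU; device=0 would use the first GPU.
classifier = pipeline(
    task="text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=-1,
    batch_size=8,
)
print(classifier("Pipelines make inference easy."))
```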

Let’s begin with the first task. 

 

1. Text Summarization

 

The input to this task is a corpus of text, and the model will output a summary of it based on the expected length mentioned in the parameters. Here, we have set the minimum length to 5 and the maximum length to 30. 

summarizer = pipeline(
    "summarization", model="t5-base", tokenizer="t5-base", framework="tf"
)

input_text = "Parents need to know that Top Gun is a blockbuster 1980s action thriller starring Tom Cruise that's chock full of narrow escapes, chases, and battles. But there are also violent and upsetting scenes, particularly the death of a main character, which make it too intense for younger kids. There's also one graphic-for-its-time sex scene (though no explicit nudity) and quite a few shirtless men in locker rooms and, in one iconic sequence, on a beach volleyball court. Winning is the most important thing to all the pilots, who try to intimidate one another with plenty of posturing and banter -- though when push comes to shove, loyalty and friendship have important roles to play, too. While sexism is noticeable and almost all characters are men, two strong women help keep some of the objectification in check."

summarizer(input_text, min_length=5, max_length=30)

 

Output: 

[
    {
        "summary_text": "1980s action thriller starring Tom Cruise is chock-full of escapes, chases, battles "
    }
]

 

One can also choose from other models that have been fine-tuned for the summarization task, such as bart-large-cnn, t5-small, t5-large, t5-3b, and t5-11b. You can check out the complete list of available models here.
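Swapping checkpoints only requires changing the model and tokenizer arguments. Here is a sketch using the lighter t5-small (with the default PyTorch backend instead of framework="tf"):

```python
from transformers import pipeline

# Same pipeline call, smaller checkpoint: t5-small trades some
# summary quality for a much faster download and inference.
summarizer = pipeline("summarization", model="t5-small", tokenizer="t5-small")

text = (
    "Top Gun is a blockbuster 1980s action thriller starring Tom Cruise. "
    "It is full of narrow escapes, chases, and battles, and loyalty and "
    "friendship have important roles to play."
)
print(summarizer(text, min_length=5, max_length=30)[0]["summary_text"])
```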

 

2. Question Answering 

 

In this task, we provide a question and a context. The model chooses the answer from the context based on the highest probability score, and it also returns the starting and ending positions of the answer within the context.

qa_pipeline = pipeline(model="deepset/roberta-base-squad2")

qa_pipeline(
    question="Where do I work?",
    context="I work as a Data Scientist at a lab in University of Montreal. I like to develop my own algorithms.",
)

 

Output:

{
    "score": 0.6422629356384277,
    "start": 39,
    "end": 61,
    "answer": "University of Montreal",
}
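The same pipeline can also return several candidate spans; top_k is a parameter of the question-answering pipeline that controls how many answers come back:

```python
from transformers import pipeline

qa_pipeline = pipeline(model="deepset/roberta-base-squad2")

# With top_k > 1 the pipeline returns a list of the k highest-scoring
# answer spans, each with its own score and character offsets.
answers = qa_pipeline(
    question="Where do I work?",
    context="I work as a Data Scientist at a lab in University of Montreal. "
    "I like to develop my own algorithms.",
    top_k=2,
)
for a in answers:
    print(f'{a["answer"]} (score={a["score"]:.3f})')
```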

 

Refer here to check the full list of available models for the Question-Answering task.

 

3. Named Entity Recognition

 

Named Entity Recognition deals with identifying and classifying words as names of persons, organizations, locations, and so on. The input is a sentence, and the model determines each named entity along with its category and its corresponding location in the text. 

ner_classifier = pipeline(
    model="dslim/bert-base-NER-uncased", aggregation_strategy="simple"
)
sentence = "I like to travel in Montreal."
entity = ner_classifier(sentence)
print(entity)

 

Output:

[
    {
        "entity_group": "LOC",
        "score": 0.9976745,
        "word": "montreal",
        "start": 20,
        "end": 28,
    }
]
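Since each result carries an entity_group, the output is easy to post-process. A small sketch that buckets entities by category (the example sentence here is our own):

```python
from transformers import pipeline

ner_classifier = pipeline(
    model="dslim/bert-base-NER-uncased", aggregation_strategy="simple"
)

# Bucket the detected entities by their category (PER, LOC, ORG, ...).
by_group = {}
for ent in ner_classifier("Sarah moved from London to Montreal."):
    by_group.setdefault(ent["entity_group"], []).append(ent["word"])
print(by_group)
```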

 

Check out here for other options of available models. 

 

4. Part-of-Speech Tagging 

 

PoS Tagging classifies each word in a text by its part of speech, such as noun, pronoun, verb, and so on. The model returns the PoS-tagged words along with their probability scores and respective locations. 

pos_tagger = pipeline(
    model="vblagoje/bert-english-uncased-finetuned-pos",
    aggregation_strategy="simple",
)
pos_tagger("I am an artist and I live in Dublin")

 

Output:

[
    {
        "entity_group": "PRON",
        "score": 0.9994804,
        "word": "i",
        "start": 0,
        "end": 1,
    },
    {
        "entity_group": "VERB",
        "score": 0.9970591,
        "word": "live",
        "start": 2,
        "end": 6,
    },
    {
        "entity_group": "ADP",
        "score": 0.9993111,
        "word": "in",
        "start": 7,
        "end": 9,
    },
    {
        "entity_group": "PROPN",
        "score": 0.99831414,
        "word": "dublin",
        "start": 10,
        "end": 16,
    },
]
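The raw output can be reduced to simple (word, tag) pairs; a sketch:

```python
from transformers import pipeline

pos_tagger = pipeline(
    model="vblagoje/bert-english-uncased-finetuned-pos",
    aggregation_strategy="simple",
)

# Keep only the word and its part-of-speech tag from each result.
tags = [
    (t["word"], t["entity_group"])
    for t in pos_tagger("I am an artist and I live in Dublin")
]
print(tags)
```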

 

5. Text Classification 

 

We will perform sentiment analysis and classify the text based on its tone.

text_classifier = pipeline(
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
text_classifier("This movie is horrible!")

 

Output: 

[{'label': 'NEGATIVE', 'score': 0.9997865557670593}]

 

Let’s try a few more examples. 

text_classifier("I loved the narration of the movie!")

 

Output:

[{'label': 'POSITIVE', 'score': 0.9998612403869629}] 
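Pipelines also accept a list of inputs and return one prediction per input, in order. Using the two example sentences above:

```python
from transformers import pipeline

text_classifier = pipeline(
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# A list input yields one {label, score} dict per sentence, in order.
reviews = ["This movie is horrible!", "I loved the narration of the movie!"]
for review, result in zip(reviews, text_classifier(reviews)):
    print(review, "->", result["label"])
```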

 

The full list of models for text classification can be found here. 

 

6. Text Generation

 

Given a prompt, the model generates a continuation of the input text. Passing do_sample=False makes the decoding greedy, so the output is deterministic.

text_generator = pipeline(model="gpt2")
text_generator("If it is sunny today then ", do_sample=False)

 

Output:

[
    {
        "generated_text": "If it is sunny today then \xa0it will be cloudy tomorrow."
    }
]
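Generation can be steered with extra parameters: max_new_tokens caps the length of the continuation, and with sampling enabled, num_return_sequences produces several variants of the same prompt:

```python
from transformers import pipeline

text_generator = pipeline(model="gpt2")

# do_sample=True enables sampling, which num_return_sequences needs
# to produce distinct continuations; max_new_tokens caps their length.
outputs = text_generator(
    "If it is sunny today then",
    max_new_tokens=15,
    do_sample=True,
    num_return_sequences=2,
)
for out in outputs:
    print(out["generated_text"])
```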

 

Access the full list of models for text generation here.

 

7. Text Translation

 

Here, we will translate text from one language to another. For example, we have chosen translation from English to French. We have used the basic t5-small model, but you can access other advanced models here.

en_fr_translator = pipeline("translation_en_to_fr", model='t5-small')
en_fr_translator("Hi, How are you?")

 

Output:

[{'translation_text': 'Bonjour, Comment êtes-vous ?'}]
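t5-small was also trained on English-to-German and English-to-Romanian translation, so the same checkpoint can back other translation pipelines by changing only the task string; a sketch:

```python
from transformers import pipeline

# t5-small covers English->French, English->German, and
# English->Romanian, so only the task string needs to change.
en_de_translator = pipeline("translation_en_to_de", model="t5-small")
print(en_de_translator("The weather is nice today."))
```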

 

Conclusion

 

You've reached the end, awesome! If you have followed along, you learned how to create basic NLP pipelines with Transformers. Refer to the official documentation by HuggingFace to check out other interesting applications in NLP, such as Zero-Shot Text Classification or Table Question Answering. To work with your own datasets or implement models from other domains such as vision, audio, or multimodal, check out here.

 
 
Yesha Shastri is a passionate AI developer and writer pursuing a Master's in Machine Learning at Université de Montréal. Yesha is keen to explore responsible AI techniques to solve challenges that benefit society and to share her learnings with the community.