How to Handle Large Text Inputs with Longformer and Hugging Face Transformers

Learn how to use Longformer for long input text with Hugging Face Transformers.



Image by Editor | Midjourney

 

Let’s learn how to handle large text inputs to a language model using Longformer and Hugging Face Transformers.
 

Preparation

 
Ensure you have the Transformers and Datasets packages from Hugging Face installed in your environment. If not, you can install them via pip with the following command:

pip install transformers datasets

 

Additionally, you should install PyTorch, selecting the version suitable for your environment.
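For example, a basic install can be done with the command below, but it is best to check pytorch.org for the exact command matching your OS and CUDA setup.

pip install torch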

With the packages installed, we will get into the next part.

 

Using Longformer and Hugging Face Transformers

 

Longformer is a Transformer architecture modified to process longer sequences of text. Standard Transformer models such as BERT are usually limited to around 512 tokens, but Longformer replaces full self-attention with a sliding-window (local) attention pattern plus a handful of global tokens, allowing it to process inputs of up to 4,096 tokens.

With Hugging Face Transformers, we can use the Longformer base model for downstream tasks that need to accept longer inputs, such as text classification.
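As a quick, optional sanity check, you can compare the maximum input length each tokenizer reports. The bert-base-uncased checkpoint below is used here only as a familiar point of comparison; the Longformer checkpoint is the one we use throughout this tutorial.

from transformers import AutoTokenizer

# BERT-style tokenizers report a 512-token limit, Longformer reports 4096
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
longformer_tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

print(bert_tokenizer.model_max_length)        # 512
print(longformer_tokenizer.model_max_length)  # 4096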

We will start by using an IMDB sample dataset for review classification.

from datasets import load_dataset

dataset = load_dataset("imdb", split="train[:1000]+test[:1000]")

train_test_split = dataset.train_test_split(test_size=0.5)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']
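Optionally, you can take a quick look at the splits to confirm what we are working with; this short check just prints the split sizes and a snippet of one review.

# Optional sanity check: split sizes and a peek at one example
print(len(train_dataset), len(test_dataset))   # 1000 1000
print(train_dataset[0]["label"], train_dataset[0]["text"][:100])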

 

After downloading the dataset, we will load the Longformer tokenizer using the code below and tokenize the data immediately.

from transformers import LongformerTokenizer

longformer_model = "allenai/longformer-base-4096"
tokenizer = LongformerTokenizer.from_pretrained(longformer_model)

# Truncate at 512 tokens here to keep the demo light; Longformer itself
# accepts inputs of up to 4,096 tokens
def preprocess_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)

train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)
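If your documents are longer than 512 tokens and your hardware allows, you can raise max_length up to Longformer's 4,096-token window. The function below is just an illustrative variant of the preprocessing above, not part of the original setup.

# Illustrative variant: keep up to the full 4,096-token window
# (noticeably slower and more memory-hungry during training)
def preprocess_function_long(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=4096)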

 

Next, we will prepare the model for fine-tuning.

from transformers import DataCollatorWithPadding, LongformerForSequenceClassification

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
model = LongformerForSequenceClassification.from_pretrained(longformer_model, num_labels=2)

 

We use dynamic padding via DataCollatorWithPadding (which pads each batch to its longest sequence) together with the LongformerForSequenceClassification model, then set up the training arguments and Trainer to fine-tune.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
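After trainer.train() finishes, you can optionally save the fine-tuned model and tokenizer for later reuse; the output path below is just an example.

# Optionally persist the fine-tuned model and tokenizer (example path)
trainer.save_model("./longformer-imdb")
tokenizer.save_pretrained("./longformer-imdb")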

 

Once the training is done, we can evaluate the model.

results = trainer.evaluate()

 

Output>>
{'eval_loss': 4.9578986363485456e-05, 'eval_runtime': 35.7005, 'eval_samples_per_second': 28.011, 'eval_steps_per_second': 14.005, 'epoch': 1.0}

 

The result might not be the best since we only train for a single epoch, but that’s the overall process for fine-tuning Longformer with Hugging Face Transformers.
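Note that trainer.evaluate() above only reports the loss. If you also want accuracy, one option is to define a compute_metrics function and pass it to the Trainer via compute_metrics=compute_metrics; a minimal sketch:

import numpy as np

# Optional metric function: pass compute_metrics=compute_metrics when
# constructing the Trainer to get eval_accuracy alongside eval_loss
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}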

Let's test the fine-tuned Longformer model.

import torch

long_review = "your review here"  # replace with a long review text

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

inputs = tokenizer(long_review, return_tensors="pt", max_length=512, padding='max_length', truncation=True).to(device)

 

Replace long_review with your own (long) review text and tokenize it as shown above, then run the code below to see the predicted class.

model.eval()

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    prediction = logits.argmax(-1)

label = "positive" if prediction.item() == 1 else "negative"
print(label)
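Longformer combines sliding-window (local) attention with a small number of global tokens that attend to the whole sequence. LongformerForSequenceClassification places global attention on the first ([CLS]) token automatically when no mask is supplied, but you can also set it explicitly. The sketch below reuses the inputs and model objects from above.

import torch

# Explicit global attention mask: 0 = local attention, 1 = global attention
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the CLS token global attention

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)
    prediction = outputs.logits.argmax(-1)

print("positive" if prediction.item() == 1 else "negative")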

 

Take some time to master Longformer if you need a model that can handle long inputs.

 

Additional Resources

 

 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.

