How to Handle Large Text Inputs with Longformer and Hugging Face Transformers
Learn how to use Longformer for long input text with Hugging Face Transformers.

Image by Editor | Midjourney
Let's learn how to handle large text inputs with a large language model (LLM).
Preparation
Ensure you have the transformers and datasets packages from Hugging Face installed in your environment. If not, you can install them via pip with the following command:
pip install transformers datasets
Additionally, install PyTorch, selecting the build that suits your environment.
With the packages installed, let's get into the next part.
Using Longformer and Hugging Face Transformers
Longformer is a Transformer architecture modified to process longer sequences of text. Standard Transformer models such as BERT are typically limited to around 512 tokens, whereas Longformer replaces full self-attention with local sliding-window attention plus global attention on selected tokens, allowing it to process up to 4,096 tokens.
With Hugging Face Transformers, you can fine-tune the Longformer base model for downstream tasks, such as sequence classification or question answering, that need to accept longer inputs.
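To illustrate the idea, here is a minimal standalone sketch (separate from the fine-tuning walkthrough below) of how Longformer's attention is controlled: a global_attention_mask marks the tokens, such as the [CLS] token, that attend to the whole sequence, while every other token only uses local sliding-window attention. The input text here is purely illustrative.
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Illustrative long input; Longformer accepts up to 4,096 tokens
text = "A very long document. " * 500
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# 1 = global attention, 0 = local sliding-window attention
# Here only the first ([CLS]) token attends globally
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)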
We will start with the IMDB dataset for review sentiment classification.
from datasets import load_dataset
dataset = load_dataset("imdb", split="train[:1000]+test[:1000]")
train_test_split = dataset.train_test_split(test_size=0.5)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']
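As a quick optional sanity check, you can inspect one example to confirm that each split holds the review text and its sentiment label (0 = negative, 1 = positive):
# Peek at one training example and the split sizes
sample = train_dataset[0]
print(sample["label"], sample["text"][:200])
print(len(train_dataset), len(test_dataset))  # 1000 examples each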
After loading the dataset, we will download the Longformer tokenizer using the code below and tokenize the data right away.
from transformers import LongformerTokenizer

# Longformer checkpoint that supports sequences of up to 4,096 tokens
longformer_model = "allenai/longformer-base-4096"
tokenizer = LongformerTokenizer.from_pretrained(longformer_model)

def preprocess_function(examples):
    # Tokenize the review text, padding and truncating to a fixed length
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)

train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)
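The preprocessing above truncates each review at 512 tokens to keep the demo fast. Since Longformer accepts up to 4,096 tokens, you can raise the limit for genuinely long documents at the cost of more memory, which may force a smaller batch size. A variant (the function name preprocess_function_long is just illustrative) could look like this:
# Optional: use Longformer's full 4,096-token context instead of 512
def preprocess_function_long(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=4096)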
Next, we will prepare the model for fine-tuning.
from transformers import DataCollatorWithPadding, LongformerForSequenceClassification

# Dynamic padding collator and a Longformer model with a two-class classification head
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
model = LongformerForSequenceClassification.from_pretrained(longformer_model, num_labels=2)
We use dynamic padding via the data collator and the Longformer sequence classification model, which we then fine-tune with the Trainer API.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
Once training is done, we can evaluate the model.
results = trainer.evaluate()
Output>>
{'eval_loss': 4.9578986363485456e-05, 'eval_runtime': 35.7005, 'eval_samples_per_second': 28.011, 'eval_steps_per_second': 14.005, 'epoch': 1.0}
The result might not be the best since we only train for one epoch, but that's the overall process for fine-tuning Longformer with Hugging Face Transformers.
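The evaluation only reports the loss because no metric function was passed to the Trainer. If you also want accuracy, one option is to supply a compute_metrics function when constructing the Trainer; a minimal sketch using NumPy:
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred holds the raw logits and the true labels for the evaluation set
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Pass compute_metrics=compute_metrics to Trainer(...) before calling trainer.train()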
Let's test the fine-tuned Longformer model.
import torch

# Replace the placeholder with the review text you want to classify
long_review = "your review here"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

inputs = tokenizer(long_review, return_tensors="pt", max_length=512, padding='max_length', truncation=True).to(device)
After tokenizing your review, use the code below to get the predicted sentiment.
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

prediction = logits.argmax(-1)
label = "positive" if prediction.item() == 1 else "negative"
print(label)
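Because the goal is handling long input, you can also run inference with the full 4,096-token window instead of 512. A sketch reusing the model, tokenizer, device, and long_review defined above:
# Tokenize with Longformer's full context so long reviews are not cut off at 512 tokens
inputs = tokenizer(long_review, return_tensors="pt", truncation=True, max_length=4096).to(device)

with torch.no_grad():
    logits = model(**inputs).logits
print("positive" if logits.argmax(-1).item() == 1 else "negative")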
Try to master Longformer if you need a model that can handle long inputs.
Additional Resources
- Inferencing the Transformer Model
- Simple NLP Pipelines with HuggingFace Transformers
- How to Fine-Tune BERT for Sentiment Analysis with Hugging Face Transformers
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social and written media. Cornellius writes on a variety of AI and machine learning topics.