How to Summarize Scientific Papers Using the BART Model with Hugging Face Transformers
Learn how to perform paper summarization with BART.

Scientific papers can be hard to digest: they are long and densely structured, so it is often unclear where to start reading. Luckily, we can use language models to simplify the process by summarizing them.
In this article, we will explore how to summarize scientific papers using the BART Model. So, let’s get into it.
Preparation
To follow the tutorial, we will need to install the following packages.
pip install transformers pymupdf
You will also need PyTorch; install the build that matches your environment (CPU or CUDA).
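For a CPU-only setup, the command below is usually enough; for GPU support, use the install selector on the PyTorch website instead.
pip install torch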
With the packages installed, we can move on to the next part.
Scientific Paper Summarization with BART
BART (Bidirectional and Auto-Regressive Transformers) is a transformer-based neural network model developed by Facebook (now Meta) for sequence-to-sequence tasks such as summarization.
BART's architecture pairs a bidirectional encoder, which builds a representation of the whole input text, with an autoregressive decoder that generates the output sequence token by token. The model is pretrained by corrupting input text with noise and learning to reconstruct the original.
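As a minimal sketch (using the same facebook/bart-large-cnn checkpoint we load later), you can confirm this encoder-decoder layout by inspecting the model configuration.
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
print(model.config.is_encoder_decoder)                           # True
print(model.config.encoder_layers, model.config.decoder_layers)  # 12 12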
Since BART works well for summarization, we will try it on a scientific paper. For this tutorial, we will use the PDF of the Attention Is All You Need paper.
First, let’s extract all the text from the scientific paper using the following code.
import fitz  # PyMuPDF

def extract_paper_text(pdf_path):
    """Extract the raw text from every page of a PDF."""
    text = ""
    doc = fitz.open(pdf_path)
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

pdf_path = "attention_is_all_you_need.pdf"
cleaned_text = extract_paper_text(pdf_path)
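A quick sanity check (the exact numbers will vary with your PDF) confirms the extraction worked.
print(len(cleaned_text))   # total characters extracted
print(cleaned_text[:200])  # the opening of the paper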
All the text has been extracted, so we can pass it to the BART model for summarization. In the code below, we split the text into fixed-size character chunks, summarize each chunk, and join the chunk summaries so the output stays coherent.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def summarize_text(text, model, tokenizer, max_chunk_size=1024):
    # Split the text into character-based chunks; the tokenizer will
    # truncate any chunk that exceeds BART's 1024-token input limit.
    chunks = [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]
    summaries = []
    for chunk in chunks:
        inputs = tokenizer(chunk, max_length=max_chunk_size, return_tensors="pt", truncation=True)
        summary_ids = model.generate(
            inputs["input_ids"],
            max_length=200,     # cap each chunk summary at 200 tokens
            min_length=50,
            length_penalty=2.0,
            num_beams=4,
            early_stopping=True,
        )
        summaries.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
    return " ".join(summaries)

summary = summarize_text(cleaned_text, model, tokenizer)
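If you want to inspect the intermediate result, print the chunk-level summary; its length will vary with the paper.
print(len(summary.split()), "words in the first-pass summary")
print(summary[:300])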
The result will be a long summary, since the model generates up to 200 tokens for every 1024-character chunk. To tighten the output, we can perform hierarchical summarization: feed the first-pass summary back into the model and summarize it again.
To do that, we add the following function.
def hierarchical_summarization(text, model, tokenizer, max_chunk_size=1024):
    # First pass: summarize each chunk of the full text.
    first_level_summary = summarize_text(text, model, tokenizer, max_chunk_size)
    # Second pass: summarize the concatenated chunk summaries.
    inputs = tokenizer(first_level_summary, max_length=max_chunk_size, return_tensors="pt", truncation=True)
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=200,
        min_length=50,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True,
    )
    final_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return final_summary

final_summary = hierarchical_summarization(cleaned_text, model, tokenizer)
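Print the result to see the final summary.
print(final_summary)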
Output:
The Transformer is the first transduction model relying solely on self-attention to compute representations. It can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. The attention function can be described as mapping a query and a set of key-value pairs to an output.
The summarization result is quite good, and it captures a few of the paper's main points. You can experiment with the chunk size to improve the summarization quality, as in the sketch below.
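As a rough sketch (the chunk sizes here are illustrative, and remember that chunks are measured in characters, not tokens), you could compare a few settings side by side.
for size in (512, 768, 1024):
    candidate = hierarchical_summarization(cleaned_text, model, tokenizer, max_chunk_size=size)
    print(f"chunk size {size}: {candidate[:120]}...")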
I hope this has helped!
Additional Resources
- Using Hugging Face Transformers with PyTorch and TensorFlow
- How to Summarize Texts Using the BART Model with Hugging Face Transformers
- How to Build and Train a Transformer Model from Scratch with Hugging Face Transformers
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.