Getting Started with Automated Text Summarization

This article will walk through an extractive text summarization process, using a simple word frequency approach, implemented in Python.




Automated text summarization refers to performing the summarization of a document or documents using some form of heuristics or statistical methods. A summary in this case is a shortened piece of text which accurately captures and conveys the most important and relevant information contained in the document or documents we want summarized.

There are two categories of summarization techniques: extractive and abstractive. We will focus here on extractive methods, which work by identifying the important sentences or excerpts from the text and reproducing them verbatim as part of the summary. No new text is generated; only existing text is used in the summarization process. This differs from abstractive methods, which employ more powerful natural language processing techniques to interpret text and generate new summary text.

This article will walk through an extractive summarization process, using a simple word frequency approach, implemented in Python. Before we begin, note that we are not spending much energy on data preprocessing, tokenization, normalization, etc. in this article (similar to last time), nor are we introducing any libraries which are able to easily and effectively perform these tasks. I want to focus on presenting the text summarization steps, mostly glossing over other important concepts. I am planning a number of follow-ups to this piece, and we will add increasing complexity to our NLP tasks as we go.

For example, since we do perform some minimal tokenization here out of necessity, you will get a feel for where it happens, and doing it more effectively can be left as an exercise for the reader.

Let's be clear about what we are going to do here:

  • Take textual input (a short news article)
  • Perform minimal text preprocessing
  • Create a data representation
  • Perform summarization using this data representation

As noted above, there are a number of ways of performing text summarization; we will be using a very basic extractive method based on word frequencies within the given article.

As we are not leaning on libraries for almost anything, our imports are few:

from collections import Counter 
from string import punctuation
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stop_words  # scikit-learn's built-in English stop word list


We need punctuation and stop_words so we can identify these tokens when scoring our words and, ultimately, our sentences for their perceived importance; we will deem neither punctuation nor stop words "important" for this task. Why? Unlike a language modeling task, where these tokens would unquestionably be useful, or perhaps a text classification task, including frequently occurring stop words or repetitive punctuation here would simply bias our scores toward those tokens while providing no benefit. There are plenty of scenarios in which we would not want to exclude stop words (their arbitrary removal should generally be avoided), but this does not seem to be one of them.
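To see that bias in miniature, here is a throwaway example (not part of our pipeline) counting tokens in a short sentence with and without the stop word and punctuation filter:

# Throwaway illustration: raw token counts vs. counts with the filter applied
sample = "the committee said the hearing on the report was the first hearing."

raw_counts = Counter(sample.lower().split())
filtered_counts = Counter(
    w.strip(punctuation) for w in sample.lower().split()
    if w.strip(punctuation) not in stop_words
)

print(raw_counts.most_common(3))       # 'the' dominates the raw counts
print(filtered_counts.most_common(3))  # content words like 'hearing' rise to the top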

Next, we need some text to test our summarization technique on. I manually copied and pasted this one from CNN, but feel free to find your own:

# https://www.cnn.com/2019/11/26/politics/judiciary-committee-hearing/index.html

text = """
The House Judiciary Committee has invited President Donald Trump or his counsel to participate in the panel's first impeachment hearing next week as the House moves another step closer to impeaching the President. 
The committee announced that it would hold a hearing December 4 on the "constitutional grounds for presidential impeachment," with a panel of expert witnesses testifying.
House Judiciary Chairman Jerry Nadler sent a letter to Trump on Tuesday notifying him of the hearing and inviting the President or his counsel to participate, including asking questions of the witnesses.
"I write to ask if you or your counsel plan to attend the hearing or make a request to question the witness panel," the New York Democrat wrote.
In the letter, Nadler said the hearing would "serve as an opportunity to discuss the historical and constitutional basis of impeachment, as well as the Framers' intent and understanding of terms like 'high crimes and misdemeanors.' "
"We expect to discuss the constitutional framework through which the House may analyze the evidence gathered in the present inquiry," Nadler added. "We will also discuss whether your alleged actions warrant the House's exercising its authority to adopt articles of impeachment."
The Judiciary Committee hearing is the latest sign that House Democrats are moving forward with impeachment proceedings against the President following the two-month investigation led by the House Intelligence Committee into allegations that Trump pushed Ukraine to investigate his political rivals while a White House meeting and $400 million in security aid were withheld from Kiev.
The hearing announcement comes as the Intelligence Committee plans to release its report summarizing the findings of its investigation to the House Judiciary Committee soon after Congress returns from its Thanksgiving recess next week.
Democratic aides declined to say what additional hearings they will schedule as part of the impeachment proceedings.
The Judiciary Committee is expected to hold multiple hearings related to impeachment, and the panel would debate and approve articles of impeachment before a vote on the House floor.
The aides said the first hearing was a "legal hearing" that would include some history of impeachment, as well as evaluating the seriousness of the allegations and the evidence against the President.
Nadler asked Trump to respond by Sunday on whether the White House wanted to participate in the hearings, as well as who would act as the President's counsel for the proceedings. The letter was copied to White House Counsel Pat Cipollone.
"""


Did I say we weren't tokenizing? Well, we are. Poorly. But let's not focus on that right now. We will need 2 simple tokenizing functions: one for tokenizing sentences into words, and another for tokenizing documents into sentences:

def tokenizer(s):
    # Crudely split on spaces, lowercasing and stripping whitespace from each token
    tokens = []
    for word in s.split(' '):
        tokens.append(word.strip().lower())
    return tokens

def sent_tokenizer(s):
    # Crudely split a document into sentences on periods
    sents = []
    for sent in s.split('.'):
        sents.append(sent.strip())
    return sents


We need individual words in order to determine their relative frequency in the document and assign a corresponding score; we need individual sentences so that we can subsequently sum the scores of the words within each one to determine sentence "importance."

Note that we are using "importance" here as a synonym for relative word frequency in the document: we will divide the number of occurrences of each word by the number of occurrences of the word which occurs most in the document. Does such high frequency equal genuine importance? It is naive to assume that it does, but it's also the simplest way to introduce the concept of text summarization. Interested in challenging our assumption of "importance" here? Try something like TF-IDF or word embeddings instead.
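For the curious, here is a rough sketch of what the TF-IDF route could look like with scikit-learn's TfidfVectorizer. It is not the method used in this article, and the docs list below is only a placeholder corpus, since IDF is only meaningful across multiple documents:

from sklearn.feature_extraction.text import TfidfVectorizer

# Rough sketch only: TF-IDF weights as a stand-in for simple relative frequency
docs = ["first placeholder document ...", "second placeholder document ..."]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(docs)

# TF-IDF weight of each word in the first document, usable in place of freq_dist
# (use get_feature_names() instead on older scikit-learn versions)
weights = dict(zip(vectorizer.get_feature_names_out(), tfidf[0].toarray()[0]))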

Okay, let's tokenize:

tokens = tokenizer(text)
sents = sent_tokenizer(text)

print(tokens)
print(sents)


['the', 'house', 'judiciary', 'committee', 'has', 'invited', 'president', 'donald', 'trump', 'or', 'his', 'counsel', 'to', 'participate', 'in', 'the', "panel's", 'first', 'impeachment', 'hearing', 'next', 'week', 'as', 'the', 
'house', 'moves', 'another', 'step', 'closer', 'to', 'impeaching', 'the', 'president.', 'the', 'committee', 'announced', 'that', 'it', 'would', 'hold', 'a', 'hearing', 'december', '4', 'on', 'the', '"constitutional', 'grounds', 'for', 

...

'the', 'white', 'house', 'wanted', 'to', 'participate', 'in', 'the', 'hearings,', 'as', 'well', 'as', 'who', 'would', 'act', 'as', 'the', "president's", 'counsel', 'for', 'the', 'proceedings.', 'the', 'letter', 'was', 'copied', 'to',
'white', 'house', 'counsel', 'pat', 'cipollone.']

["The House Judiciary Committee has invited President Donald Trump or his counsel to participate in the panel's first impeachment hearing next week as the House moves another step closer to impeaching the President", 'The committee
announced that it would hold a hearing December 4 on the "constitutional grounds for presidential impeachment," with a panel of expert witnesses testifying', 'House Judiciary Chairman Jerry Nadler sent a letter to Trump on Tuesday 

...

seriousness of the allegations and the evidence against the President', "Nadler asked Trump to respond by Sunday on whether the White House wanted to participate in the hearings, as well as who would act as the President's counsel for the
proceedings", 'The letter was copied to White House Counsel Pat Cipollone', '']


Don't look too closely if you are following along at home, or else you will see where our simple tokenization approach fails. Moving on...

Now we need to count the occurrences of each word in the document.

def count_words(tokens):
    # Count occurrences of each token, ignoring stop words and punctuation
    word_counts = {}
    for token in tokens:
        if token not in stop_words and token not in punctuation:
            if token not in word_counts.keys():
                word_counts[token] = 1
            else:
                word_counts[token] += 1
    return word_counts

word_counts = count_words(tokens)
word_counts


{'house': 10,
 'judiciary': 5,
 'committee': 7,
 'invited': 1,
 'president': 3,

 ...

 "president's": 1,
 'proceedings.': 1,
 'copied': 1,
 'pat': 1,
 'cipollone.': 1}


Our poor tokenizing shows up again in the final token above. In the next article, I'll show you replacement tokenizers you can drop in place to help with this. Why not do this from the start? As I said, I want to focus on the text summarization steps.
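If you would rather not wait, one possible drop-in replacement (just a suggestion; not necessarily what the follow-up will use) is NLTK's tokenizers, assuming NLTK is installed and its 'punkt' sentence models have been downloaded:

# Possible drop-in replacements using NLTK (assumes `pip install nltk` and a
# one-time nltk.download('punkt') to fetch the sentence tokenizer models)
from nltk.tokenize import word_tokenize, sent_tokenize

def tokenizer(s):
    return [word.lower() for word in word_tokenize(s)]

def sent_tokenizer(s):
    return [sent.strip() for sent in sent_tokenize(s)]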

Now that we have our word counts, we can build a word frequency distribution:

def word_freq_distribution(word_counts):
    # Divide each word's count by the count of the most frequent word
    freq_dist = {}
    max_freq = max(word_counts.values())
    for word in word_counts.keys():
        freq_dist[word] = (word_counts[word]/max_freq)
    return freq_dist

freq_dist = word_freq_distribution(word_counts)
freq_dist


{'house': 1.0,
 'judiciary': 0.5,
 'committee': 0.7,
 'invited': 0.1,
 'president': 0.3,

 ...

 "president's": 0.1,
 'proceedings.': 0.1,
 'copied': 0.1,
 'pat': 0.1,
 'cipollone.': 0.1}


And there we go: we divided the occurrence of each word by the frequency of the most occurring word to get our distribution.

Next we want to score our sentences by using the frequency distribution we generated. This is simply a matter of summing up the scores of each word in a sentence and hanging on to the total. Our function takes a max_len argument which sets the maximum length (in words) of sentences to be considered for inclusion in the summary. It should be relatively easy to see that, given the way we are scoring our sentences, we could be biasing towards long sentences.

def score_sentences(sents, freq_dist, max_len=40):
    # Sum the frequency scores of the words in each sentence,
    # skipping sentences longer than max_len words
    sent_scores = {}
    for sent in sents:
        words = sent.split(' ')
        for word in words:
            if word.lower() in freq_dist.keys():
                if len(words) < max_len:
                    if sent not in sent_scores.keys():
                        sent_scores[sent] = freq_dist[word.lower()]
                    else:
                        sent_scores[sent] += freq_dist[word.lower()]
    return sent_scores

sent_scores = score_sentences(sents, freq_dist)
sent_scores


{"The House Judiciary Committee has invited President Donald Trump or his counsel to participate in the panel's first impeachment hearing next week as the House moves another step closer to impeaching the President": 6.899999999999999,
 'The committee announced that it would hold a hearing December 4 on the "constitutional grounds for presidential impeachment," with a panel of expert witnesses testifying': 2.8000000000000007,
 'House Judiciary Chairman Jerry Nadler sent a letter to Trump on Tuesday notifying him of the hearing and inviting the President or his counsel to participate, including asking questions of the witnesses': 5.099999999999999,
 '"I write to ask if you or your counsel plan to attend the hearing or make a request to question the witness panel," the New York Democrat wrote': 2.5000000000000004,
 'In the letter, Nadler said the hearing would "serve as an opportunity to discuss the historical and constitutional basis of impeachment, as well as the Framers\' intent and understanding of terms like \'high crimes and misdemeanors': 3.300000000000001,
 '\' "\n"We expect to discuss the constitutional framework through which the House may analyze the evidence gathered in the present inquiry," Nadler added': 2.7,
 '"We will also discuss whether your alleged actions warrant the House\'s exercising its authority to adopt articles of impeachment': 1.6999999999999997,
 'The hearing announcement comes as the Intelligence Committee plans to release its report summarizing the findings of its investigation to the House Judiciary Committee soon after Congress returns from its Thanksgiving recess next week': 5.399999999999999,
 'Democratic aides declined to say what additional hearings they will schedule as part of the impeachment proceedings': 1.3,
 'The Judiciary Committee is expected to hold multiple hearings related to impeachment, and the panel would debate and approve articles of impeachment before a vote on the House floor': 4.300000000000001,
 'The aides said the first hearing was a "legal hearing" that would include some history of impeachment, as well as evaluating the seriousness of the allegations and the evidence against the President': 2.8000000000000007,
 "Nadler asked Trump to respond by Sunday on whether the White House wanted to participate in the hearings, as well as who would act as the President's counsel for the proceedings": 3.5000000000000004,
 'The letter was copied to White House Counsel Pat Cipollone': 2.2}
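
As flagged earlier, summing raw word scores can favor long sentences. One simple way to probe that, sketched below as a hypothetical variant (not part of the method used in this article), is to divide each sentence's score by its word count:

# Hypothetical variant: normalize each sentence's score by its length
# to reduce the bias toward long sentences flagged earlier
def score_sentences_normalized(sents, freq_dist, max_len=40):
    sent_scores = {}
    for sent in sents:
        words = sent.split(' ')
        if 0 < len(words) < max_len:
            score = sum(freq_dist.get(word.lower(), 0) for word in words)
            sent_scores[sent] = score / len(words)
    return sent_scores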


Now that we have scored our sentences for their importance, all that's left to do is select (i.e. extract, as in "extractive summarization") the top k sentences to represent the summary of the article. This function takes the sentence scores we generated above, as well as a value k for the number of highest scoring sentences to use for summarization. It returns a string summary of the concatenated top sentences, along with the sentence scores of the sentences used in the summarization.

def summarize(sent_scores, k):
    # Grab the k highest scoring sentences and concatenate them into the summary
    top_sents = Counter(sent_scores)
    summary = ''
    scores = []

    top = top_sents.most_common(k)
    for t in top:
        summary += t[0].strip()+'. '
        scores.append((t[1], t[0]))
    return summary[:-1], scores


Let's use the function to generate the summary.

summary, summary_sent_scores = summarize(sent_scores, 3)
print(summary)


The House Judiciary Committee has invited President Donald Trump or his 
counsel to participate in the panel's first impeachment hearing next week as 
the House moves another step closer to impeaching the President. The hearing 
announcement comes as the Intelligence Committee plans to release its report 
summarizing the findings of its investigation to the House Judiciary Committee 
soon after Congress returns from its Thanksgiving recess next week. House 
Judiciary Chairman Jerry Nadler sent a letter to Trump on Tuesday notifying 
him of the hearing and inviting the President or his counsel to participate, 
including asking questions of the witnesses.


And let's check out the summary sentence scores for good measure.

for score in summary_sent_scores: print(score[0], '->', score[1], '\n')


6.899999999999999 -> The House Judiciary Committee has invited President 
Donald Trump or his counsel to participate in the panel's first impeachment 
hearing next week as the House moves another step closer to impeaching the President 

5.399999999999999 -> The hearing announcement comes as the Intelligence Committee 
plans to release its report summarizing the findings of its investigation to 
the House Judiciary Committee soon after Congress returns from its Thanksgiving 
recess next week 

5.099999999999999 -> House Judiciary Chairman Jerry Nadler sent a letter to 
Trump on Tuesday notifying him of the hearing and inviting the President or 
his counsel to participate, including asking questions of the witnesses 


The summary seems reasonable at a quick pass, given the text of the article. Try out this simple method on some other text for further evidence.
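If you do, a small convenience wrapper (a sketch with a hypothetical name, relying only on the functions defined above) ties the steps together so the whole pipeline can be rerun on any article-length string:

# Hypothetical convenience wrapper tying the steps above together
def summarize_text(text, k=3):
    tokens = tokenizer(text)
    sents = sent_tokenizer(text)
    word_counts = count_words(tokens)
    freq_dist = word_freq_distribution(word_counts)
    sent_scores = score_sentences(sents, freq_dist)
    summary, scores = summarize(sent_scores, k)
    return summary

print(summarize_text(text, k=3))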

The next summarization article will build on this simple method in a few key ways, namely:

  • proper tokenization approaches
  • improvement to our baseline approach, using TF-IDF weighting instead of simple word frequency
  • use of an actual dataset for our summarization
  • evaluation of our results

See you next time.

 
 
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.