Simple Text Scraping, Parsing, and Processing with this Python Library

Scraping, parsing, and processing text data from the web can be difficult. But it can also be easy, using Newspaper3k.



Photo by Peter Lawrence on Unsplash

Looking for a library to help with scraping, parsing, processing, and extracting metadata from news articles? Newspaper can help. Newspaper describes itself as "[n]ews, full-text, and article metadata extraction in Python 3." I would say, with the utmost respect, that Newspaper is a quick and dirty text parsing and processing library. It isn't foolproof, and won't always be able to fulfill your every need with every article. It will generally do a very good job, however, and do so quickly.

Let's get started and see what you can accomplish quickly and easily with the library. If you're using Python 3, installation is accomplished with:

pip install newspaper3k
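
If you plan to use the NLP features shown later in this article, note that Newspaper's nlp() method relies on NLTK's sentence tokenizer, so you may also need a one-time download of the corresponding NLTK data. A quick way to do that (assuming NLTK was pulled in as a Newspaper dependency) is:

# One-time download of the NLTK tokenizer data used by Newspaper's nlp() method
import nltk
nltk.download('punkt')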

Once installed, Newspaper is very easy to use.

Let's import the library, define an article on the web we want to use for processing, and download the article. We will use the recent KDnuggets article Avoid These Five Behaviors That Make You Look Like A Data Novice by Tessa Xie for these purposes.

from newspaper import Article
kdn_article = Article(url="https://www.kdnuggets.com/2021/10/avoid-five-behaviors-data-novice.html", language='en')
kdn_article.download()


Now, let's see what we have downloaded.

# Print out the raw article
print(kdn_article.html)


<!DOCTYPE html>
<html xmlns="https://www.w3.org/1999/xhtml" lang="en-US">
<head profile="https://gmpg.org/xfn/11">
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <meta name=viewport content="width=device-width, initial-scale=1">
  ...
<!-- Dynamic page generated in 1.581 seconds. -->
<!-- Cached page generated by WP-Super-Cache on 2021-10-28 20:02:56 -->

<!-- Compression = gzip -->


This is the entire HTML page for the article. This isn't very useful on its own; let's take our first processing step and parse the article with Newspaper. Once we have done so, let's print the text of the parsed article.

# Parse the article, print parsed content
kdn_article.parse()
print(kdn_article.text)


If you are new to the Data Science industry or a well-versed veteran in all things data and analytics, there are always key pitfalls that each of us can easily slide into if we are not careful. These behaviors not only make us appear like novices, but they can risk our position as a trustworthy, likable data partner with stakeholder.
...
Original. Reposted with permission.

Bio: Tessa Xie is a Data scientist in the AV industry, and ex-McKinsey, and 3x Top Medium Writer. Tessa is also at the tip of the data spear by day, a writer by night, and a painter, diver, and much more on the weekends.

Related:


That looks better. We have removed the HTML not related to the article we downloaded, and have otherwise extracted the useful text from within the remaining portions of HTML.

Let's see what metadata we can extract from the parsed article.

# Article title
print(kdn_article.title)


Avoid These Five Behaviors That Make You Look Like A Data Novice


# Article's top image
print(kdn_article.top_image)


https://www.kdnuggets.com/wp-content/uploads/avoid-five-behaviors-data-novice.jpg


# All article images
print(kdn_article.images)


{'https://www.kdnuggets.com/wp-content/uploads/envelope.png', 
'https://www.kdnuggets.com/wp-content/uploads/tripled-my-income-data-science-18-months-small.jpg', 
'https://www.kdnuggets.com/wp-content/uploads/avoid-five-behaviors-data-novice.jpg', 
'https://www.kdnuggets.com/images/in_c48.png', 
'https://www.kdnuggets.com/images/fb_c48.png', 
'https://www.kdnuggets.com/images/tw_c48.png', 
'https://www.kdnuggets.com/images/menu-30.png', 
'https://www.kdnuggets.com/images/search-icon.png'}


# Article author
print(kdn_article.authors)

# Article publication date
print(kdn_article.publish_date)


[]
None


You can see from the above that some metadata has been easily extracted, while in the case of the publication date and authors, Newspaper has come up empty-handed. This is what I referred to at the start of the article: the library isn't magic, and if an article is not formatted in a way that facilitates Newspaper's pattern matching, it won't be able to identify and extract this metadata.
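
If you need these fields regardless, one option is to guard against the empty results and fall back to digging through the raw meta tags that Newspaper collects in the Article's meta_data dictionary. A minimal sketch, assuming this particular page even exposes an author meta tag (it may not):

# Fall back to the page's meta tags when Newspaper comes up empty
author = kdn_article.authors[0] if kdn_article.authors else None
publish_date = kdn_article.publish_date

if author is None:
    # meta_data holds the page's <meta> tags; an 'author' entry may or may not exist
    author = kdn_article.meta_data.get('author')

print(author or 'unknown author')
print(publish_date or 'unknown date')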

Find out what else you can accomplish with a parsed article here.
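
As a small taste, a parsed Article object also exposes metadata pulled from the page's meta tags; how many of these are populated depends entirely on the page's markup:

# Additional attributes available after parse() (any of these may be empty)
print(kdn_article.meta_description)   # the page's meta description
print(kdn_article.meta_keywords)      # keywords declared in the page's meta tags
print(kdn_article.meta_lang)          # language declared in the page metadata
print(kdn_article.canonical_link)     # the page's canonical URL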

Moving on to something more interesting... once an article has been downloaded and parsed, it can also be processed using Newspaper's built-in NLP capabilities, which is done as follows:

# Perform higher level processing on article
kdn_article.nlp()


Here are a few tasks we can perform on a processed article.

# Article keywords
print(kdn_article.keywords)


['avoid', 'data', 'dont', 'instead', 'things', 'insights', 'novice', 'work', 'quality', 'understand', 'behaviors', 'stakeholders', 'look', 'sample']


# Article summary
print(kdn_article.summary)


If you are new to the Data Science industry or a well-versed veteran in all things data and analytics, there are always key pitfalls that each of us can easily slide into if we are not careful.
There are noticeable differences between people who are new to the data world and those who truly understand how to handle data and be helpful data partners.
So as a data expert, you should know better than trusting data quality at face value.
But in reality, unless you are an ML engineer, you rarely need 10-layer neural networks in your day-to-day data work.
Make sure to QC your data and sanity-check your insights, and always caveat findings when data quality or the sample size is a concern.


This is certainly more interesting than the parsed article metadata extraction above, though both the processed and parsed data extracted from an article could certainly be useful.

Keep in mind that there are numerous ways to automate the summarization of an article using a variety of different libraries and tools; Newspaper, however, gives you reasonable results in a single line of code, without even the need to tune parameters. You can compare this with implementing a similar extractive summarization process from scratch in Python, using a simple word frequency approach, as in my previous article Getting Started with Automated Text Summarization; you will find that significantly more code is required for similar results.
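
For a rough sense of what that from-scratch route involves, here is a deliberately simplified word-frequency summarizer (no stop-word removal or sentence-length normalization, and not the exact approach from that article):

# Toy extractive summarizer: rank sentences by summed word frequency
import re
from collections import Counter

def frequency_summary(text, n_sentences=5):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    word_freqs = Counter(re.findall(r'\w+', text.lower()))
    ranked = sorted(
        sentences,
        key=lambda s: sum(word_freqs[w] for w in re.findall(r'\w+', s.lower())),
        reverse=True,
    )
    return ' '.join(ranked[:n_sentences])

print(frequency_summary(kdn_article.text))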

Interested in only leveraging Newspaper's summarization functionality? Here's a quick, self-contained example:

from newspaper import Article
cnn_article = Article(url="https://www.cnn.com/2021/10/28/tech/facebook-mark-zuckerberg-keynote-announcements/index.html", language='en')
cnn_article.download()
cnn_article.parse()
cnn_article.nlp()
print(cnn_article.summary)


The company formerly known as Facebook also said in a press release that it plans to begin trading under the stock ticker "MVRS" on December 1.
Facebook is one of the most used products in the history of the world," Zuckerberg said on Thursday.
"Today we're seen as a social media company," he added, "but in our DNA, we are a company that builds technology to connect people.
But on Zuckerberg's personal Facebook page , his job title was changed to: "Founder and CEO at Meta."
When asked by The Verge if he would remain CEO at Facebook in the next 5 years, he said: "Probably.


And there you have it.

Find out what more you can accomplish with a processed article here.

Newspaper isn't perfect and has its limitations, but you can see how quickly and easily it can be invoked and leveraged, and how useful it can be even when you run up against some of those limitations. Personally, I have written my own code to perform a number of the steps above, and have leveraged several different libraries to accomplish some of the others, generally with more elbow grease required.

There is actually much more you can accomplish with the library, and I encourage you to investigate the possibilities. Hopefully you are able to use Newspaper for your own projects.
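
As one final example, Newspaper can also crawl an entire news source and queue up its articles for the same download/parse/nlp treatment. A minimal sketch (what it discovers depends entirely on the site at crawl time):

import newspaper

# Crawl the site's category and feed pages to build a Source object
# (memoize_articles=False re-crawls everything rather than skipping already-seen URLs)
kdn_source = newspaper.build('https://www.kdnuggets.com', memoize_articles=False)

# Print the URLs of a few of the discovered (not yet downloaded) articles
for article in kdn_source.articles[:5]:
    print(article.url)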

 