

Automated Web Scraping in R


How to automatically web scrape periodically so you can analyze timely/frequently updated data.



Sponsored Post.
By Rebecca Merrett, Instructor at Data Science Dojo
By Rebecca Merrett, Instructor at Data Science Dojo
There are many blogs and tutorials that teach you how to scrape data from a bunch of web pages once and then you’re done. But one-off web scraping isn’t useful for applications that need sentiment analysis on recent or timely content, that capture changing events and commentary, or that analyze trends in real time. Scraping once for a one-off analysis of historical data is a fine academic exercise, but it falls short when you want to work with timely or frequently updated data.

Scenario: You would like to tap into news sources to analyze the political events that are changing by the hour and people’s comments on these events. These events could be analyzed to summarize the key discussions and debates in the comments, rate the overall sentiment of the comments, find the key themes in the headlines, see how events and commentary change over time, and more. You need a collection of recent political events or news scraped every hour so that you can analyze these events.

Let’s go fetch your data!

Example source:

Reddit’s r/politics is a repository of political news from a variety of news sites and includes comments or discussion on the news. Content is added and updated at least every hour.

Tool:

R’s rvest library is an easy-to-use tool for scraping content within HTML tags. To install rvest, run this command in R:

install.packages("rvest")

A quick demonstration of rvest

The commands below scrape the news headline and comments from a Reddit page.

library(rvest)

reddit_wbpg <- read_html("https://www.reddit.com/r/politics/comments/a1j9xs/partisan_election_officials_are_inherently_unfair/")

reddit_wbpg %>%
  html_node("title") %>%
  html_text()

[1] "Partisan Election Officials Are 'Inherently Unfair' But Probably Here To Stay : politics"

 

reddit_wbpg %>%
  html_nodes("p.yklcuq-10") %>%
  html_text()

[5] "How is American Express never hacked?"
[6] "Let’s use their system"


How did we grab this text? We grabbed the text between the relevant HTML tags and classes. Right click on the web page and select View page source to search for the text and find the relevant HTML tags.


Web scraping – let’s go!

The web scraping program we are going to write will:

  • Grab the URL and time of the latest Reddit pages added to r/politics
  • Filter the pages down to those that are marked as published no more than an hour ago
  • Loop through each filtered page and scrape the headline and comments from each page
  • Create a dataframe containing the Reddit news headline and each comment belonging to that headline.

Once the data is in a dataframe, you are then free to plug these data into your analysis function.

Step 1

First, we need to load rvest into R and read in our Reddit political news data source.

library(rvest)

reddit_political_news <- read_html("https://www.reddit.com/r/politics/new/")

 

Step 2

Next, we need to grab the times of all the news pages and their corresponding URLs so we can filter them down to pages published within the hour (i.e. all pages published minutes ago).

time <- reddit_political_news %>%
  html_nodes("a._3jOxDPIQ0KaOWpzvSQo-1s") %>%
  html_text()

time

 [1] "2 minutes ago"  "4 minutes ago"  "5 minutes ago"  "10 minutes ago"
 [5] "11 minutes ago" "11 minutes ago" "12 minutes ago" "15 minutes ago"
 [9] "17 minutes ago" "21 minutes ago" "25 minutes ago" "26 minutes ago"
[13] "28 minutes ago" "28 minutes ago" "32 minutes ago" "37 minutes ago"
[17] "37 minutes ago" "39 minutes ago" "39 minutes ago" "40 minutes ago"
[21] "43 minutes ago" "45 minutes ago" "46 minutes ago" "46 minutes ago"
[25] "51 minutes ago"

 

urls <- reddit_political_news %>%
  html_nodes("a._3jOxDPIQ0KaOWpzvSQo-1s") %>%
  html_attr("href")

urls

[1] "https://www.reddit.com/r/politics/comments/a1y40u/trump_doesnt_want_you_to_know_but_its_time_to/"
[2] "https://www.reddit.com/r/politics/comments/a1y3ez/70_earthquake_hits_near_anchorage_alaska/"
[3] "https://www.reddit.com/r/politics/comments/a1y32s/crucial_stand_for_democracy_and_enlightened/"
[4] "https://www.reddit.com/r/politics/comments/a1y1qu/north_carolina_election_that_looked_to_be/"
[5] "https://www.reddit.com/r/politics/comments/a1y19p/hemp_is_finally_about_to_go_fully_legit_in_the_us/"


Step 3

To filter pages, we need to make a dataframe out of our ‘time’ and ‘urls’ vectors. We’ll filter our rows based on a partial match of the time marked as either ‘x minutes’ or ‘now’.

reddit_newspgs_times <- data.frame(NewsPage=urls, PublishedTime=time)

#Check the dimensions

dim(reddit_newspgs_times)

[1] 25  2


Go ahead and filter the URLs based on the time either being ‘x minutes’ or ‘now’.

reddit_recent_data <- reddit_newspgs_times[grep("minute|now", reddit_newspgs_times$PublishedTime),]

#Check the dimensions (# items will be less if not all pages were published within mins)

dim(reddit_recent_data)




[1] 25  2

Step 4

Now that we have filtered down the pages published within the hour, we are going to grab the golden nuggets of data that we plan to analyze. We’ll loop through the filtered list of URLs, grab the main head and paragraph text of the comments, and store these in their own vectors. Each comment will be its own element or item in the vector, with the corresponding news title/headline each comment belongs to.

titles <- c()
comments <- c()

for(i in reddit_recent_data$NewsPage){
  # Read each page once, then pull both the comments and the title from it
  news_page <- read_html(i)
  body <- news_page %>%
    html_nodes("p.yklcuq-10") %>%
    html_text()
  comments <- append(comments, body)
  title <- news_page %>%
    html_node("title") %>%
    html_text()
  # Repeat the title so each comment is paired with its headline
  titles <- append(titles, rep(title, each=length(body)))
}

Step 5

Last but not least, we’ll create a dataframe using the ‘titles’ and ‘comments’ vectors to get the data ready for analyzing.

reddit_hourly_data <- data.frame(Headline=titles, Comments=comments)

dim(reddit_hourly_data)

[1] 497   2
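Since this script will eventually run unattended every hour, you may also want each run to persist its results rather than hold them only in memory. A minimal sketch, assuming the reddit_hourly_data dataframe from above (the filename pattern is just an example):

```r
# Sketch: save each hourly run to a timestamped CSV so runs don't overwrite each other.
out_file <- format(Sys.time(), "reddit_politics_%Y%m%d_%H%M.csv")
write.csv(reddit_hourly_data, out_file, row.names = FALSE)
```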

 

There are several ways you could analyze these texts, depending on your application. For example, Data Science Dojo’s free Text Analytics video series goes through an end-to-end demonstration of preparing and analyzing text to predict the class label of the text. With nearly every single web page or business document containing some text, it is worth understanding the fundamentals of data mining for text, as well as important machine learning concepts.
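As a minimal starting point before reaching for a full text-analytics pipeline, a simple word-frequency count over the scraped comments can surface recurring themes. This is only a rough sketch, assuming the reddit_hourly_data dataframe from Step 5; the tokenization is deliberately crude:

```r
# Sketch: ten most frequent longer words across all scraped comments.
comment_text <- tolower(paste(reddit_hourly_data$Comments, collapse = " "))
words <- unlist(strsplit(comment_text, "[^a-z']+"))  # split on non-letter runs
words <- words[nchar(words) > 3]                     # drop very short (mostly stop) words
head(sort(table(words), decreasing = TRUE), 10)
```

A real application would use a proper tokenizer and stop-word list, but this is enough to sanity-check what the hourly scrape is pulling in.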

Automate running your web scraping script

Here’s where the real automation comes into play. So far we have completed a fairly standard web scraping task, but with the addition of filtering and grabbing content based on a time window or timeframe. This script will save us from manually fetching the data every hour ourselves. But we need to automate the whole process by running this script in the background of our computer and freeing our hands to work on more interesting tasks.

Task Scheduler in Windows offers an easy user interface for scheduling a script or program to run every minute, hour, day, week, month, etc. The macOS alternative to Task Scheduler is Automator, and the Linux alternative is GNOME Schedule. We’ll use Task Scheduler in this tutorial, but Automator and GNOME Schedule work in a similar way.
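On Linux (and macOS), the same hourly schedule can also be expressed directly as a cron entry instead of a GUI scheduler. A sketch, where the script and log paths are hypothetical examples:

```shell
# Hypothetical crontab entry (edit with `crontab -e`):
# run the scraper at the top of every hour and append output to a log.
0 * * * * Rscript /home/you/scraper.R >> /home/you/scraper.log 2>&1
```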

Step 1

Go to the Action tab in Task Scheduler and select Create Task.


Step 2

Give your task a name such as ‘Web Scraper Reddit Politics’.

Select the option Run whether user is logged in or not.

Click OK.


Step 3

Go to the Actions tab and click New.

Make sure Start a program option is selected from the Action dropdown menu.

Copy the full path to the Rcmd.exe file on your local computer and paste it into the Program/script box.

Copy the full path to your R script and paste it into the Add arguments box, with ‘BATCH’ before the path.

Click OK.
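For reference, the two boxes might look something like this (hypothetical example paths; R’s install location varies by version and machine):

```
Program/script:  C:\Program Files\R\R-3.5.1\bin\Rcmd.exe
Add arguments:   BATCH C:\Users\you\scraper.R
```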

Step 4

Go to the Conditions tab and under Power select the Wake the computer to run this task option.


Step 5

Go to the Triggers tab and click New.

Under Advanced Settings select 1 hour to repeat the task and Indefinitely for the duration.

Click OK, then OK again to exit the window.

Step 6

Click on Active Tasks in the main panel to check that your web scraping task is active.

Your script will now run every hour to web scrape the latest data from Reddit’s r/politics.

What’s next?

Now that you have a program automatically fetching your data for you, what should you do with it? You could extend your R script with an analysis function or component so that it not only fetches the data every hour, but also analyzes it and emails you an alert with the results.

But first you need to understand how to analyze text, so it’s worth getting a good grasp of text analytics or natural language processing. With so much textual data in the world underutilized, not only on the web but also in important business documents, text mining is a useful addition to your data science toolbox. You can learn these skills properly without signing up for a five-year PhD or master’s program. Data Science Dojo’s bootcamp is a five-day, hands-on course that can help you get up to speed on both text analytics and core machine learning concepts and algorithms.

