A Primer on Web Scraping in R
If you are a data scientist who wants to capture data from web pages, you would not want to open every page manually and scrape each one by hand. To remove this barrier, R offers packages that automate the process.
We see that content_data has a length of 38, while the website shows only 11 paragraphs in the main content. The additional paragraphs captured are actually the comments, likes and other content that appear after the main blog post. For our purposes, we will keep only the first 11 values of content_data and discard the remaining text from our data frame.
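The trimming step above can be sketched as follows. This assumes content_data was built earlier by extracting every paragraph node from the page; the variable name main_content is illustrative.

```r
library(rvest)

# content_data is assumed to hold the text of every <p> node on the page, e.g.
# content_data <- page %>% html_nodes("p") %>% html_text()
# Keep only the first 11 paragraphs, which belong to the main post
main_content <- content_data[1:11]
length(main_content)  # 11
```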
Since we have captured the comments section, let us see how many comments were made. SelectorGadget shows that the .fn tag can be used to extract the names of the people who commented on the article.
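A minimal sketch of that extraction, assuming page holds the parsed article (e.g. from read_html()):

```r
library(rvest)

# Commenter names sit in nodes with the .fn class (found with SelectorGadget)
commenters <- page %>% html_nodes(".fn") %>% html_text()
commenters          # names of everyone who commented
length(commenters)  # total number of comments
```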
This is consistent with the article, where Gautam Kumar, the author of the article, and pgdbaunofficial, the page owner, made multiple comments. We will now convert our data into a data frame.
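The conversion can be sketched as below. The variable names on the right-hand side are illustrative and assume the scraping steps carried out earlier in the article.

```r
# Assemble the scraped pieces into a single-row data frame;
# post_date, post_title, post_description, main_content and commenters
# are assumed to come from the earlier scraping steps
blog_df <- data.frame(
  Date        = post_date,
  Title       = post_title,
  Description = post_description,
  Content     = paste(main_content, collapse = " "),
  Commenters  = length(commenters),
  stringsAsFactors = FALSE
)
str(blog_df)
```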
This is a simple data frame with only five columns: date, title, description, content and number of commenters. As long as we remain on the same website, the same code can be reused for all the articles. For a different website, however, we may need a different piece of code. Let's first try another post from the same blog. The link is https://pgdbablog.wordpress.com/2015/12/18/pgdba-chronicles-first-semester/
This one has six titles: the first is the summary, the next four are captions for the images and the last is the title heading. We can capture the content and the comments in the same way as before. Let's instead capture the images, which are new in this post. SelectorGadget shows that the images carry tags from '.wp-image-51' to '.wp-image-54'. Let's download the last image. I am going to use an alternative approach, where I set up an html_session using the URL.
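A sketch of the session-based download, assuming the image's address lives in its src attribute (the destination filename is arbitrary):

```r
library(rvest)

url <- "https://pgdbablog.wordpress.com/2015/12/18/pgdba-chronicles-first-semester/"
session <- html_session(url)

# Pull the src attribute of the last image and save it to disk
img_src <- session %>% html_node(".wp-image-54") %>% html_attr("src")
download.file(img_src, destfile = "semester_image.jpg", mode = "wb")
```

Note that newer versions of rvest rename html_session() to session(); either works depending on your installed version.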
The Final Deed - Scraping Multiple Types of Content
For the last webpage, we will move beyond the blog and use a more content-rich page. This time, we will capture data from the Moneyball page on the IMDb website. The link we need is: http://www.imdb.com/title/tt1210166/
IMDb stores its content in well-organized tags such as #titleDetails, #titleDidYouKnow and #titleCast, which makes it easy to scrape the page by specifying whichever content we need. The cast is also displayed in the form of a table, so we can use a table tag to capture it. Let's capture the cast using the tag versus using the table tag.
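The two approaches can be sketched as follows. The CSS selectors are assumptions based on IMDb's markup at the time of writing; the site's layout changes periodically, so they may need adjusting.

```r
library(rvest)

page <- read_html("http://www.imdb.com/title/tt1210166/")

# Approach 1: pull the cast names out of the cast section by CSS selector
cast <- page %>% html_nodes("#titleCast .itemprop span") %>% html_text()

# Approach 2: read the whole cast table at once
cast_table <- page %>% html_node("#titleCast table") %>% html_table(fill = TRUE)
```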
We see that there are no major differences; the only one is that cast_table is formatted as a table, because we used the html_table() function instead of converting the nodes to text.
The Beginning of Web Scraping
We can do a lot with web scraping if we know the right way to do it. The rvest package makes it very easy to scrape pages and capture content in the form of data frames or files. Besides scraping blogs and rating websites, we can also automate mundane tasks such as scraping job listings from job sites or content from LinkedIn. Most web scraping tasks focus on getting data from web pages and then using it for analysis. As an alternative, we can also scrape pages using XPath expressions instead of CSS selectors; in this case, the 'table' tag becomes the //table XPath expression, and data can be scraped in a similar fashion. The captured data, once converted to data frames, can then be used for analysis and for learning more about what is happening on social media today. In the end, the process remains the same: find the web page, identify the tags to be captured, convert them to text and store them in a data frame. Hopefully, web scraping now seems easier than it did when you first started reading this article.
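The XPath alternative mentioned above can be sketched like this, assuming page still holds the parsed IMDb page from the earlier example:

```r
library(rvest)

# Reach the first table via an XPath expression instead of a CSS selector
cast_table_xpath <- page %>%
  html_nodes(xpath = "//table") %>%
  .[[1]] %>%
  html_table(fill = TRUE)
```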
This article was contributed by Perceptive Analytics. Madhur Modi, Chaitanya Sagar, Jyothirmayee Thondamallu and Saneesh Veetil contributed to this article.
Perceptive Analytics provides data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.