Web Scraping with R: Online Food Blogs Example

We consider scraping data from online food blogs to construct a data set of recipes with ingredients, nutritional information and more, and do exploratory analysis which provides tasty insights.

By Burak Himmetoglu, Computational Physicist, Data Scientist

Online Food Blogs

In this blog post I will discuss web scraping using R. As an example, I will consider scraping data from online food blogs to construct a data set of recipes. This data set contains ingredients, a short description, nutritional information and user ratings. Then, I will present a simple exploratory analysis that yields some interesting insights.

The code and notebooks (R markdown) for the analysis and web scraping are included in my repository. If you come across this blog and have some ideas, or independent projects, please let me know for a possible collaboration.

Web Scraping

With numerous food blogs and web sites offering lots of recipes, the web provides a great resource for mining food and nutrition based data. As a fun project, I took on this idea and created a simple repository containing the code for scraping food blog data. The functions that scrape the web data are in the script "utilities.R" and use the R packages rvest, jsonlite and the tidyverse. The website I have chosen to extract data from is called Pinch of Yum, which contains many recipes with beautiful photos accompanying them (this calls for another project idea using image recognition!).

The strategy that I used to scrape the data was to first understand the general outline of how recipes are stored on the website. Once you go to the main page and click recipes, you can see that there are 50 pages (at the time I obtained the data), each containing 15 recipe links. So, I skimmed through the html source of the main page (which you can obtain from your browser) and identified the locations of the hyperlinks to each recipe. Then, I wrote a simple function to locate these links automatically in all 50 pages.
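As a rough sketch of what this layout implies, the listing pages can be enumerated up front. The paging pattern below is taken from the URL used later in this post; the "http://" scheme and the variable names are assumptions made for illustration.

```r
n_pages <- 50           # number of listing pages at the time of scraping
recipes_per_page <- 15  # recipe links per listing page

# One URL per listing page (scheme assumed; pattern from the code later in the post)
page_urls <- paste0("http://pinchofyum.com/recipes?fwp_paged=", seq_len(n_pages))

# Upper bound on the number of recipe links to expect
expected_recipes <- n_pages * recipes_per_page

print(head(page_urls, 2))
print(expected_recipes)
```

Knowing the expected count in advance gives a simple sanity check on the number of links collected.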

Each hyperlink contains html tags that we need to remove for further processing. Below is a code snippet that does exactly that (a simplified version of the one in my repo):
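library(tidyverse)  # stringr, purrr and the %>% pipe

trim_ <- function(link){
  # Take the third space-separated piece of the tag, which holds the href
  temp1 <- str_split(as.character(link), " ")[[1]][3] %>%
    str_replace_all("\"", "") %>% # Remove double quotes
    str_replace("href=", "") %>%
    str_replace(">", " ")
  # Keep only the URL, dropping the link text after it
  str_split(temp1, " ")[[1]][1]
}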

Given a link to a recipe obtained from the html source, this function strips the html tags and returns the plain URL of the recipe, which we can later connect to. This function is used in another function below, which locates the recipes in each of the 50 pages. Again, the function below is a simplified version of the one included in my repo.
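Before moving on to that function, here is a small standalone illustration of the trimming step. It repeats the same string operations under the hypothetical name "trim_demo" and applies them to a made-up anchor tag; the real tags on the site may be laid out differently.

```r
library(stringr)
library(magrittr)  # provides %>%

# Same steps as the trimming function, repeated so the sketch is standalone
trim_demo <- function(link) {
  href_part <- str_split(as.character(link), " ")[[1]][3] %>%
    str_replace_all("\"", "") %>%    # drop double quotes
    str_replace("href=", "") %>%     # drop the attribute name
    str_replace(">", " ")            # separate the URL from the link text
  str_split(href_part, " ")[[1]][1]  # keep only the URL
}

# Made-up anchor tag, for illustration only
sample_link <- "<a class=\"block-link\" href=\"https://example.com/some-recipe\">Some Recipe</a>"
trimmed <- trim_demo(sample_link)
print(trimmed)
```

Note that this approach relies on the href being the third space-separated piece of the tag, so it would need adjusting if the attribute order changed.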

library(rvest)
get_recipe_links <- function(page_number){
  page <- read_html(paste0("http://pinchofyum.com/recipes?fwp_paged=",
                           page_number))
  links <- html_nodes(page, "a")

  ## Get locations of recipe links
  loc <- which(str_detect(links, "<a class"))
  links <- links[loc]
  ## Trim the text to get proper links
  all_recipe_links <- map_chr(links, trim_)
  ## Return
  all_recipe_links
}

This function takes the page number (1 to 50) as input and uses the "read_html" function to get the html source code. Since each page contains 15 recipes, we need to locate the links to them. The variable "links" does that by using the function "html_nodes" to select the "a" nodes, which are the nodes that contain links. Scanning through the html source, I realized that the recipe links contain the expression "<a class", so I used it as a pattern to detect them and stored their positions in the variable "loc". After that, I selected the nodes which contain the recipes (next line). Finally, the trimming function above is applied to each node (using the map_chr function) to strip the html tags. As a result, this function returns a vector of links to the recipes on a given page.
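As an aside, rvest can also read the link target directly from a node, which avoids the string trimming. The sketch below is an alternative, not the method used in the repo: it runs on a made-up page fragment, and the selector "a[class]" (anchors that carry a class attribute) only approximates the "<a class" pattern described above.

```r
library(rvest)

# Made-up page fragment, for illustration only
html <- '<html><body>
  <a class="block-link" href="https://example.com/recipe-one">Recipe One</a>
  <a href="https://example.com/about">About</a>
  <a class="block-link" href="https://example.com/recipe-two">Recipe Two</a>
</body></html>'

page <- read_html(html)

# Select only the anchors that carry a class attribute
recipe_nodes <- html_nodes(page, "a[class]")

# html_attr() returns the href value of each selected node
recipe_links <- html_attr(recipe_nodes, "href")
print(recipe_links)
```

Working with attributes rather than raw tag text tends to be less sensitive to changes in attribute order or spacing.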

Now, the next step is to connect to all links returned by "get_recipe_links" and then scrape the recipe data one by one. In this case I was very lucky, since the recipe data was stored in JSON format in the html source, which made the job very easy.

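library(jsonlite)
get_recipe_data <- function(link_to_recipe){
  page <- read_html(link_to_recipe)
  script_ <- html_nodes(page, "script")
  ## Locate the script node that holds the recipe JSON
  loc_json <- which(str_detect(script_, "application/ld"))
  recipe_data <- fromJSON(html_text(script_[loc_json]))
  ## ... rest omitted: builds and returns a data frame (see the repo)
}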

In this function, "link_to_recipe" is a link returned from "get_recipe_links". First, the page at this link is read, and then the JSON data is located among the "script" nodes. The JSON containing the recipe data carries the expression "application/ld", which is used to pinpoint it. Then, the data is simply parsed by the "fromJSON" function. I left the rest of the code out, since it is rather long, though easy to understand. What happens next is that features from the JSON are extracted and stored in a data frame, which this function returns. The only cumbersome part was that the JSON data was not uniformly formatted across the whole site, so I had to insert many control statements to handle this issue, which you can see in the full code in the repo.
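To give a flavor of the omitted part, here is a minimal sketch of turning embedded JSON into a one-row data frame while tolerating missing fields. It is not the code from the repo: the page is made up, the field names follow the schema.org Recipe vocabulary and are assumed rather than taken from the site, and "get_field" is a hypothetical helper.

```r
library(rvest)
library(jsonlite)

# Made-up page with embedded JSON-LD, for illustration only
html <- '<html><head>
<script type="application/ld+json">
{"@type": "Recipe", "name": "Example Soup",
 "recipeIngredient": ["1 onion", "2 carrots"],
 "nutrition": {"calories": "120 calories"}}
</script>
</head><body></body></html>'

page <- read_html(html)
json_node <- html_nodes(page, 'script[type="application/ld+json"]')
recipe <- fromJSON(html_text(json_node[[1]]))

# Hypothetical helper: return NA when a field is absent
get_field <- function(x, name) {
  value <- x[[name]]
  if (is.null(value)) NA else value
}

recipe_df <- data.frame(
  name = get_field(recipe, "name"),
  calories = get_field(recipe$nutrition, "calories"),
  n_ingredients = length(recipe$recipeIngredient),
  rating = get_field(recipe$aggregateRating, "ratingValue"),  # absent in this example
  stringsAsFactors = FALSE
)
print(recipe_df)
```

Filling absent fields with NA keeps every recipe at the same set of columns, so the rows can be bound together later.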

Now that we have these functions, we can scrape the data. Using
all_links <- 1:50 %>% map(get_recipe_links) %>% unlist()

one can get all the links to the recipes. Then, using the following inside a for loop (after initializing "all_recipes_df")

all_recipes_df <- rbind(all_recipes_df,
                        get_recipe_data(all_links[i]))  # i: loop index (name assumed)

one obtains all the recipes in a data frame. I specifically used a for loop instead of something like "map_df", since I wanted the progress to be printed on the screen as each recipe link is visited. All of this is done in the script "scrape.R" in my repo.
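The shape of such a loop might look like the sketch below. To keep it runnable offline, it uses made-up links and a hypothetical stand-in, "fake_get_recipe_data", in place of the real scraping function; the actual loop lives in "scrape.R".

```r
# Hypothetical stand-in for the scraping function, so the sketch runs offline
fake_get_recipe_data <- function(link) {
  data.frame(link = link, n_chars = nchar(link), stringsAsFactors = FALSE)
}

# Made-up links, for illustration only
all_links <- c("https://example.com/recipe-one", "https://example.com/recipe-two")

all_recipes_df <- data.frame()  # initialize before the loop
for (i in seq_along(all_links)) {
  # Print progress for each link as it is processed
  cat("Scraping recipe", i, "of", length(all_links), ":", all_links[i], "\n")
  all_recipes_df <- rbind(all_recipes_df, fake_get_recipe_data(all_links[i]))
  # Sys.sleep(1)  # optional pause between requests when hitting a real site
}
print(all_recipes_df)
```

Growing a data frame with rbind is fine for a few hundred rows; for much larger scrapes, collecting the rows in a list and binding once at the end is usually faster.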

At the end, all the recipes are stored in a data frame "all_recipes_df" which contains lots of interesting information. Below, I will discuss very briefly a simple analysis that can be done with this data.

Exploratory Analysis

I have written a detailed markdown document that performs the analysis, which can be found in my repo and is also available on RPubs. So, I will only discuss some of the results here.

Let's look at the distribution of ratings in the website (next page)