Web Scraping with R: Online Food Blogs Example
We scrape data from online food blogs to build a data set of recipes with ingredients, nutritional information and more, then run an exploratory analysis that yields some tasty insights.
Users can rate a recipe from 1 to 5 stars. As can be seen, pretty much all ratings are close to 5 stars (the median value is 4.8). This is not a surprise: individual food blogs tend to have a relatively small following, and the followers are the people who enjoy the recipes, so they rate them highly. This makes modelling ratings with features from the recipes rather hard, so instead I looked at the distribution of words across all the recipes.
I made use of the tidytext and tokenizers packages, which have been really useful. After some data munging, I performed a principal component analysis of the words appearing in the recipes and obtained some interesting results. For example, the following plot shows the vectors of all the words projected onto the first two principal components:
I find this plot rather fascinating since it captures some interesting effects:
- Ingredient vectors used in baking tend to be close to each other (milk, sugar, butter, flour etc.)
- Cheese, slice and shredded are close to each other (for obvious reasons)
- Garlic, minced, cloves, olive, oil are close to each other
- All of these groups of vectors point along different directions
This is in fact similar to what one would observe in word2vec models.
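To make the idea concrete, here is a minimal sketch of this kind of word-level PCA. It is not the code behind the plot above: the six recipe texts are made up, and the tokenization uses base R's `strsplit` instead of tidytext or tokenizers so that the example runs without extra packages. Words are treated as variables and recipes as observations, so the PCA loadings play the role of the word vectors.

```r
# Toy recipe texts, invented for illustration (not the scraped data).
recipes <- c(
  r1 = "mix flour sugar butter milk and bake",
  r2 = "whisk flour sugar butter eggs then bake",
  r3 = "saute minced garlic cloves in olive oil",
  r4 = "minced garlic cloves olive oil and tomatoes",
  r5 = "top with shredded cheese and a slice of cheese",
  r6 = "shredded cheese slice on bread with butter"
)

# Tokenize: lowercase, then split on anything that is not a letter.
tokens <- strsplit(tolower(recipes), "[^a-z]+")

# Build a recipe-by-word count matrix (rows: recipes, columns: words).
vocab <- sort(unique(unlist(tokens)))
counts <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
rownames(counts) <- names(recipes)
colnames(counts) <- vocab

# Keep only words that occur in at least two recipes to reduce noise.
counts <- counts[, colSums(counts > 0) >= 2]

# PCA with words as variables; each row of the rotation matrix is the
# loading vector of one word.
pca <- prcomp(counts, center = TRUE, scale. = FALSE)

# Word vectors projected on the first two principal components.
word_vectors <- pca$rotation[, 1:2]
print(round(word_vectors, 2))
```

Calling `plot(word_vectors, type = "n")` followed by `text(word_vectors, labels = rownames(word_vectors))` would draw a labelled scatter of these loadings. On real data one would likely also remove stop words and normalise the counts before running the PCA.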
One problem with this data was that more than half of the entries lacked nutritional information. Unlike the biased ratings, nutritional values should correlate strongly with ingredients, which could have led to a more interesting analysis.
Final Words
Online food blogs provide a great resource for data mining and exploration. What is outlined here only scratches the surface of what can be done with this data. I have several ideas which, in my opinion, could be quite interesting to explore:
- Scraping more data from food blogs and combining it with the current data set. I found that many sites use a similar JSON format for recipe data.
- Using images contained in the blogs to perform image classification (e.g. high-calorie food detection).
- Using data from Food Network, AllRecipes etc. (one may need to ask for permission to use the data in these cases).
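As a rough illustration of the first idea, the sketch below parses one recipe from a JSON string into a one-row data frame using the jsonlite package. The field names (`recipeIngredient`, `nutrition`, `aggregateRating`) are an assumption: they follow the schema.org Recipe vocabulary, which may or may not match the JSON format referred to above. The payload and the `parse_recipe` helper are hypothetical.

```r
library(jsonlite)

# Hypothetical payload in the style of schema.org "Recipe" markup.
recipe_json <- '{
  "@type": "Recipe",
  "name": "Example Garlic Bread",
  "recipeIngredient": ["4 cloves garlic, minced", "2 tbsp olive oil", "1 baguette"],
  "nutrition": {"@type": "NutritionInformation", "calories": "250 kcal"},
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.8", "ratingCount": "12"}
}'

# Hypothetical helper: turn one recipe JSON string into a one-row data frame.
parse_recipe <- function(txt) {
  x <- fromJSON(txt)
  # Optional fields (e.g. nutrition) are often absent, so fall back to NA.
  get_or_na <- function(value) {
    if (is.null(value)) NA_character_ else as.character(value)
  }
  data.frame(
    name = get_or_na(x$name),
    n_ingredients = length(x$recipeIngredient),
    calories = get_or_na(x$nutrition$calories),
    rating = as.numeric(get_or_na(x$aggregateRating$ratingValue)),
    stringsAsFactors = FALSE
  )
}

recipe <- parse_recipe(recipe_json)
print(recipe)
```

Because missing fields come back as `NA` rather than raising an error, rows from different sites could be stacked with `rbind` even when some sites omit nutritional information.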
As mentioned in the beginning, if there are any other ideas, or existing personal projects, please feel free to contact me for collaboration!
Original. Reposted with permission.
Bio: Burak Himmetoglu is a Data Scientist and High Performance Computing (HPC) specialist with a Ph.D. in physics. He has a strong background in mathematical modeling, data analysis, and programming, and is passionate about applying academic skills to solve difficult business problems and develop data products.