Web Scraping for Dataset Curation, Part 2: Tidying Craft Beer Data
This is the second part in a two-part series on curating data from the web. The first part focused on web scraping, while this post details the process of tidying the scraped data after the fact.
By Jean-Nicholas Hould, JeanNicholasHould.com.
In the scraping part, I didn't bother to clean up the data. There are a few reasons for this. First, pandas is my tool of choice for manipulating the data. Second, I wanted to separate the concerns: scraping and cleaning.
In a previous post, I explained the concept of "tidy data". This concept was initially presented by Hadley Wickham, who describes a tidy dataset as one having the following attributes:
- Each variable forms a column and contains values
- Each observation forms a row
- Each type of observational unit forms a table
If you take a moment to look at the dataset we scraped so far, you will notice it’s messy. It does not conform to the properties of a “tidy dataset”. You will notice there are multiple observational units in the same table (beers and breweries), there are multiple variables in a single column (city and state), etc. In this section of the post, we’ll clean up the dataset.
| abv  | brewery_location | brewery_name      | ibu | id   | name          | size   | style              |
|------|------------------|-------------------|-----|------|---------------|--------|--------------------|
| 4.5% | Minneapolis, MN  | NorthGate Brewing | 50  | 2692 | Get Together  | 16 oz. | American IPA       |
| 4.9% | Minneapolis, MN  | NorthGate Brewing | 26  | 2691 | Maggie's Leap | 16 oz. | Milk / Sweet Stout |
| 4.8% | Minneapolis, MN  | NorthGate Brewing | 19  | 2690 | Wall's End    | 16 oz. | English Brown Ale  |
| 6.0% | Minneapolis, MN  | NorthGate Brewing | 38  | 2689 | Pumpion       | 16 oz. | Pumpkin Ale        |
Beers & Breweries: Two Observational Units
The most flagrant problem with the scraped dataset is that there are multiple observational units in the table: beers and breweries. The solution for this is to create two separate datasets.
The `breweries` dataset is straightforward to create. We simply select the relevant columns from our scraped dataset, remove the duplicates, and then reset the index. The reason we reset the index is to assign a unique identifier to each brewery. This will be useful when we create the `beers` dataset, because we'll be able to associate each beer with a `brewery_id`.
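A minimal sketch of that step, using a small stand-in for the scraped dataset (the column names and sample rows are assumptions based on the tables in this post):

```python
import pandas as pd

# A small stand-in for the scraped dataset (assumed columns).
df = pd.DataFrame({
    "brewery_location": ["Minneapolis, MN", "Minneapolis, MN", "Louisville, KY"],
    "brewery_name": ["NorthGate Brewing", "NorthGate Brewing",
                     "Against the Grain Brewery"],
    "name": ["Get Together", "Maggie's Leap", "35K"],
})

# Select the brewery columns, drop the duplicate rows, and reset the
# index so each brewery gets a unique identifier.
breweries = (df[["brewery_location", "brewery_name"]]
             .drop_duplicates()
             .reset_index(drop=True))
breweries["id"] = breweries.index
```

Each brewery now appears exactly once, with its row index doubling as its `id`.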
|   | brewery_location  | brewery_name              | id |
|---|-------------------|---------------------------|----|
| 0 | Minneapolis, MN   | NorthGate Brewing         | 0  |
| 1 | Louisville, KY    | Against the Grain Brewery | 1  |
| 2 | Framingham, MA    | Jack's Abby Craft Lagers  | 2  |
| 3 | San Diego, CA     | Mike Hess Brewing Company | 3  |
| 4 | San Francisco, CA | Fort Point Beer Company   | 4  |
In the `beers` dataset, we want to replace all of the repetitive brewery information, such as the location and name, with a `brewery_id`. Previously, we created a dataset of breweries where each brewery has a unique `id`. It's now time to use this identifier.

To match this `brewery_id` with the appropriate beer, we use the `merge` function in pandas. This function is similar to a `join` in SQL. We pass both the scraped dataset and the `breweries` dataset as function parameters, and we specify that we want to join both of these datasets on the `brewery_location`. This will result in a new DataFrame.
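A sketch of that merge, again on small stand-in DataFrames (the column names are assumptions):

```python
import pandas as pd

# Stand-ins for the scraped dataset and the breweries dataset.
scraped = pd.DataFrame({
    "name": ["Get Together", "35K"],
    "brewery_location": ["Minneapolis, MN", "Louisville, KY"],
    "brewery_name": ["NorthGate Brewing", "Against the Grain Brewery"],
})
breweries = pd.DataFrame({
    "brewery_location": ["Minneapolis, MN", "Louisville, KY"],
    "id": [0, 1],
})

# Join the two datasets on brewery_location (like a SQL join), keep the
# brewery's id as brewery_id, and drop the now-redundant brewery columns.
beers = (scraped
         .merge(breweries, on="brewery_location")
         .rename(columns={"id": "brewery_id"})
         .drop(columns=["brewery_location", "brewery_name"]))
```

Each beer now carries only a `brewery_id`, and the brewery details live in a single place.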
|   | abv   | ibu | id   | name                | style                          | brewery_id | ounces |
|---|-------|-----|------|---------------------|--------------------------------|------------|--------|
| 0 | 0.050 | NaN | 1436 | Pub Beer            | American Pale Lager            | 408        | 12.0   |
| 1 | 0.066 | NaN | 2265 | Devil's Cup         | American Pale Ale (APA)        | 177        | 12.0   |
| 2 | 0.071 | NaN | 2264 | Rise of the Phoenix | American IPA                   | 177        | 12.0   |
| 3 | 0.090 | NaN | 2263 | Sinister            | American Double / Imperial IPA | 177        | 12.0   |
| 4 | 0.075 | NaN | 2262 | Sex and Candy       | American IPA                   | 177        | 12.0   |
City and State: Two Variables
In the `breweries` dataset, the `brewery_location` column contains two variables: the `city` and the `state` in which the brewery is located. To respect the tidy principles, we want to isolate each variable in its own column.
Technically, separating those two variables can be accomplished using the `split` function. This function splits the string it's called on every time it encounters the delimiter passed as a parameter. In this case, our delimiter is a comma, so the function splits the string at every comma it encounters. The value returned by the function is a list of strings.
To run the `split` function on each value of the `brewery_location` Series, we use the `apply` function. This function repeatedly runs code on each value of a Series. In this case, the code we run is a lambda. This anonymous function splits the `brewery_location` string and returns the appropriate value, the `state`, from the list of returned strings.
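The split-and-apply step can be sketched like this (the sample rows are illustrative):

```python
import pandas as pd

breweries = pd.DataFrame({
    "brewery_location": ["Minneapolis, MN", "Louisville, KY"],
})

# Split "City, ST" on the comma; each lambda picks out one piece of the
# resulting list. strip() removes the space left after the comma.
breweries["city"] = breweries["brewery_location"].apply(
    lambda location: location.split(",")[0])
breweries["state"] = breweries["brewery_location"].apply(
    lambda location: location.split(",")[1].strip())
```

The same pattern, with index `[0]`, fills the `city` column.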
|   | brewery_name              | city          | state | id |
|---|---------------------------|---------------|-------|----|
| 1 | Against the Grain Brewery | Louisville    | KY    | 1  |
| 2 | Jack's Abby Craft Lagers  | Framingham    | MA    | 2  |
| 3 | Mike Hess Brewing Company | San Diego     | CA    | 3  |
| 4 | Fort Point Beer Company   | San Francisco | CA    | 4  |
ABV and IBU: Strings to Numerical Values
A few columns that should hold numerical values are stored as strings. The `abv` column, which represents the alcohol by volume, is stored as a string with a "%" instead of a floating-point number between 0 and 1. The `ibu` values are also stored as strings when they should be integers: some values are numbers while others are the string "N/A".
To fix this, I’ll
apply some custom functions I created to clean up the data. You’ll notice that I am using
except. The reason I am doing this is that I know my code will fail when it will encounter a string with letters. Therefore, if it fails due to
ValueError, I return
None. In the DataFrame,
None will be transformed to
NaNwhich indicates missing data.
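A sketch of such cleanup functions, assuming a "%" suffix on `abv` and "N/A" markers in `ibu` (the function names are my own, not from the original code):

```python
import pandas as pd

def string_pct_to_float(value):
    """Turn a string like '4.5%' into 0.045; return None on bad input."""
    try:
        return float(value.strip("%")) / 100
    except ValueError:
        # Letters (e.g. "N/A") make float() fail; None becomes NaN.
        return None

def string_to_int(value):
    """Turn a string like '50' into 50; return None on bad input."""
    try:
        return int(value)
    except ValueError:
        return None

beers = pd.DataFrame({"abv": ["4.5%", "N/A"], "ibu": ["50", "N/A"]})
beers["abv"] = beers["abv"].apply(string_pct_to_float)
beers["ibu"] = beers["ibu"].apply(string_to_int)
```

Any value that can't be parsed ends up as missing data instead of crashing the run.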
Size to Ounces: Strip out the measurement unit
Another problem with our dataset is that the `size` column contains a mix of numerical values and letters, where the letters indicate the measurement unit.
As I skimmed through the `size` column, I noticed the way the measurement unit was written was not consistent. Sometimes ounce was spelled `OZ.`, etc. In order to list all of the possible values we could be dealing with, I transformed the `size` column into a `set` and printed it. A `set` is an unordered collection that contains only unique values.
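For instance, with a few illustrative size strings (these specific variants are assumptions, not the post's actual output):

```python
import pandas as pd

sizes = pd.Series(["16 oz.", "16 OZ.", "12 oz.", "12 ounce", "16 oz."])

# A set keeps only the unique values, revealing every spelling variant.
unique_sizes = set(sizes)
print(unique_sizes)
```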
Now that we know the set of possible values we can be dealing with, we can write some code to parse the string and transform it into a numerical value. To do this, I used a regular expression that matches the first number encountered.
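One way to write that parser, with a regular expression that grabs the first number in the string (the exact pattern is my own sketch):

```python
import re
import pandas as pd

def extract_ounces(value):
    """Match the first number in the string; None if no number is found."""
    match = re.search(r"\d+\.?\d*", value)
    return float(match.group(0)) if match else None

beers = pd.DataFrame({"size": ["16 oz.", "12 OZ.", "12 ounce", "N/A"]})
beers["ounces"] = beers["size"].apply(extract_ounces)
```

Whatever the unit's spelling, only the leading number survives, as a float.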
|   | abv   | ibu | id   | name                | style                          | brewery_id | ounces |
|---|-------|-----|------|---------------------|--------------------------------|------------|--------|
| 0 | 0.050 | N/A | 1436 | Pub Beer            | American Pale Lager            | 408        | 12.0   |
| 1 | 0.066 | N/A | 2265 | Devil's Cup         | American Pale Ale (APA)        | 177        | 12.0   |
| 2 | 0.071 | N/A | 2264 | Rise of the Phoenix | American IPA                   | 177        | 12.0   |
| 3 | 0.090 | N/A | 2263 | Sinister            | American Double / Imperial IPA | 177        | 12.0   |
| 4 | 0.075 | N/A | 2262 | Sex and Candy       | American IPA                   | 177        | 12.0   |
Ready for exploration
We now have two clean datasets: `beers` and `breweries`. Each column has been cleaned up to respect the tidy principles. The code and the data for this post are available here on GitHub. We are now ready to start exploring those datasets.
Bio: Jean-Nicholas Hould is a Data Scientist from Montreal, Canada. Author at JeanNicholasHould.com.
Original. Reposted with permission.