KDnuggets Home » News » 2017 » Feb » Tutorials, Overviews » Web Scraping for Dataset Curation, Part 2: Tidying Craft Beer Data ( 17:n06 )

Web Scraping for Dataset Curation, Part 2: Tidying Craft Beer Data


 
 
http likes 39

This is the second part in a 2 part series on curating data from the web. The first part focused on web scraping, while this post details the process of tidying scraped data after the fact.



By Jean-Nicholas Hould, JeanNicholasHould.com.

Editor's note: This post is the second in a 2 part series. Read the first part here.

Tidy up

 
In the scraping part, I didn’t bother to clean up the data. There are a few reasons for this. First, pandas is my tool of choice to manipulate the data. Secondly, I wanted to separate the concerns: scraping and cleaning.

In a previous post, I explained the concept of “tidy data”. This concept was initially presented by Hadley Wickham who describes a tidy dataset as one having the following attributes:

  • Each variable forms a column and contains values
  • Each observation forms a row
  • Each type of observational unit forms a table

If you take a moment to look at the dataset we scraped so far, you will notice it’s messy. It does not conform to the properties of a “tidy dataset”. You will notice there are multiple observational units in the same table (beers and breweries), there are multiple variables in a single column (city and state), etc. In this section of the post, we’ll clean up the dataset.

abv brewery_location brewery_name ibus id name size style
4.5% Minneapolis, MN NorthGate Brewing 50 2692 Get Together 16 oz. American IPA
4.9% Minneapolis, MN NorthGate Brewing 26 2691 Maggie's Leap 16 oz. Milk / Sweet Stout
4.8% Minneapolis, MN NorthGate Brewing 19 2690 Wall's End 16 oz. English Brown Ale
6.0% Minneapolis, MN NorthGate Brewing 38 2689 Pumpion 16 oz. Pumpkin Ale
... ... ... ... ... ... ... ...


Beers & Breweries: Two Observational Units

 
The most flagrant problem with the scraped dataset is that there are multiple observation units in the table: beers and breweries. The solution for this is to create two separate dataset.

Breweries

The breweries dataset is straightforward to create. We simply select the relevant columns in our scraped dataset, remove the duplicates and then reset the index. The reason we reset the index is to assign a unique identifier to each brewery. This will be useful when we’ll create the beer dataset because we’ll be able to associate each beer with a `brewery_id.

brewery_location brewery_name id
0 Minneapolis, MN NorthGate Brewing 0
1 Louisville, KY Against the Grain Brewery 1
2 Framingham, MA Jack's Abby Craft Lagers 2
3 San Diego, CA Mike Hess Brewing Company 3
4 San Francisco, CA Fort Point Beer Company 4


Beers

For the beers dataset, we want to replace all of the repetitive brewery information, such as the location and name, with a brewery_id. Previously, we created a dataset of breweries where each brewery has a unique id. It’s now time to use this identifier.

To match this brewery_id with the appropriate beer, we use the merge function in pandas. This function is similar to a join in SQL. In this function, we pass both the scraped dataset and the breweries dataset as function parameters. We also specify that we want to join both of these datasets on the brewery_name and brewery_location. This will result in a new DataFrame.

abv ibu id name style brewery_id ounces
0 0.050 NaN 1436 Pub Beer American Pale Lager 408 12.0
1 0.066 NaN 2265 Devil's Cup American Pale Ale (APA) 177 12.0
2 0.071 NaN 2264 Rise of the Phoenix American IPA 177 12.0
3 0.090 NaN 2263 Sinister American Double / Imperial IPA 177 12.0
4 0.075 NaN 2262 Sex and Candy American IPA 177 12.0


City and State: Two Variables

 
In the breweries dataset, the brewery_location column contains two variables: the city and the state in which the brewery is located in. To respect the tidy principles, we want to isolate each variable in a single column.

Technically, separating those two variables can be accomplished using the splitfunction. This function will split the string it’s called on every time it encounters the delimiter passed as a parameter. In this case, our delimiter is a comma. The function will split the string every time it encounters this comma. The value returned by the function is a list of strings.

To run the split function on each value of the brewery_location Series, we use the apply function. This function will repetitively run code on each value of a Series. In this case, the code we run is a lamba. This anonymous function splits the brweery_location string and returns the appropriate value, the city or state, in the list of returned strings.

name city state id
0 NorthGate Brewing Minneapolis MN 0
1 Against the Grain Brewery Louisville KY 1
2 Jack's Abby Craft Lagers Framingham MA 2
3 Mike Hess Brewing Company San Diego CA 3
4 Fort Point Beer Company San Francisco CA 4


A few examples of strings to numerical values

 
The abv and ibu should be numerical values but they contain some letters. The abv column, which represents the alcohol by volume, is stored as a string with a “%” instead of a floating number between 0 and 1. Secondly, the ibu values are stored as strings while it should be integers. Some values are numbers while some others are a string “N/A”.

To fix this, I’ll apply some custom functions I created to clean up the data. You’ll notice that I am using try and except. The reason I am doing this is that I know my code will fail when it will encounter a string with letters. Therefore, if it fails due to ValueError, I return None. In the DataFrame, None will be transformed to NaNwhich indicates missing data.

Size to Ounces: Strip out the measurement unit

 
Another problem with our dataset is that the size column contains a mix of numerical values and letters. The letters are used to indicate the measurement unit.

As I skimmed through the size column, I noticed the way the measurement unit was written was not consistent. Sometimes ounce was spelled oz., ounce, OZ., etc. In order to list all of the possible values we could be dealing with, I transformed the size column into a set and printed it. A set is a list that contains only unique values.

Now that we know the set of possible values we can be dealing with, we can write some code to parse the string and transform it into a numerical value. To do this, I used a regular expression that matches the first number encountered.

abv ibu id name style brewery_id ounces
0 0.050 N/A 1436 Pub Beer American Pale Lager 408 12.0
1 0.066 N/A 2265 Devil's Cup American Pale Ale (APA) 177 12.0
2 0.071 N/A 2264 Rise of the Phoenix American IPA 177 12.0
3 0.090 N/A 2263 Sinister American Double / Imperial IPA 177 12.0
4 0.075 N/A 2262 Sex and Candy American IPA 177 12.0


Ready for exploration

 
We now have two clean datasets: beers and breweries. Each column has been cleaned up to respect the tidy principles. The code and the data for this post is available here on Github. We are now ready to start exploring those datasets.

Bio: Jean-Nicholas Hould is a Data Scientist from Montreal, Canada. Author at JeanNicholasHould.com.

Original. Reposted with permission.

Related: