Trifacta – Wrangling US Flight Data, part 2
This post shows how to use Trifacta to clean the data and enrich it with airport geo-locations and airline names, including filling missing values, and doing a lookup from another dataset. We also learn which is the best airline at O’Hare airport.
Looking up the airline name
The airline names are somewhat cryptic. Fortunately, the BTS provides a lookup table that I download and wrangle it so it can be used to enrich the data set by matching airline codes with actual airline names.
I then lookup the code column of the airline dataset against the UniqueCarrier column of the on-time dataset:
After removing the old UniqueCarrier column and renaming the new column to Carrier, the dataset contains the full carrier names:
Wrangling and joining the airport geo-locations
To enable geospatial analysis in Tableau, it would be great to have the actual airport locations, i.e. latitude and longitude. However, they are not part of the original dataset. Fortunately, almost all airports are described in Wikipedia, and there is a DBpedia table with airport data that can be downloaded as CSV.
The airport dataset took about 30 minutes to wrangle. I’ll spare you the details here, but this is how it looked like before and after:
Now I join the airport geo-locations into the original dataset. I use a left outer join so I don’t drop columns for which there are no airport geolocations and verify that the number of output rows matches the number of input rows in our on-time data table.
Then I rename latitude to originLatitude and longitude to originLongitude and repeat the same steps for the destination airport.
After all these wrangling steps, this is what our dataset looks like:
I run a Hadoop job to execute the script that I’ve constructed on the whole February 2015 dataset and export the result as a .csv file, which I load into Tableau.