Trifacta – Wrangling US Flight Data, part 2
This post shows how to use Trifacta to clean the data and enrich it with airport geo-locations and airline names, including filling missing values, and doing a lookup from another dataset. We also learn which is the best airline at O’Hare airport.
By Lars Grammel, Trifacta.
In this post, I will continue to cleanup the data and enrich it with airport geo-locations and airline names:
- Conditionally fill in missing values that can be calculated from other columns
- Change the data type of numeric columns that represent boolean values
- Perform an automatic lookup using another dataset
- Enrich with airport geo-location data from DBpedia
Finally, I will also present a few results from my analysis of the prepared data in Tableau.
Filling in arrival delays and cancellation reasons for diverted flights
The data set documentation mentioned that the cancellation reason, the actual elapsed time and the arrival delay are empty for diverted flights. I quickly verify this by selecting diverted flights that did not reach their destination and filtering the data grid to show only those rows:
To enable easy analysis in Tableau, I want to treat diverted flights similar to regular flights. If they did not reach their destination, they should have a cancellation code, and if they reach their destination, there should be a correct arrival delay and elapsed time.
First, I author a transform that sets the cancellation code to ‘Diverted’ for flights that were diverted, but did not reach their destination:
Then I fill in the actual elapsed time using a similar approach, calculate the arrival delay for those rows and remove the diversion-related columns afterwards:
Change Cancelled and Diverted into Boolean columns
The ‘Cancelled’ and ‘Diverted’ columns contain Boolean values, but are using 1.00 and 0.00 to encode them. To make sure they show up as a Boolean category in Tableau, I change their values to true/false:
Looking up the airline name