Follow Gregory Piatetsky, No. 1 on LinkedIn Top Voices in Data Science & Analytics

KDnuggets Home » News » 2015 » May » Opinions, Interviews, Reports » Trifacta – Wrangling US Flight Data, part 2 ( 15:n17 )

Trifacta – Wrangling US Flight Data, part 2


This post shows how to use Trifacta to clean the data and enrich it with airport geo-locations and airline names, including filling missing values, and doing a lookup from another dataset. We also learn which is the best airline at O’Hare airport.



By Lars Grammel, Trifacta.

In part 1 of this blog series, I shared how to locate and begin wrangling on-time performance data for flights by US airlines that took place in February 2015.

In this post, I will continue to cleanup the data and enrich it with airport geo-locations and airline names:

  • Conditionally fill in missing values that can be calculated from other columns
  • Change the data type of numeric columns that represent boolean values
  • Perform an automatic lookup using another dataset
  • Enrich with airport geo-location data from DBpedia

Finally, I will also present a few results from my analysis of the prepared data in Tableau.

Filling in arrival delays and cancellation reasons for diverted flights
The data set documentation mentioned that the cancellation reason, the actual elapsed time and the arrival delay are empty for diverted flights. I quickly verify this by selecting diverted flights that did not reach their destination and filtering the data grid to show only those rows:

Trifacta Screen 1

To enable easy analysis in Tableau, I want to treat diverted flights similar to regular flights. If they did not reach their destination, they should have a cancellation code, and if they reach their destination, there should be a correct arrival delay and elapsed time.

First, I author a transform that sets the cancellation code to ‘Diverted’ for flights that were diverted, but did not reach their destination:

Trifacta Screen 2

Then I fill in the actual elapsed time using a similar approach, calculate the arrival delay for those rows and remove the diversion-related columns afterwards:

Trifacta Screen 3

Change Cancelled and Diverted into Boolean columns
The ‘Cancelled’ and ‘Diverted’ columns contain Boolean values, but are using 1.00 and 0.00 to encode them. To make sure they show up as a Boolean category in Tableau, I change their values to true/false:

Trifacta Screen 4

Looking up the airline name


Sign Up