Trifacta – Wrangling US Flight Data, part 2

This post shows how to use Trifacta to clean the data and enrich it with airport geo-locations and airline names, including filling missing values, and doing a lookup from another dataset. We also learn which is the best airline at O’Hare airport.

Analysis in Tableau

Now that the wrangling process is complete, I pull up the transformation output in Tableau to further analyze this prepared data.

In reviewing the data, I notice flights in February that departed from the North-East and Midwest were more delayed on average than flights that departed in other parts of the continental US. I’m able to plot the different airports on a map because I have combined the air traffic data with the airport geo-locations using Trifacta.

Trifacta Screen 12 - Tableau

For airports with more than 5,000 flights in February 2015, the NYC airports (JFK & LaGuardia) have the highest average delay with more than 24 minutes. Chicago O’Hare was problematic because it has a very high volume of flights and the average delay was also more than 23 minutes.

Trifacta Screen 13 - Tableau - Airport Delays

Drilling further into O’Hare, I take a quick look at the number of flights and the average delay by carrier. Whereas Frontier Airlines has the longest average delay with almost 42 minutes, they had less than 500 flights in February 2015 at O’Hare airport. Envoy Air, on the other hand, had over 5,000 flights and an average delay of over 26 minutes.

Trifacta Screen 14 - Tableau - O'Hare Delays by Carrier

Next time you’re planning to fly out of Chicago O’Hare in February, you’ll want to make sure you schedule plenty of time for potential delays and also pick the right airline if you have options.

Wrap Up

To recap, in this blog series, I’ve analyzed February 2015 flight on-time performance data from the Bureau of Transportation Statistics. I’ve cleaned, reformatted and enriched the dataset using Trifacta and analyzed the result in Tableau.

With Trifacta, I have:

  • Cleaned up quotes in several columns
  • Removed unrelated and redundant columns
  • Removed the state part of origin and destination city names
  • Reformatted the arrival and departure timestamps
  • Replaced the cancellation code with an easy-to-understand description
  • Conditionally filled in missing values that can be calculated from other columns
  • Changed numeric columns that represent Boolean values into Boolean columns
  • Performed an automatic lookup using another dataset
  • Joined airport geo-location that I get by wrangling DBpedia data

With Tableau, I have explored various aspects of the data, including:

  • The geospatial distribution of departure delays
  • The departure delays at airports with more than 5,000 flights
  • The departure delays by carrier for Chicago O’Hare airport

Feel free to try this exercise out for yourself using the free trial of Trifacta and downloading the February 2015 on-time performance data from the Bureau of Transportation Statistics.

Happy wrangling!

Bio: Lars Grammel, @lgrammel is a consultant software engineer at Trifacta.