Trifacta – Wrangling US Flight Data
A useful case study shows how Trifacta can clean and analyze US Flight data, including cleaning up markup, removing unrelated and redundant columns, cleaning geographic names and more.
In the wake of the recent story on how Qantas Airlines was at risk of losing valuable landing slots at London’s busy Heathrow airport, I became interested in examining flight data across the US to see if I could determine what airlines or airports had the most delays.
Finding air-traffic data turns out to be easy. The Bureau of Transportation Statistics (BTS) provides the on-time performance data of US airlines dating back to 1987. For the purposes of this analysis, I decided to focus only on recent performance, so I downloaded the files for the flights taking place in February 2015.
You can follow along yourself by downloading the files here and requesting access to a trial account of Trifacta.
I now want to get an overview of the on-line performance data and to re-structure it for my analysis. In this first blog post, I will walk through how I used Trifacta to prepare the data in the following way:
- Clean up markup in data such as quotes
- Remove unrelated and redundant columns
- Remove the state part of city names
- Reformat timestamps
- Perform a manual lookup using the set transform
In a second post, I will enrich and further clean the data with Trifacta and then present a few results from my analysis of the prepared data in Tableau.Upon registering the data within Trifacta, the system quickly examines the structure of the file and automatically infers how to split the data set into rows and columns and promote the header row of the file.
Cleaning Up Quotations
Following the initial transformation steps, I notice that there were a few columns that had quotes around the text. By selecting the quote character, Trifacta generates transform suggestions to clean up the quotes, including the one I want to use to remove quotations from all of the columns:
Removing unrelated and redundant columns
This flight performance dataset is incredibly wide with over 111 columns in the fill. Instead of dealing with all of these columns, I quickly remove the ones that are not relevant for my analysis. To do this, I use Trifacta column browser and column details views to explore the different columns, and then select columns that I want to remove. For example, the data contains a FlightDate column, making the year, month etc. column redundant:
Similarly, I remove other redundant and unrelated columns after inspecting them quickly using the column browser and the column details features. At the end of this process, I have reduced the number of columns to 32, making the whole data set more manageable.
Removing the state from city names
Inspecting the OriginCityName columns shows that the state name is included in the city name. Below the city name, there is also a origin state column that I quickly open in the column details view.