Topics: AI | Data Science | Data Visualization | Deep Learning | Machine Learning | NLP | Python | R | Statistics

KDnuggets Home » News » 2015 » May » Opinions, Interviews, Reports » Trifacta – Wrangling US Flight Data

Trifacta – Wrangling US Flight Data

A useful case study shows how Trifacta can clean and analyze US Flight data, including cleaning up markup, removing unrelated and redundant columns, cleaning geographic names and more.

Reformatting timestamps

Exploring the data further using a combination of column details and column browser, I find that the scheduled departure and arrival time columns are using the format ‘hhmm’:

The interpretation of a timestamp as a number can lead to analysis problems like the gaps in the histogram above. I want to change this into ‘hh:mm’ to enable easy recognition in downstream tools. I perform a long-click between the 2nd and the 3rd character of an example scheduled departure time value:

Then I click the headers of the two new columns to get a merge transform suggestion. The merged column is shown at the right side of the data grid.
Next, I modify this transform in the editor to insert the colon and take a look the merged column in the column details. Trifacta now interprets the column as a time and shows hour-of-day and minute-of-hour patterns.
After dropping the source columns and renaming the merged column, I repeat the same steps for the arrival time.

Setting the cancellation reason

The cryptic cancellation code is somewhat unintuitive. I lookup the meaning of cancellation code on the BTS website. Since there are only four different codes, I write a few ‘set’ transforms to use a descriptive name instead of a code, and then rename to column to ‘CancellationReason’.
The data already looks pretty good now. So far, I have
  • Cleaned up quotes in several columns
  • Removed unrelated and redundant columns
  • Removed the state part of origin and destination city names
  • Reformatted the arrival and departure timestamps
  • Replaced the cancellation code with an easy-to-understand description

In the second part of the series, I will continue to cleanup the data and enrich it with airport geo-locations and airline names:
  • Conditionally fill in missing values that can be calculated from other columns
  • Change numeric columns that represent boolean values into boolean columns
  • Perform an automatic lookup using another dataset
  • Join airport geo-location that I get by wrangling DBpedia data

Finally, I will also present a few results from my analysis of the prepared data in Tableau. Stay tuned.

Bio: Lars Grammel, @lgrammel is a consultant software engineer at Trifacta.


By subscribing you accept KDnuggets Privacy Policy