Wrangling Public Bike Share Data with The Free Trial of Trifacta

A free trial of Trifacta is a good opportunity for data analysts to start wrangle the different shapes and sizes of data sets. We give an example of wrangling Bay Area Bike Share data to better understand biking around San Francisco.

By Wei Zheng (VP Products, Trifacta).

President Obama recently appointed former Trifacta advisor DJ Patil as the nation’s first Chief Data Scientist. As part of his initial charter, Patil has called upon all data scientists and analysts to help wrangle and analyze publicly available data sets, such as the ones posted on data.gov, with the goal of enabling governments, organizations and individuals to use this data to better understand and serve our communities.

Given that we recently announced the free trial of Trifacta, it’s a great time for analysts around the world to start wrangling the different shapes and sizes of these public data sets for analysis.

In the spirit of Patil’s call to action, I went ahead and began working on one of these public data sets using the Trifacta trial to demonstrate some of the powerful transformation features we’ve built to make data preparation a much more productive process. So don’t be shy and give Trifacta a try today!

The Data
For this example, I worked with some of the freely available Bike Share data in my local community. The data is provided by Bay Area Bike Share(BABS). BABS is the region’s largest bike sharing system with 700 bikes and 70 stations across the region, intended to provide Bay Area residents and visitors with an additional transportation option for getting around, while improving air quality for the surrounding areas.

Analysis Goal
By examining the most popular bike routes and stations, I would love to gain insights to better help BABS plan its bike inventory and rotation of bikes across San Francisco.

To get the data ready for analysis, I will load the raw trip data into Trifacta.

trifacta screen1 Trifacta breaks the file into rows and columns automatically, and infers header metadata from the raw CSV file. Using type inference and visual histograms, I can get a good overview of the various attributes contained in the bike trips data.

In order to analyze bike share data for just that of San Francisco, I can use the zip code column to help me filter. Unfortunately the data isn’t 100% clean, as indicated by the data quality bar for the Zip_code column. Clicking into the red and grey areas of the quality bar allows me to see the extent of the inconsistent data.

Trifacta Zip Code

After browsing the data available to me on the BABS website, I see there is a separate bike station data file, maybe I can use it to augment and enrich my trip data with data on each station. Upon loading the bike station data into Trifacta, I can see it includes standardized city and zip code information for each one of the bike stations.

trifacta screen2 Using a simple attribute lookup transformation on the start_terminal column, I can now cleanse and standardize the zip code as well as enrich my trips data with city and other geo information.

trifacta screen3

By highlighting “San Francisco” in the visual histogram of the augmented data set, Trifacta uses its unique Predictive Interaction™ technology to provide transform suggestions that makes it very easy to filter to just the data that I care about.

trifacta screen4 Selecting the top suggestion and adding it to script allows me to filter the trips data to only trips that originated in San Francisco.

Now I am interested in seeing which stations and routes are the most popular and the corresponding bike inventory for those popular stations. For this, I can use Trifacta’s Visual Data Profiling to help me. By comparing the Start_Station attribute against the End_Station attribute, I can see an interactive heat map that allows me to further explore my data. For example, highlighting the most common starting station “San Francisco Caltrain (Townsend at 4th)” shows me that the most common ending station is “Howard at 2nd.”

trifacta screen5 I can do this more systematically by aggregating via start and stop stations. Trifacta provides a visual aggregator tool to make this easy.

trifacta screen6 Now that I have identified the most popular routes, let’s map them to the bike inventory data to see which stations (if any) need additional bike inventory to satisfy the riders. Since the bike inventory data is a rather large file (over 18M rows), Trifacta allows me to intelligently sample a subset of the data based on station IDs.

Then, I am able to compute the average bike inventory for each one of the stations located in San Francisco. trifacta-screen7

Using Trifacta join key discovery, I am able to quickly join the station inventory data with my prepared trip data ordered by the most popular routes. As shown in the screenshot below, Trifacta recognizes that “Start_Terminal” and “station_id” have similar content characteristics and recommends that I use those attributes as my join key.

Now that I have gathered what I am looking for, I can kick off a job that will use the script I just created to transform the entire data set. Upon completion, I am able to export the results of this transformation into any number of visualization or analytics tools such as Tableau. As shown below, the job results page lets me choose my export options and also provides detailed stats over entirety of the results.
This is just one of many fun and useful transformation exercises that can be done using public data and the FREE Trifacta trial. If you are interested in trying it for yourself, you can sign up at: www.trifacta.com/trial

Wei Zheng Wei Zheng is VP of Products at Trifacta, where she combines her passion for technology with experience in Enterprise Software to define and shape Trifacta product offerings. Having founded several startups of her own, Wei believes strongly in innovative technology that solves real world business problems. Most recently, she led product management efforts at Informatica, where she helped launch several new solutions including its Hadoop and data virtualization products.