
Making Sense of Public Data – Wrangling Jeopardy – Part 2





By Alon Bartur and Will Davis (Trifacta), Oct 2014.

This post picks up where Part 1 left off and describes the remaining steps of the data transformation process, detailing how we used Trifacta to structure, clean, enrich and distill Jeopardy data for analysis.

Structure

Data that lacks a traditional tabular structure can be difficult to understand at first. In end-user tools, hundreds, if not thousands, of rows can appear as a single run-on string of characters. With semi-structured data like this, analysts normally need to know the content in advance to spot the delimiters that reveal its formatting. Rather than forcing the analyst to work this out manually, Trifacta inspects the data and infers as many delimiter patterns as it can before displaying it, proactively recommending an initial structure. Automating this step lets an analyst understand more quickly what is in the data and how it may need to be refined for analysis.

Fig 1. Initial inferred structuring of Jeopardy Contestants data set
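To make the idea of delimiter inference concrete, here is a minimal Python sketch. It is not how Trifacta works internally; it only illustrates one simple heuristic: pick the candidate delimiter that splits every line into the same number of fields. The function name, the candidate list and the sample rows are all hypothetical.

```python
# Illustrative heuristic only; not Trifacta's actual inference logic.
CANDIDATES = [",", "\t", "|", ";"]  # assumed set of delimiters to try


def infer_delimiter(lines, candidates=CANDIDATES):
    """Return the candidate that splits all non-empty lines into the same
    number of fields (more than one), preferring the most fields.
    Returns None when no candidate splits the lines consistently."""
    best, best_fields = None, 1
    for delim in candidates:
        counts = {line.count(delim) for line in lines if line.strip()}
        if len(counts) == 1:  # every line has the same number of delimiters
            fields = counts.pop() + 1
            if fields > best_fields:
                best, best_fields = delim, fields
    return best


# Made-up sample rows in a pipe-delimited layout.
sample = [
    "4680|Jane Doe|a teacher from Austin, Texas",
    "4680|John Roe|an engineer from Boise, Idaho",
]
delimiter = infer_delimiter(sample)
rows = [line.split(delimiter) for line in sample]
```

Note that the comma also appears once per line in this sample, so the heuristic breaks the tie by preferring the delimiter that yields more fields. Real data usually needs more signals than this.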

After gaining an understanding of the data profile, we began pulling out attributes or features relevant to our analysis.

For the contestants data set, we worked through the contestants’ free-text bios, splitting the information originally provided in a single column into separate columns for each contestant’s name, state and occupation.

Fig 2. Example of extracting Jeopardy contestants’ occupation into a separate column
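Outside of Trifacta, a similar split could be sketched with a regular expression. The example below assumes each bio follows the pattern "Name, a(n) occupation from City, State"; that layout, the sample bios and the helper name `parse_bio` are assumptions made for illustration, and real bios will contain variations this pattern does not cover.

```python
import re

# Assumed bio layout: "<name>, a/an <occupation> from <city>, <state>"
BIO_PATTERN = re.compile(
    r"^(?P<name>[^,]+),\s+an?\s+(?P<occupation>.+?)\s+from\s+"
    r"(?P<city>[^,]+),\s+(?P<state>.+)$"
)


def parse_bio(bio):
    """Split a free-text bio into name, state and occupation.
    Fields are None when the bio does not match the assumed layout."""
    match = BIO_PATTERN.match(bio.strip())
    if match is None:
        return {"name": None, "state": None, "occupation": None}
    return {key: match.group(key) for key in ("name", "state", "occupation")}


# Made-up example bio.
parsed = parse_bio("Jane Doe, a high school teacher from Austin, Texas")
```

Returning None for unmatched bios, rather than raising an error, keeps the problem rows visible so they can be reviewed and handled with an additional rule.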

We also prepared the data for aggregations and calculations, using the show number and contestant final score columns to create indicator columns not present in the original data set. This step, often called variable reshaping, lets us determine computationally whether a contestant won or lost their particular episode.

Fig 3. Jeopardy contestants data set (partial view) after the structuring process has been completed
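As a rough illustration of such an indicator column, the sketch below flags the contestant with the highest final score in each show. The field names `show_number`, `final_score` and `is_winner`, and the sample rows, are hypothetical. It also assumes that the highest final score identifies the winner, which glosses over ties.

```python
from collections import defaultdict


def add_winner_flag(contestants):
    """Return copies of the rows with an is_winner column: 1 if the row has
    the highest final score for its show, otherwise 0. Tied top scores are
    all flagged as winners."""
    top_score = defaultdict(lambda: float("-inf"))
    for row in contestants:
        show = row["show_number"]
        top_score[show] = max(top_score[show], row["final_score"])
    return [
        {**row, "is_winner": int(row["final_score"] == top_score[row["show_number"]])}
        for row in contestants
    ]


# Made-up rows for two shows.
contestants = [
    {"show_number": 4680, "name": "Jane Doe", "final_score": 21000},
    {"show_number": 4680, "name": "John Roe", "final_score": 8400},
    {"show_number": 4681, "name": "Ann Poe", "final_score": 5000},
    {"show_number": 4681, "name": "Jane Doe", "final_score": 12000},
]
flagged = add_winner_flag(contestants)
```

The first pass computes the top score per show and the second pass compares each row against it, so the input rows do not need to be sorted or grouped beforehand.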

It’s important to note that in Trifacta most of this structuring is done by interacting directly with the content of the data itself (such as clicking and dragging) and then selecting the appropriate recommended transformation for that segment of the data.

Clean

Since the Jeopardy questions data set originated from someone’s scraping efforts, there were some data quality issues that we needed to clean up. As shown in the screenshot below, we were able to use Trifacta to remove elements that downstream analytics tools don’t accept, such as formatting inconsistencies in numbers and names. This type of cleansing is commonly required for publicly gathered data, and it is often needed for tabular data as well, because of data entry mistakes or erroneous values introduced by data processing pipelines in large organizations.

The scraping process pulled in HTML tags for links on the scraped pages. To clean up these tags, we highlighted data elements visually and accepted Trifacta’s recommendations to build a transform that identifies and removes any text in that column enclosed by tags. The (partial) screenshot below shows how this was performed in Trifacta.

Fig 4. Cleaning Jeopardy Questions data set by removing HTML tags
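For readers who want to try a comparable cleanup in code, here is a small Python sketch. It assumes the goal is to drop the tag markup (anything between `<` and `>`) while keeping the visible link text, and it also decodes HTML entities. The function name `strip_tags` and the sample clue are invented for illustration. A regular expression is adequate for simple scraped link tags but is not a general HTML parser.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # anything enclosed in angle brackets


def strip_tags(text):
    """Remove HTML tag markup, decode entities such as &amp;, and collapse
    the whitespace left behind. The text between tags is kept."""
    without_tags = TAG_PATTERN.sub("", text)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


# Made-up clue containing a scraped link tag.
raw = '<a href="http://example.com/clue.jpg" target="_blank">This</a> landmark is seen here'
clean = strip_tags(raw)
```

Tags are stripped before entities are decoded so that an escaped sequence such as `&lt;b&gt;` in the clue text is preserved as literal characters rather than removed as if it were markup.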

Enrich

Before we could start answering questions, the two data sets for this Jeopardy analysis had to be blended. Trifacta’s interface lets you perform this blending fairly seamlessly: you simply open the Jeopardy contestants data set and connect it to the Jeopardy questions data set.



Fig 5. Enriching the data by blending the Jeopardy questions & Jeopardy contestants data sets

Once you determine the appropriate columns to match on and which columns to include in the final output data set, you’ve completed enriching the Jeopardy contestants data set with additional attributes or features of each show from the Jeopardy questions data set.
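Conceptually, this blending is a join on a shared key. The sketch below shows a minimal left join in plain Python, assuming both data sets carry a `show_number` column and that the questions data has already been reduced to one row of attributes per show. The field names and sample rows are hypothetical.

```python
def enrich(contestants, shows, key="show_number"):
    """Left join: add show-level attributes to each contestant row.
    Contestants with no matching show are kept unchanged."""
    shows_by_key = {show[key]: show for show in shows}
    enriched = []
    for row in contestants:
        extra = shows_by_key.get(row[key], {})
        # Contestant values win if both sides share a column name.
        enriched.append({**extra, **row})
    return enriched


# Made-up inputs: contestant rows and one row of attributes per show.
contestants = [
    {"show_number": 4680, "name": "Jane Doe"},
    {"show_number": 9999, "name": "John Roe"},
]
shows = [{"show_number": 4680, "air_date": "2004-12-31"}]
blended = enrich(contestants, shows)
```

A left join keeps every contestant even when no show matches, which makes gaps in the second data set easy to spot after blending.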

Distill

Depending on the type of analysis you’re performing, a data set can be prepared in a variety of ways to answer slightly different business questions. Whether you’re performing a lookup on a certain element in the data, creating indicator columns or blending different data sets, the business case for the data determines how it is prepared.

At the end of the data preparation process, you typically want to share your insights with a broader audience. This is where data analysis shifts into telling a story about your data. Data visualization tools from companies like Tableau have become robust platforms for sharing data stories with a wide audience. However, they often constrain the size and shape of the data they can accept, and they typically prefer structured, tabular formats. Trifacta can transform data for analysis and filter it down to a size that your data visualization tool of choice can support. Trifacta has optimized its integration with tools like Excel and Tableau by delivering data in output formats native to those platforms.
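As a simple illustration of distilling, the sketch below filters rows, keeps only selected columns and emits CSV, a tabular format that tools like Excel and Tableau can read. It does not represent Trifacta's native output formats. The function name, column names and sample rows are hypothetical.

```python
import csv
import io


def distill_to_csv(rows, columns, keep=lambda row: True):
    """Return CSV text containing only the given columns, for rows that
    satisfy the keep predicate."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        if keep(row):
            writer.writerow(row)
    return buffer.getvalue()


# Made-up prepared rows; keep winners only and drop the indicator column.
prepared = [
    {"show_number": 4680, "name": "Jane Doe", "final_score": 21000, "is_winner": 1},
    {"show_number": 4680, "name": "John Roe", "final_score": 8400, "is_winner": 0},
]
csv_text = distill_to_csv(
    prepared,
    columns=["show_number", "name", "final_score"],
    keep=lambda row: row["is_winner"] == 1,
)
```

Filtering and column selection both happen before the data is written, so the output stays small enough for a downstream tool with size limits.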

At Trifacta, we’re removing the bottleneck between organizations and the insights in their data by making data transformation easier. The Trifacta Data Transformation Platform enables data analysts, data scientists and anyone who regularly works with data to be more productive when transforming data for analysis.

Bio: Alon Bartur brings field experience in product management, alliances and sales engineering to Trifacta's product management team. Prior to joining Trifacta, Alon worked at both GoodData and Google. As Product Manager at Trifacta, Alon works closely with Trifacta customers and partners to drive the product roadmap and requirements for Trifacta's Data Transformation Platform.




Bio: Will Davis drives both content and product marketing efforts at Trifacta, having spent the past five years managing marketing initiatives for several high-growth data companies. Prior to Trifacta, Will worked with a variety of companies focused on data infrastructure, analytics and visualization, including GoodData, Greenplum and ClearStory Data. Will develops and executes Trifacta’s marketing and content strategies to expand business growth and brand awareness.

