
Making Sense of Public Data – Wrangling Jeopardy


Trifacta’s Alon Bartur & Will Davis detail their process for transforming or “wrangling” publicly available Jeopardy data found on the web for downstream analysis.



By Alon Bartur and Will Davis, Oct 2014.

Ever have dreams of achieving Jeopardy glory?

If so, you might be interested in knowing where the majority of Jeopardy contestants live, or what the average winning score per episode has been each year. A look at the detailed history of Jeopardy questions and contestants could surface stats such as:

  • More Jeopardy contestants come from California than from any other state (9.4% of total contestants), with the majority of these (5.7% of total contestants) clustered around Los Angeles near the Jeopardy studios
  • The average winning Jeopardy score has been trending upward at a rate of 68% a year. The largest inflection point occurred in November 2001, when the show doubled the minimum question value from $100 to $200

(Jeopardy contestants, geographically and over time)

One might think that this information lives in some secret database hidden deep within Jeopardy’s headquarters. Thankfully, it doesn’t; it just takes some web scraping and clever “data wrangling” to make sense of it all.

Finding the data is one thing; transforming that data into something useful for analysis is the “data wrangling” piece of the equation that typically ends up taking the most time. This is where things get bogged down, and sometimes where analysts give up, because the process of preparing data can be painful and slow. Below is a brief summary of the process we followed with Trifacta’s Data Transformation Platform to prepare raw Jeopardy data for analysis:

  • Acquire
  • Discover & Assess
  • Structure
  • Clean
  • Enrich
  • Distill

This same framework can be applied to any common data preparation scenario within an organization, such as transforming customer records, raw web logs or network traffic for analysis. Keep in mind that the different elements of this process are typically not sequential; practitioners often need to move back and forth between steps to reach the desired outcome.

Acquire

For the purposes of this experiment, we first found a publicly available Jeopardy Questions dataset to start working with. After assessing its content and potential for analysis, we opted to gather additional Jeopardy data to expand the breadth of the analysis. We discovered J! Archive, a site containing information on every Jeopardy episode ever aired, and extracted data on Jeopardy contestants using a web-scraping tool called Import.io.
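Import.io handles the scraping through its own point-and-click interface, so there is no code to show for that step. As a rough stand-in, pulling contestant text out of an episode page could be sketched with Python's standard-library HTML parser. Note that the markup below is an invented miniature for illustration, not the real j-archive.com page structure:

```python
from html.parser import HTMLParser

# Invented miniature of an episode page; the real j-archive.com markup differs.
page = '<p class="contestants">Alice, a teacher from Los Angeles, California</p>'

class ContestantParser(HTMLParser):
    """Collects the text inside <p class="contestants"> elements."""
    def __init__(self):
        super().__init__()
        self.in_bio = False
        self.bios = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "p" and ("class", "contestants") in attrs:
            self.in_bio = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_bio = False

    def handle_data(self, data):
        if self.in_bio:
            self.bios.append(data)

parser = ContestantParser()
parser.feed(page)
print(parser.bios)  # ['Alice, a teacher from Los Angeles, California']
```

A dedicated scraping tool does this across thousands of pages and emits a flat file, which is the form the contestant data took in the steps below.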

Discover & Assess

Once we collected the right data, we needed to figure out what was in each dataset to determine how we could leverage it for analysis. Initially, this involved developing an understanding of the structure and content of the data:

Structure:
  • The Jeopardy Questions data set came to us as a large JSON record
  • The Jeopardy Contestants data set was obtained by scraping j-archive.com with Import.io. The output arrived as a flat file, but much of its information was locked inside unstructured player bios. Additionally, each record contained three sets of columns, one per contestant, that needed to be reshaped to reach our final goal of one record per contestant.
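Trifacta performs this kind of reshaping interactively. A rough pandas equivalent of collapsing three per-contestant column sets into one record per contestant might look like the sketch below; the column names are hypothetical, since the scraped file's exact headers aren't shown here:

```python
import pandas as pd

# Hypothetical scraped layout: one row per episode, with three sets of
# contestant columns (name_1/bio_1/score_1, name_2/..., name_3/...).
episodes = pd.DataFrame({
    "show_number": [4680],
    "name_1": ["Alice"], "bio_1": ["a teacher from Los Angeles, California"], "score_1": [19200],
    "name_2": ["Bob"],   "bio_2": ["a lawyer from Austin, Texas"],            "score_2": [11000],
    "name_3": ["Carol"], "bio_3": ["a nurse from Miami, Florida"],            "score_3": [8000],
})

# wide_to_long melts the numbered column sets into one row per contestant;
# "seat" records which of the three column sets the row came from.
contestants = pd.wide_to_long(
    episodes, stubnames=["name", "bio", "score"],
    i="show_number", j="seat", sep="_"
).reset_index()

print(contestants[["show_number", "seat", "name", "score"]])
```

The result has one row per contestant per episode, which is the shape needed for contestant-level analysis.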

Content:
  • The Jeopardy Questions data set included every question and answer asked in the history of Jeopardy, along with each question’s air date, dollar amount, game number and round.
  • The Jeopardy Contestants data set was formatted as free-text contestant biographies containing information on where the contestants were from, their final scores, their occupations, their school (if it was college Jeopardy), etc.

Quickly understanding both the technical data profile (the structure) and the functional data profile (the content) has traditionally been a tall order for most analysis tools. Regardless of data size, this is a critical step in finding insights in any data.

So how does this work in practice?

Here’s how we used Trifacta’s Data Transformation Platform to rapidly gain an understanding of what’s in each of these Jeopardy datasets.

(Raw JSON of Jeopardy Questions dataset)

Jeopardy Questions:

When we open the Jeopardy Questions JSON file in Trifacta, we are immediately presented with some useful pieces of information. We’re shown an inferred set of steps that the tool has taken on its own to help get the data into a form we can better understand. With a JSON input file, Trifacta un-nests the information into a grid so the user can easily interact with it.
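Trifacta does this un-nesting automatically. For readers who want a feel for what the step does, a comparable operation in pandas, assuming the file is a JSON array of records with some nesting, could be:

```python
import pandas as pd

# Hypothetical nested record shape for illustration; pd.json_normalize
# flattens nested keys into dotted column names, producing a flat grid.
records = [
    {"show_number": "4680", "round": "Jeopardy!",
     "clue": {"category": "HISTORY", "value": "$200"}},
    {"show_number": "4680", "round": "Jeopardy!",
     "clue": {"category": "HISTORY", "value": "$400"}},
]

df = pd.json_normalize(records)
print(df.columns.tolist())  # e.g. ['show_number', 'round', 'clue.category', 'clue.value']
```

Each nested object becomes a set of flat columns, which is what makes the grid view possible.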

Trifacta also presents histograms for each column that give a good sense of the data’s distribution, along with inferred data types and quality bars that display how much of our data is incorrectly categorized or missing. This information allows analysts to quickly focus their structuring, cleaning and enriching efforts.
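The same kind of per-column profile can be approximated by hand. This sketch, using a made-up "value" column that contains some missing and unparseable entries, computes an inferred type, a missing/invalid count, and a coarse distribution, which is roughly the information a quality bar and histogram convey:

```python
import pandas as pd

# Made-up sample of a dollar-amount column with one null and one bad entry.
df = pd.DataFrame({"value": ["$200", "$400", None, "None", "$200"]})

# Strip the dollar sign and coerce to numbers; entries that fail to parse
# become NaN, which is roughly what a quality bar flags as invalid/missing.
parsed = pd.to_numeric(df["value"].str.replace("$", "", regex=False), errors="coerce")

print(parsed.dtype)                      # inferred numeric type
print(parsed.isna().sum())               # 2 invalid-or-missing entries
print(parsed.value_counts().to_dict())   # {200.0: 2, 400.0: 1}
```

Doing this manually for every column is exactly the tedium that automated profiling removes.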

(Trifacta script suggestions, histograms, and quality bars for the Jeopardy Questions dataset)

Jeopardy Contestants:

The contestants data set was scraped from the j-archive website. When we opened it in Trifacta, the platform inferred line and column breaks for the file and gave us the same set of summary statistics. Inspecting the data in Trifacta, we could see that each row contained a free-text biography for each Jeopardy contestant in the referenced show, alongside separate columns holding final scores that were not connected to the individual players. This quickly let us know that we’d have to carry out some reshaping and text extraction in our subsequent steps before we’d be able to perform any meaningful analysis on the data.
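The text extraction hinted at here can be sketched with a regular expression, under the assumption that bios follow an "Name, a[n] occupation from City, State" pattern; both the pattern and the field names are our illustrative assumptions, not Trifacta output:

```python
import re

bio = "Alice Smith, a teacher from Los Angeles, California"

# Hypothetical bio pattern: "<name>, a[n] <occupation> from <city>, <state>"
pattern = re.compile(
    r"^(?P<name>[^,]+), an? (?P<occupation>.+?) from (?P<city>[^,]+), (?P<state>.+)$"
)

m = pattern.match(bio)
if m:
    print(m.group("name"))        # Alice Smith
    print(m.group("occupation"))  # teacher
    print(m.group("state"))       # California
```

Real bios are messier than any single pattern, which is why interactive extraction, where you see which rows a rule fails on, beats writing regexes blind.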

In Part 2 of this KDnuggets blog series, we will walk through the remaining steps of the data transformation process, detailing how we used Trifacta to structure, clean, enrich and distill this Jeopardy data for analysis.

Bios: Alon Bartur brings a wealth of field experience to Trifacta's product management team with his experience in product management, alliances and sales engineering. Prior to joining Trifacta, Alon worked at both GoodData and Google. As Product Manager at Trifacta, Alon works closely with Trifacta customers and partners to drive the product roadmap and requirements for Trifacta's Data Transformation Platform.

Will Davis drives both content and product marketing efforts at Trifacta having spent the past five years managing the marketing initiatives for several high-growth data companies. Prior to Trifacta, Will worked with a variety of companies focused on data infrastructure, analytics and visualization, including GoodData, Greenplum and ClearStory Data. Will develops and executes Trifacta’s marketing and content strategies to rapidly expand business growth and brand awareness.
