The Data Science Process, Rediscovered

The Data Science Process is a relatively new framework for doing data science. Here it is compared with similar earlier frameworks, followed by a discussion of process innovation versus repetition.

Last week, KDnuggets' top tweet was a Quora answer to the question "What is the work flow or process of a data scientist?" The answer, written by Ryan Fox Squire, a self-described "Neuroscientist Turned Data Scientist," used The Data Science Process to describe such a workflow.

The Data Science Process

The Data Science Process is a framework for approaching data science tasks, crafted by Joe Blitzstein and Hanspeter Pfister of Harvard's CS 109. The goal of CS 109, as per Blitzstein himself, is to introduce students to the overall process of data science investigation, a goal which should provide some insight into the framework itself.

[Figure: Data science process]

The following is a sample application of Blitzstein & Pfister's framework, regarding skills and tools at each stage, as given by Ryan Fox Squire in his answer:

Stage 1: Ask A Question

  • Skills: science, domain expertise, curiosity
  • Tools: your brain, talking to experts, experience

Stage 2: Get the Data

  • Skills: web scraping, data cleaning, querying databases, CS stuff
  • Tools: python, pandas

Stage 3: Explore the Data

  • Skills: Get to know data, develop hypotheses, patterns? anomalies?
  • Tools: matplotlib, numpy, scipy, pandas, mrjob

Stage 4: Model the Data

  • Skills: regression, machine learning, validation, big data
  • Tools: scikit-learn, pandas, mrjob, mapreduce

Stage 5: Communicate the Data

  • Skills: presentation, speaking, visuals, writing
  • Tools: matplotlib, adobe illustrator, powerpoint/keynote

Squire then (rightfully) concludes that the data science workflow is a non-linear, iterative process, and that many skills and tools are required to cover the full data science process. Squire also says he is fond of the Data Science Process because it stresses both the importance of asking questions to guide your workflow and the importance of iterating on your questions and research as you gain familiarity with your data.
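To make the five stages a little more concrete, here is a minimal sketch in Python that walks through them on a toy problem. Everything in it is an assumption made for illustration: the question, the made-up data, and the function names (get_data, explore, model, communicate) come from neither Squire's answer nor CS 109. It also sticks to the standard library so that it runs anywhere; in practice the tools listed above (pandas, matplotlib, scikit-learn) would likely do this work.

```python
import csv
import io
import statistics

# Stage 1: Ask a question (hypothetical, for illustration only).
QUESTION = "Do more hours of study go with higher exam scores?"

# Made-up toy data standing in for a scraped page or a database query result.
RAW_CSV = """hours,score
1,52
2,55
3,61
4,64
,70
5,71
"""


def get_data(text):
    """Stage 2: parse the raw text and drop rows with missing values."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        if record["hours"] and record["score"]:
            rows.append((float(record["hours"]), float(record["score"])))
    return rows


def explore(rows):
    """Stage 3: compute simple summaries to get to know the data."""
    hours = [h for h, _ in rows]
    scores = [s for _, s in rows]
    return {
        "n": len(rows),
        "mean_hours": statistics.mean(hours),
        "mean_score": statistics.mean(scores),
    }


def model(rows):
    """Stage 4: least-squares fit of score = intercept + slope * hours."""
    mean_h = statistics.mean(h for h, _ in rows)
    mean_s = statistics.mean(s for _, s in rows)
    sxy = sum((h - mean_h) * (s - mean_s) for h, s in rows)
    sxx = sum((h - mean_h) ** 2 for h, _ in rows)
    slope = sxy / sxx
    intercept = mean_s - slope * mean_h
    return slope, intercept


def communicate(summary, slope, intercept):
    """Stage 5: turn the numbers into a sentence a reader can act on."""
    return (
        f"{QUESTION} In {summary['n']} usable records, each extra hour "
        f"went with about {slope:.1f} more points "
        f"(baseline {intercept:.1f})."
    )


if __name__ == "__main__":
    data = get_data(RAW_CSV)
    summary = explore(data)
    slope, intercept = model(data)
    print(communicate(summary, slope, intercept))
```

The stages appear here as a straight line only for brevity. In a real project, what turns up in explore or model would often send you back to refine the question or fetch more data, which is exactly the iteration Squire describes.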

The Data Science Process is an innovative framework for approaching data science problems. Isn't it?

Next, we look at CRISP-DM.