The Data Science Process, Rediscovered

The Data Science Process is a relatively new framework for doing data science. It is compared to previous similar frameworks, and a discussion on process innovation versus repetition is then undertaken.


As a comparison to the Data Science Process put forth by Blitzstein & Pfister, and elaborated upon by Squire, we take a quick look at the de facto official (yet unquestionably falling out of fashion) data mining framework (which has been extended to data science problems), the Cross Industry Standard Process for Data Mining (CRISP-DM). Though the standard is no longer actively maintained, it remains a popular framework for navigating data science projects.


CRISP-DM is made up of the following steps:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

You can see similarities in these models: we start by asking a question or looking for insight into some particular phenomenon, we need some data to examine, the data must be inspected or prepared in some manner, the data is used to create some appropriate model, and something is done with the resulting model, be it "deployed" or "communicated." Though not quite to the extent of Blitzstein & Pfister's Data Science Process, CRISP-DM's workflow allows for iterative problem solving, and is clearly nonlinear.

Just as the standard itself is no longer maintained, neither is its website. You can, however, access further information about CRISP-DM on its Wikipedia page. For those unfamiliar with CRISP-DM, this visual guide is a good place to begin.

So CRISP-DM is clearly the base framework for investigating data science problems. Right?

KDD Process

Around the same time that CRISP-DM was emerging, the KDD Process had finished developing. The KDD (Knowledge Discovery in Databases) Process, by Fayyad, Piatetsky-Shapiro, and Smyth, is a framework which has, at its core, "the application of specific data-mining methods for pattern discovery and extraction." The framework consists of the following steps:

  • Selection
  • Preprocessing
  • Transformation
  • Data Mining
  • Interpretation

If you consider the term "data mining" analogous to the term "modeling" in the previous frameworks, the KDD Process lines up similarly. Note the iterative nature of this model as well.

KDD Process


It is important to note that these are not the only frameworks in this space; SEMMA (for Sample, Explore, Modify, Model and Assess), from SAS, and the agile-oriented Guerilla Analytics both come to mind. There are also numerous in-house processes that various data science teams and individuals no doubt employ across any number of companies and industries in which data scientists work.

So, is the Data Science Process a new take on CRISP-DM, which is just a reworking of KDD, or is it a new, independent framework in its own right? Well, yes. And no.

Just as data science can be viewed as a contemporary take on data mining, the Data Science Process and CRISP-DM may be viewed as updates to the KDD process. To be clear, however, even if this is the case, is does not render them unnecessary; the updates to their process presentations may be of benefit to newer generations approaching these processes from both the point of view of refreshed and up-to-date language, as well as the presentation of a framework which can be viewed as "new" and, thus, worthy of attention.

Is every JavaScript library warranted? I'm no expert in the space, but I would say probably not. Sure, it's not a perfect analogy, but the underlying point is that in technology, there are often overlaps in tools being employed. People are attracted to shiny new things, and to different things, and as such, newly packaged terminology can serve both a psychological and practical purpose, even if it happened to be exactly, or relatively, the same as one which came before it.

The Data Science Process and its predecessor CRISP-DM are basically re-workings of the KDD Process. And this is not meant with malice or dark undertone; it has not been typed in the accusatory or with wagging finger. This is simply a statement of a simple fact: that which comes before influences that which comes after. In the end, any framework or process or series of steps which we take to do data science, as long as it works for us and provides accurate results, is worthy of being used. Even if this happens to be the Data Science Process, or CRISP-DM, or the KDD Process, or whatever steps you take when you enter a Kaggle competition, or your boss asks you to cluster some data on widgets, or you try your hand at the latest deep learning research paper.