Fantastic Four of Data Science Project Preparation
This article takes a closer look at the four fantastic things we should keep in mind when approaching every new data science project.
In classic comic form, the Fantastic Four of Data Science Project Preparation were originally introduced via cameo in the Data Avengers... Assemble! article. So if you happen to have that post, hang on to it. It will undoubtedly be worth something some day.
Moving right along...
In Marvel comics, the Fantastic Four are a 4 person collective, the members of which gained superpowers when exposed to cosmic rays while on a space flight. Reed Richards aka Mr. Fantastic can stretch, reshape, and contort his body in inhuman ways, and is the group leader. Susan Storm aka Invisible Woman is Reed Richards' wife who possesses the ability to become invisible. Johnny Storm aka Human Torch is Sue Storm's brother who has the mastery of fire, which allows him to engulf his body in flame, and also has the power of flight for some reason. Ben Grimm aka The Thing is Reed Richards' best friend who has been transformed by the cosmic rays into a rock-like monster with superhuman strengh.
As you can see — and as is with most superheroes — it really is a family affair.
For our purposes, the Fantastic Four of data science include these vital aspects of approaching each project:
- Understand the problem domain and questions being asked
- Investigate the data
- Clean, prepare, and transform the data as required
- Approach problem solving from within a well-defined framework
Just as Marvel's Fantastic Four have devoted themselves to helping others, our four help data scientists approach their projects with success in mind.
Let's take a closer look at the four fantastic things we should keep in mind when approaching every new data science project.
1. Understand the problem domain and questions being asked
The first fantastic thing of data science project preparation is understanding the problem domain. Having sufficient domain knowledge is crucial to success.
What exactly constitutes sufficient domain knowledge? It's relative. Are you doing a shallow descriptive analysis of some run-of-the-mill dating app? Or are you undertaking some in-depth predictive analytics project in finance for an organization which specializes in some obscure securities investment strategy? The required knowledge of the "dating" domain to perform the first feat is likely negligible, but any useful insights into the second will certainly demand a solid financial understanding.
Context-dependent minimal levels of expertise aside, you also need to understand the questions being asked. This is separate from understanding the domain. You may be a bona fide financial system whiz, but if you don't have a handle on what questions are being posed in a project — which are a direct catalyst to how you undertake the project in its entirety — the results will be lackluster at best, and completely useless at worst.
These 2 pieces of understanding taken together — that of the domain and that of questions being asked — should provide invaluable insights when plotting your move forward, such as:
- What is it we want to know?
- What questions are answerable from the data, and which are not?
- What more do we need to know in order to find out what we want to know?
2. Investigate the data
Investigating the data basically boils down to gaining a familiarity with the data of a given project. Exploratory data analysis, or EDA, is a particular approach to this gaining of familiarity, generally focusing on summarizing a dataset statistically and visually.
At a high level, EDA is the practice of using visual and quantitative methods to understand and summarize a dataset without making any assumptions about its contents. It is a crucial step to take before diving into machine learning or statistical modeling because it provides the context needed to develop an appropriate model for the problem at hand and to correctly interpret its results.
With the rise of tools that enable easy implementation of powerful machine learning algorithms, it can become tempting to skip EDA. While it’s understandable why people take advantage of these algorithms, it’s not always a good idea to simply feed data into a black box—we have observed over and over again the critical value EDA provides to all types of data science problems.
3. Clean, prepare, and transform the data as required
You know the domain. You know the questions being asked. You have a handle on what is in the data, and how that maps to the ability to answer the questions you want answered.
It's time to get to the doing, right? Not so fast... first the data will need to be wrangled, pre-processed, prepared, or otherwise finessed in order to be useful in predictive modeling — or whatever other data science task it is you are wanting to undertake.
I would say that it is "identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data" in the context of "mapping data from one 'raw' form into another..." all the way up to "training a statistical model" which I like to think of data preparation as encompassing, or "everything from data sourcing right up to, but not including, model building."
Once the data has been rendered into a useful form, now it's time for the doing.
4. Approach problem solving from within a well-defined framework
Moving forward with our problem solving within a defined framework is the best plan.
It's not about any single or particular framework; it's about ensuring that you have some reasonable procedural approach in mind which is standardized, reliable, and measurable. Well known examples of such formal frameworks include Knowledge Discovery in Databases (KDD) Process, the cross-industry standard process for data mining (CRISP-DM), and Joe Blitzstein and Hanspeter Pfister's Data Science Process. You can read more about these approaches here.
One less formal approach is Aurélien Géron's Machine Learning Project Checklist, outlined in his book "Hands-On Machine Learning with Scikit-Learn & TensorFlow." 2 others are Yufeng Guo's 7 Steps of Machine Learning and Francois Chollet's Universal Workflow of Machine Learning from his book "Deep Learning with Python." You can read more about these 2 approaches here.
I also outline a simplified framework to machine learning in the above post, the 5 main areas of the machine learning process:
- Data collection and preparation: everything from choosing where to get the data, up to the point it is clean and ready for feature selection/engineering
- Feature selection and feature engineering: this includes all changes to the data from once it has been cleaned up to when it is ingested into the machine learning model
- Choosing the machine learning algorithm and training our first model: getting a "better than baseline" result upon which we can (hopefully) improve
- Evaluating our model: this includes the selection of the measure as well as the actual evaluation; seemingly a smaller step than others, but important to our end result
- Model tweaking, regularization, and hyperparameter tuning: this is where we iteratively go from a "good enough" model to our best effort
So, which framework should you use? Are there really any important differences? [...] Does this simplified framework provide any real benefit? As long as the bases are covered, and the tasks which explicitly exist in the overlap of the frameworks are tended to, the outcome of following [any of the] models would equal that of [any] other. Your vantage point or level of experience may exhibit a preference for one.
As you may have guessed, this has really been less about deciding on or contrasting specific frameworks than it has been an investigation of what a reasonable machine learning process should look like.
So there is our Fantastic Four of Data Science Project Preparation. Our version may not do battle with the likes of super villains such as Dr. Doom or Galactus, but they play a crucial role in the success of our analytics projects. Hopefully the value of preparation has been reinforced.
All comic personalities mentioned herein, and images used, are the sole and exclusive property of Marvel Comics.
- Data Avengers... Assemble!
- The Infinity Stones of Data Science
- The Machine Learning Puzzle, Explained