Frameworks for Approaching the Machine Learning Process
This post is a summary of 2 distinct frameworks for approaching machine learning tasks, followed by a distilled third. Do they differ considerably (or at all) from each other, or from other such processes available?
Is it worth comparing approaches to the machine learning process? Are there any fundamental differences between such frameworks?
Though classical approaches to such tasks exist, and have existed for some time, it is worth taking consult from new and different perspectives for a variety of reasons: Have I missed something? Are there new approaches which had not previously been considered? Should I change my perspective on how I approach machine learning?
The 2 most recent resources I've come across outlining frameworks for approaching the process of machine learning are Yufeng Guo's The 7 Steps of Machine Learning and section 4.5 of Francois Chollet's Deep Learning with Python. Are either of these anything different than how you already process just such a task?
What follows are outlines of these 2 supervised machine learning approaches, a brief comparison, and an attempt to reconcile the two into a third framework highlighting the most important areas of the (supervised) machine learning process.
The 7 Steps of Machine Learning
I actually came across Guo's article by way of first watching a video of his on YouTube. The post is the same content as the video, and so if interested one of the two resources will suffice.
Guo laid out the steps as follows (with a little ad-libbing on my part):
- Data Collection
→ The quantity & quality of your data dictate how accurate our model is
→ The outcome of this step is generally a representation of data (Guo simplifies to specifying a table) which we will use for training
→ Using pre-collected data, by way of datasets from Kaggle, UCI, etc., still fits into this step
- Data Preparation
→ Wrangle data and prepare it for training
→ Clean that which may require it (remove duplicates, correct errors, deal with missing values, normalization, data type conversions, etc.)
→ Randomize data, which erases the effects of the particular order in which we collected and/or otherwise prepared our data
→ Visualize data to help detect relevant relationships between variables or class imbalances (bias alert!), or perform other exploratory analysis
→ Split into training and evaluation sets
- Choose a Model
→ Different algorithms are for different tasks; choose the right one
- Train the Model
→ The goal of training is to answer a question or make a prediction correctly as often as possible
→ Linear regression example: algorithm would need to learn values for m (or W) and b (x is input, y is output)
→ Each iteration of process is a training step
- Evaluate the Model
→ Uses some metric or combination of metrics to "measure" objective performance of model
→ Test the model against previously unseen data
→ This unseen data is meant to be somewhat representative of model performance in the real world, but still helps tune the model (as opposed to test data, which does not)
→ Good train/eval split? 80/20, 70/30, or similar, depending on domain, data availability, dataset particulars, etc.
- Parameter Tuning
→ This step refers to hyperparameter tuning, which is an "artform" as opposed to a science
→ Tune model parameters for improved performance
→ Simple model hyperparameters may include: number of training steps, learning rate, initialization values and distribution, etc.
- Make Predictions
→ Using further (test set) data which have, until this point, been withheld from the model (and for which class labels are known), are used to test the model; a better approximation of how the model will perform in the real world
Universal Workflow of Machine Learning
In section 4.5 of his book, Chollet outlines a universal workflow of machine learning, which he describes as a blueprint for solving machine learning problems:
The blueprint ties together the concepts we've learned about in this chapter: problem definition, evaluation, feature engineering, and fighting overfitting.
How does this compare with Guo's above framework? Let's have a look at the 7 steps of Chollet's treatment (keeping in mind that, while not explicitly stated as being specifically tailored for them, his blueprint is written for a book on neural networks):
- Defining the problem and assembling a dataset
- Choosing a measure of success
- Deciding on an evaluation protocol
- Preparing your data
- Developing a model that does better than a baseline
- Scaling up: developing a model that overfits
- Regularizing your model and tuning your parameters
Source: Andrew Ng's Machine Learning class at Stanford
Chollet's workflow is higher level, and focuses more on getting your model from good to great, as opposed to Guo's, which seems more concerned with going from zero to good. While it does not necessarily jettison any other important steps in order to do so, the blueprint places more emphasis on hyperparameter tuning and regularization in its pursuit of greatness. A simplification here seems to be:
Drafting A Simplified Framework
We can reasonably conclude that Guo's framework outlines a "beginner" approach to the machine learning process, more explicitly defining early steps, while Chollet's is a more advanced approach, emphasizing both the explicit decisions regarding model evaluation and the tweaking of machine learning models. Both approaches are equally valid, and do not prescribe anything fundamentally different from one another; you could superimpose Chollet's on top of Guo's and find that, while the 7 steps of the 2 models would not line up, they would end up covering the same tasks in sum.
Mapping Chollet's to Guo's, here is where I see the steps lining up (Guo's are numbered, while Chollet's are listed underneath the corresponding Guo step with their Chollet workflow step number in parenthesis):
- Data collection
→ Defining the problem and assembling a dataset (1)
- Data preparation
→ Preparing your data (4)
- Choose model
- Train model
→ Developing a model that does better than a baseline (5)
- Evaluate model
→ Choosing a measure of success (2)
→ Deciding on an evaluation protocol (3)
- Parameter tuning
→ Scaling up: developing a model that overfits (6)
→ Regularizing your model and tuning your parameters (7)
It's not perfect, but I stand by it.
In my view, this presents something important: both frameworks agree, and together place emphasis, on particular points of the framework. It should be clear that model evaluation and parameter tuning are important aspects of machine learning. Addition agreed-upon areas of importance are the assembly/preparation of data and original model selection/training.
A Simplified Machine Learning Process
Let's use the above to put together a simplified framework to machine learning, the 5 main areas of the machine learning process:
- Data collection and preparation
→ everything from choosing where to get the data, up to the point it is clean and ready for feature selection/engineering
- Feature selection and feature engineering
→ this includes all changes to the data from once it has been cleaned up to when it is ingested into the machine learning model
- Choosing the machine learning algorithm and training our first model
→ getting a "better than baseline" result upon which we can (hopefully) improve
- Evaluating our model
→ this includes the selection of the measure as well as the actual evaluation; seemingly a smaller step than others, but important to our end result
- Model tweaking, regularization, and hyperparameter tuning
→ this is where we iteratively go from a "good enough" model to our best effort
So, which framework should you use? Are there really any important differences? Do those presented by Guo and Chollet offer anything that was previously lacking? Does this simplified framework provide any real benefit? As long as the bases are covered, and the tasks which explicitly exist in the overlap of the frameworks are tended to, the outcome of following either of the two models would equal that of the other. Your vantage point or level of experience may exhibit a preference for one.
As you may have guessed, this has really been less about deciding on or contrasting specific frameworks than it has been an investigation of what a reasonable machine learning process should look like.
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.