Silver BlogThe Machine Learning Puzzle, Explained

Lots of moving parts go into creating a machine learning model. Let's take a look at some of these core concepts and see how the machine learning puzzle comes together.

Previously I have written about the data science puzzle, an overview which defines a number of key concepts related to data science, and which which attempts to explain how these pieces relate to one another and fit together. This time we will take a similar look at the machine learning model.

Keep in mind we will be approaching machine learning from a supervised perspective, and all concepts are discussed with classification as our goal throughout (though regression would be similar). Importantly, this puzzle view will not cover other machine learning paradigms, such as unsupervised learning and reinforcement learning, so keep that in mind as the pieces are unveiled.

With that said, read on to find out how the machine learning puzzle comes together.

Machine learning puzzle

Machine learning is one of the primary technical drivers of data science. The goal of data science is to extract insight from data, and machine learning is the engine which allows this process to be automated. Machine learning algorithms continue to facilitate the automatic improvement of computer programs from experience, and these algorithms are becoming increasingly vital to a variety of diverse fields.

Data is what makes the world turn these days. Data is what is fed into machine learning algorithms in order for the "learning" to take place. A particular dataset is a static, finite collection of instances — or observations — and their corresponding features; if we are considering the data used to train and test models, a class label — or target — which defines the classification group of which an observation is a member, is also present. A simple way of thinking about data is as a table (though this need not be the case), and in this scenario rows in a table are instances, while columns are features. Data need not be arranged as a simple table; however, we do our best to keep data arranged into multidimensional arrays — of varying dimensions — which are robust enough to capture all sorts of data representations that we can come up with, provided we use the shoehorn of ingenuity at times.

The machine learning algorithm is the particular approach and sequence of steps which will be used in order to learn the best way to model the data. Algorithms use different methods of attempting to best predict the label of an instance of data based on that instance's feature values. Given a sufficient number of data instances, the idea is that a machine learning algorithm will be able to approximate with some degree of success (however you want to define success) what class a given instance falls in to, hence the term "classification." Hyperparameters are the knobs that can be turned in order to fine-tune the algorithmic "learning."

Simple put, algorithm + dataset = model. The model is the result of the data and algorithm after the training phase, at which point the resulting model has "learned" the best way of making sense of the data, in terms of how to go from an instance's set of features to the class prediction. The model is a mathematical function (created as the result of training; see below) which is able to take instances of the dataset and predict their class membership.

How does the algorithm use the data to create a model? This is accomplished via training, the procedure in which a portion of the available data, the training set, is fed into the algorithm, which undergoes its steps and produces a trained model. This trained model can then be shown unseen holdout data in the way of a validation set in order to see how well it performs classifying data it has never seen before. Finally, a testing set can be used to differentiate between the quality of differing models, which is a second set of unseen data which gets shown to final products only after they have been tweaked to their best possible performance otherwise.

The loss function is the mechanism by which comparisons between class predictions and actual class labels are made for training data instances. The goal of a machine learning algorithm is to iteratively minimize this loss — the value output by the loss function representing the distance between predictions and reality — and so the lower this value is, the better the model has learned to predict class labels.

What else is used to determine how well a model performs? There are a number of metrics available for model evaluation. The simplest form of such for classification is simple accuracy, which is the fraction of correctly made predictions. Precision and recall are another pair of useful classification metrics, and there are others still. Regression, classification's continuous cousin, uses a different set of metrics meant to determine the distance a prediction is from the corresponding actual value.



Source: Andrew Ng's Machine Learning class at Stanford


Has your model done a great job predicting the classes of your training data, but is not generalizing well to the holdout sets? You have fallen victim to overfitting, a scenario where the model's internal understanding of a problem is tightly linked to the particular instances in the training set. Overfitting is a failure to separate signal from noise and instead treats all data as signal.

One way to help fix overfitting is with regularization, which is a method of applying a penalty to particular model parameters in order to force the learning of a less flexible, less complex model. This lack of complexity should produce a model which does not capture the nuance of the training data as well, and should generalize better to unseen data.

There you have a brief look at some of the core concepts involved in the machine learning puzzle meant for prospective newcomers to the field. An understanding of these elementary concepts can help form a solid foundation on which to build your internal machine learning framework.