A Beginner’s Guide to End to End Machine Learning

Learn to train, tune, deploy and monitor machine learning models.



By Rebecca Vickery, Data Scientist



Photo by Modestas Urbonas on Unsplash

 

Supervised machine learning is a technique that maps a series of inputs (X) to some known outputs (y) without being explicitly programmed. Training a machine learning model refers to the process by which a machine learns a mapping between X and y. Once trained, the model can be used to make predictions on new inputs where the output is unknown.

The training of a machine learning model is only one element of the end to end machine learning lifecycle. For a model to be truly useful, this mapping needs to be stored and deployed for use. This is often referred to as putting the model into production. Additionally, once a model is in production, the predictions and overall performance of the model need to be monitored to ensure that the quality of the predictions does not degrade over time.

To introduce the basic concepts in the end to end machine learning workflow I am going to use the Python library Pycaret. Pycaret is a low-code machine learning library that seeks to simplify and speed up the overall machine learning workflow by providing a high-level programming interface and automating some of the repetitive tasks in machine learning.

The library is developing quickly and has recently added functionality to encompass the entire end to end machine learning workflow, from model development to deployment and monitoring. In the following article, I will use this package to give a simple introduction to the end to end machine learning lifecycle.

 

The data

 
 
In this tutorial, I am going to use one of Pycaret’s built-in datasets known as “employee”. This consists of a set of features about employees at an unspecified company and a target variable that denotes if that employee has left the company. This is a classic classification task in machine learning where the goal is to train a model that can predict if an employee is likely to leave.

The data can easily be imported through the Pycaret API. In the below code we read in the data and reserve a validation sample for later use in the workflow.
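A minimal sketch of this step uses Pycaret's get_data helper; the 10% hold-out fraction and the random seed are choices I have made for illustration rather than values prescribed here.

from pycaret.datasets import get_data

# Load the built-in "employee" dataset
dataset = get_data('employee')

# Reserve a 10% validation sample for use later in the workflow
data = dataset.sample(frac=0.9, random_state=42)
data_validation = dataset.drop(data.index).reset_index(drop=True)
data = data.reset_index(drop=True)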

The first few lines of the data look as follows:



First few rows of the “employee” dataset. Image by Author.

 

Preprocessing

 
 
Pycaret has a set of modules, each containing a suite of functions for a specific machine learning task. The dataset we are using poses a classification problem, so we will primarily be using the classification module.

The first step is to import the setup function. This function is run prior to performing any other steps. It initialises the Pycaret environment and creates a transformation pipeline to preprocess the data ready for modelling. When run, Pycaret will infer the data types for all feature and target columns.
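A minimal call might look like the following; I am assuming the target column in the employee dataset is named 'left', and session_id is an optional seed I have added for reproducibility.

from pycaret.classification import setup

# Initialise the environment and build the preprocessing pipeline
clf_setup = setup(data=data, target='left', session_id=42)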



Pycaret has inferred the data types for the features and target variable. Image by Author

 

If we are happy that these data types are correct, and are also happy to rely on Pycaret’s default preprocessing methods, we can simply hit enter and Pycaret will prepare the data for modelling and print a report describing the steps that have been taken.

Below are just the first 15 lines of this 59-line report. The setup function has a large number of optional parameters that can be used to customise preprocessing. For example, the categorical_features parameter can be used to manually specify categorical columns where the data type has not been inferred correctly. A full list of available parameters can be found here.



The first 15 lines of the preprocessing report. Image by Author.

 

In a real-world machine learning project, it is likely that we would perform more in-depth exploratory analysis and custom preprocessing. However, for the purposes of this tutorial, we will continue with the default preprocessing.

 

Baseline model

 
 
Typically in machine learning workflows, it is sensible to train a simple model first to establish a performance baseline before moving on to more complex algorithms.

At the time of writing, Pycaret supports 18 different algorithms for classification. The full list can be found here. For the baseline model, we will use logistic regression, which we train using the create_model function. To select the algorithm we pass in the abbreviated string found in the documentation. After training, Pycaret will print a report detailing the performance of the model.
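A sketch of that call is shown below; 'lr' is the abbreviation Pycaret uses for logistic regression.

from pycaret.classification import create_model

# Train a logistic regression baseline with default settings
lr_baseline = create_model('lr')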



The create_model report. Image by Author.

 

Comparing models

 
 
By establishing a baseline model we have determined that the preprocessing and setup are sufficient to build a classification model.

The next step is to compare the different classification algorithms to determine which is the best one to use for our dataset and problem. Pycaret has a function called compare_models() which allows us to compare the performance of all available algorithms with one line of code.

This function will try all algorithms and output a list ranked by accuracy. You can change the ranking metric using the sort argument.
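In code this is a single call; the sort argument is optional and defaults to accuracy.

from pycaret.classification import compare_models

# Train and rank all available classifiers, sorted by accuracy
best_model = compare_models(sort='Accuracy')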



Comparing models. Image by Author.

 

We can see from this that Random Forest is overall the best model. We will use this model in the rest of our workflow.

We run create_model() again, this time selecting the Random Forest algorithm.
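A sketch of that call; 'rf' is the abbreviation Pycaret uses for the random forest classifier.

# Train a random forest classifier with default hyperparameters
rf_model = create_model('rf')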

 

Tuning the model

 
 
When we run create_model, it automatically uses the default hyperparameters. These are not necessarily the best set to use, so we need to tune the model to find the best selection. Running the tune_model() function tunes the hyperparameters of the model using random grid search.

By default, the function will use a pre-defined grid of parameters but this can be customised by passing a custom grid into the custom_grid parameter. By default, the function will also optimise the accuracy score but this can also be customised using the optimize parameter.

In the below code we use the default parameter grid to tune the model and optimise for the MCC score.
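A sketch of that call, assuming the random forest model trained above is held in rf_model:

from pycaret.classification import tune_model

# Tune hyperparameters with the default random grid, optimising for MCC
tuned_rf = tune_model(rf_model, optimize='MCC')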

The function returns a table containing k-fold cross-validated scores; by default, the number of folds is 10.



Model tuning. Image by Author.

 

Interpret the results

 
 
Pycaret has a selection of built-in plots to interpret the results of the model which can be accessed using the plot_model() function.

Below we inspect feature importances using this function.
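A sketch of that call, using the tuned model from the previous step:

from pycaret.classification import plot_model

# Plot feature importances for the tuned model
plot_model(tuned_rf, plot='feature')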



Feature importances. Image by Author.

 

Let’s also visualise the confusion matrix.
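Again using plot_model, this time with the confusion matrix plot type:

# Plot the confusion matrix for the tuned model
plot_model(tuned_rf, plot='confusion_matrix')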



The confusion matrix. Image by Author.

 

Deploy the model

 
 
Pycaret has additional functionality to deploy models to the cloud using AWS.

To deploy the model on AWS you must first create an account at aws.amazon.com. Once created, you need to generate an access key so that the Pycaret function can write to your account. To do this, follow the steps below.

  1. Log in to your AWS account.
  2. Hover over your username in the top right corner.
  3. Select "Security credentials" from the list that appears.
  4. Expand the "Access keys" section and click "Create new access key".
  5. Download the credentials file when prompted. You will need the credentials contained in this file later.
  6. To allow Pycaret to interact with your AWS account, you also need to install and configure the AWS CLI. To do this, first run the following.
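The download command referred to here is the macOS package fetch, matching the sudo installer step that follows (Linux and Windows use different installers):

curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"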

 

Then run this:

sudo installer -pkg ./AWSCLIV2.pkg -target /

 

Now type aws --version to confirm everything has been installed correctly. You should see something like this.



Now that this is installed, we can configure the AWS CLI with your account details. To do this, run the command shown below. It will ask you to input the following pieces of information.

AWS Access Key ID: this can be found in the credentials file we downloaded earlier.

AWS Secret Access Key: also found in the credentials file.

Default region name: this can be found on the AWS console.

Default output format: this should be left blank.
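The configuration command itself is the standard AWS CLI one:

aws configure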

Next, we create an S3 bucket to store the deployed model. From the AWS console select S3 and then create a bucket with your chosen name.



Now we are ready to use Pycaret to deploy our chosen model.
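A sketch of the deployment call; the model name and S3 bucket name below are placeholders for whatever you chose above.

from pycaret.classification import deploy_model

# Push the tuned model to the S3 bucket created above
deploy_model(tuned_rf, model_name='employee-churn-rf',
             platform='aws',
             authentication={'bucket': 'your-bucket-name'})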

If the model has deployed successfully, you will see the following message.


 

Generate predictions

 
 
To use the model we generate predictions on the validation data we reserved earlier.

The code shown below loads the model from S3 and generates predictions on the validation data with the labels removed.
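A sketch of those two steps, assuming the same placeholder model and bucket names used in the deployment step and a target column named 'left':

from pycaret.classification import load_model, predict_model

# Load the deployed model back from S3
model = load_model('employee-churn-rf',
                   platform='aws',
                   authentication={'bucket': 'your-bucket-name'})

# Generate predictions on the validation sample with the label column removed
predictions = predict_model(model, data=data_validation.drop('left', axis=1))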

Let’s view the first 5 predictions by running predictions[:5].



First 5 predictions. Image by Author

 

Monitoring

 
 
An important aspect of the production machine learning workflow is tracking and monitoring the experiments that are performed. Pycaret integrates with MLflow, an open-source platform for managing the end to end machine learning lifecycle.

To use MLflow via Pycaret we need to import MLflow (it should already have been installed along with Pycaret) and set the tracking URI. We then add a couple of extra parameters, as shown below, when calling the setup function.
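A sketch of those steps; the tracking URI shown is an example local server, not a value prescribed by Pycaret, and the experiment name is my own placeholder.

import mlflow
from pycaret.classification import setup

# Point MLflow at a tracking server (example URI)
mlflow.set_tracking_uri('http://localhost:5000')

# Re-run setup with experiment logging enabled
clf_setup = setup(data=data, target='left', session_id=42,
                  log_experiment=True,
                  experiment_name='employee-churn')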

Now if we open the tracking URI in a browser, we should see the MLflow UI and the experiments it has tracked.



mlflow UI. Image by author

 

We can click through to view the metrics, artifacts and params for each experiment performed.



Experiment tracking. Image by Author

 

This has been a simple tutorial covering the basic steps involved in an end to end machine learning project. In the real world, most machine learning problems, datasets and deployment solutions require much more complex treatment. However, for the purposes of this tutorial I have used a simple dataset and Python library to get you started with learning about the end to end machine learning lifecycle.

Pycaret is an example of a low-code machine learning library. There are a number of tools being developed that seek to simplify machine learning development in different ways. To learn more about these low effort machine learning libraries see my earlier article.

 
Low Effort Machine Learning Tools
 

For more information about machine learning in the real world see the article below.

 
Machine Learning in Real Life
 

Thanks for reading!

 
Bio: Rebecca Vickery is a Data Scientist with extensive experience in data analysis, machine learning and data engineering. She has 12 years' experience with SQL and 4+ years with Python, R, Apache Airflow and Google Analytics.
