How to Set Up Your First Machine Learning Pipeline Using Scikit-Learn

Keep your ML workflow organized! Pipelines are like a checklist you don’t have to keep track of—Scikit-Learn handles it all for you.




Scikit-Learn is a popular Python library with numerous tools to make your machine learning projects simple and efficient. These projects comprise several steps, including, but not limited to, data preprocessing, model training, and making predictions on unseen data. It’s important to process data consistently at every step to ensure reliable and reproducible results.

Scikit-Learn’s Pipeline lets you chain the steps of a multi-step machine learning workflow into a single object, making the workflow easier to maintain. It also ensures all your data is handled uniformly, from start to finish.

 

Why Use Scikit-Learn’s Pipeline?

 
Scikit-Learn’s Pipeline feature works nicely with the rest of the library, exposing the same fit and predict methods as any other estimator. It also simplifies testing by allowing you to evaluate the entire pipeline as a single entity. Additionally, you can perform hyperparameter tuning on the complete pipeline (e.g., using GridSearchCV) rather than optimizing each part separately, as the sketch after the list below shows.

In general, it offers the following benefits:

  • Simplicity: Combine preprocessing and model training in one step.
  • Reusability: Easily reuse the same pipeline with different datasets.
  • Reduced Error: Avoid common mistakes like forgetting to apply transformations to test data.
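
As a quick taste, here is a minimal sketch of tuning an entire pipeline with GridSearchCV. It uses the same scaler-plus-logistic-regression pipeline we will build later in this article; note the 'model__C' syntax, which routes the C hyperparameter to the pipeline step named 'model':

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),  # preprocessing step
    ('model', LogisticRegression())  # classifier step
])

# Hyperparameters are addressed as <step name>__<parameter name>
param_grid = {'model__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)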

 

Step-by-Step Guide to Create Your Machine Learning Pipeline Using Scikit-Learn

 
Let’s create our first ML pipeline using Scikit-Learn. We’ll use a Logistic Regression model to train on the classic Iris dataset. The general process can be broken down into the following steps:
 

Step 1 - Set Up Your Environment and Install Required Libraries

We will first create a fresh Python environment:

python3 -m venv venv
source venv/bin/activate

 

For this project, we only need the Scikit-Learn library. Additionally, we will install Pandas to organize the dataset into a DataFrame for easier exploration and visualization. You can install both libraries with the following command:

pip install scikit-learn pandas
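
To confirm the installation worked, a quick sanity check is:

python -c "import sklearn; print(sklearn.__version__)"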

 

Step 2 - Load the Iris Dataset

The Iris dataset is a simple, built-in dataset in Scikit-Learn, used to classify flowers based on their characteristics like petal and sepal sizes. Let’s load the dataset and view 5 random samples to better understand its structure.

from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df.sample(5)

 

Output: five random rows of the DataFrame, showing the four feature columns (sepal and petal lengths and widths) and the target label.

 

Step 3 - Split the Dataset

A standard approach in machine learning is to split the dataset into training and testing partitions. This lets us train the model on one portion of the data and evaluate its performance on unseen data. We need to be careful that the training and test sets are processed identically, or we will get misleading results. We will see how Scikit-Learn’s Pipeline makes this fairly simple.

Use the code below to split the dataset into training and test sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 

Step 4 - Define Your Pipeline

Even for this simple dataset, we need a preprocessing step to standardize our inputs. The features are numeric values that vary in range. To build a robust machine learning model, we will standardize each feature by centering it on its mean and scaling by its standard deviation (z-score standardization). This is easily done with StandardScaler in Scikit-Learn.

The Logistic Regression classifier will then train on this standardized data, and the testing dataset must also be normalized using the same mean and standard deviation values to maintain consistency.
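
For context, doing this by hand would look roughly like the sketch below. The pipeline we are about to build automates exactly this bookkeeping, so the fitted scaler never has to be tracked manually:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test data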

Now, let's create a sequential pipeline that will handle this for us automatically.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: Standardize features
    ('model', LogisticRegression())  # Step 2: Logistic Regression model
])

 

Step 5 - Train and Evaluate the Model

For training and evaluation, the pipeline exposes the same standard methods (fit, predict) as Scikit-Learn’s machine learning models, making it extremely simple to use.
Now, let’s train and evaluate the model using the code below:

from sklearn.metrics import accuracy_score

pipeline.fit(X_train, y_train)	# Model training

y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy * 100:.2f}%")
# Output -> Model Accuracy: 100.00% 

 

Notice how we didn’t need to manually process the training or testing datasets; our pipeline handled it automatically. This has several practical benefits in production machine learning workflows, especially when multiple features need different handling or when there are several preprocessing steps. Managing processing at multiple pipeline stages, and maintaining it over time, can become fairly difficult. With pipelines, we can aggregate everything in the same place, so it is easy to change one portion of the workflow without having to manage it separately for the training and evaluation stages, as the sketch below shows.
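
For instance, swapping the scaling step for a different one is a one-line change. Here is a small sketch that reuses the pipeline from Step 4; MinMaxScaler is just an illustrative alternative, not a recommendation:

from sklearn.preprocessing import MinMaxScaler

# Replace the 'scaler' step by name; the rest of the workflow is untouched
pipeline.set_params(scaler=MinMaxScaler())
pipeline.fit(X_train, y_train)
print(f"Accuracy with MinMaxScaler: {accuracy_score(y_test, pipeline.predict(X_test)) * 100:.2f}%")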

 

Wrapping Up

 
And done! We created our very own ML pipeline using Scikit-Learn. Even though this was a fairly simple example, it was intended to familiarize you with the use case and show how pipelines can be really beneficial in large-scale projects. To explore further, look into ColumnTransformer, which applies different preprocessing to different columns of the dataset, and FeatureUnion, which combines the outputs of several transformers. If you have a mix of nominal and numeric features, you can apply different preprocessing steps to each and combine everything in a single pipeline, as the sketch below shows.
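
Here is a minimal sketch of that idea; the DataFrame and its column names ('age', 'city') are hypothetical, purely for illustration:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical mixed-type data, for illustration only
X = pd.DataFrame({'age': [25, 32, 47, 51], 'city': ['NY', 'SF', 'NY', 'LA']})
y = [0, 1, 0, 1]

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age']),  # scale the numeric column
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])  # encode the nominal column
])

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', LogisticRegression())
])
pipeline.fit(X, y)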
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.

