Developing End-to-End Data Science Pipelines with Data Ingestion, Processing, and Visualization
Learn how to create a data science pipeline with a complete structure.
Data science projects are not something we develop once and never come back to. They involve a whole process, from acquiring the dataset to maintaining the model. This iterative process ensures that the model always provides value.
The crucial part of a data science project is not the model but what comes before it. If the data quality suffers or the preprocessing is done poorly, the downstream model will not produce valuable output. That is why maintaining a pipeline that handles the end-to-end data science process is so important.
In this article, we will focus on data ingestion, processing, and visualization as part of developing an end-to-end data science pipeline.
Standard End-to-End Data Science Project
When discussing end-to-end data science projects, we discuss both the technical and the business aspects. Data science projects exist to solve business problems, so every step needs to keep the business in mind.
In general, end-to-end data science projects more or less follow these steps:
- Business Understanding
- Data Collection and Preparation
- Building the Machine Learning Model
- Model Optimization
- Model Deployment
- Model Monitoring
Each step is essential and needs to be followed in sequence. If even one of them is missing, our data science project will not provide optimal value; a rough skeleton of how the stages fit together is sketched below.
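To make that structure concrete, here is a minimal sketch of how the stages could be laid out as functions in a single script. The function names are illustrative placeholders only, not the code we build later in this article.
# Illustrative skeleton only: each function is a placeholder for one stage
def ingest_data():
    """Data Collection: load raw data from files, SQL, or an API."""
def prepare_data(raw):
    """Data Preparation: clean, transform, and split the data."""
def train_and_optimize(train_set):
    """Build the machine learning model and tune it."""
def deploy_and_monitor(model):
    """Serve the model and track its performance over time."""
# A driver script would call these in order:
# raw = ingest_data()
# train_set, test_set = prepare_data(raw)
# model = train_and_optimize(train_set)
# deploy_and_monitor(model)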
Considering the above steps, this article will focus on the Data Collection and Preparation step, broken down into data ingestion, processing, and visualization. We will create a simple end-to-end data science pipeline while emphasizing that step.
Let’s start by exploring each of these activities and build the pipeline from there.
Data Ingestion
Data ingestion takes the previously collected dataset and loads it into the working environment for the processes that follow. How we ingest the data differs depending on the data source.
Let me show you some code examples. The easiest case is ingesting data from a CSV or Excel file, which we can do with the following code.
import pandas as pd
data = pd.read_csv('data.csv')
Then, we can ingest data from a data warehouse via SQL by creating a connection to the database.
import sqlalchemy
engine = sqlalchemy.create_engine('sqlite:///example.db')
data = pd.read_sql('SELECT * FROM table_name', engine)
Another popular way is to request the data from an API.
import requests
response = requests.get('https://api.example.com/data')
data = response.json()
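The JSON response above is still a plain Python object. Assuming the endpoint returns a list of records (an assumption here, since the URL is only a placeholder), we can load it straight into a DataFrame for the later steps.
# Assuming the API returns a list of records like [{"id": 1, ...}, ...]
data = pd.DataFrame(response.json())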
Depending on your needs, there are still many other ways to ingest data. For example, you can use web scraping.
import requests
from bs4 import BeautifulSoup
# Fetch the page and parse the HTML so we can extract elements from it
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')
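The soup object still needs to be turned into tabular data. As a minimal sketch, assuming the page contains an HTML table (and that the lxml or html5lib parser is installed), pandas can parse it directly.
# Assuming the page contains at least one HTML <table>;
# read_html returns a list of DataFrames, one per table found
tables = pd.read_html(response.text)
data = tables[0]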
That’s the basics of data ingestion. Let’s explore data processing next.
Data Processing
After we have the data, we must process it further to accommodate the business requirements and tasks. We should pay close attention to this step, as the project quality usually depends on the data processing.
Data exploration and processing are often tied together, so deciding how to process the data comes after exploring it.
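A quick exploration pass, like the minimal sketch below (assuming data is the DataFrame we ingested earlier), helps reveal column types, missing values, and distributions before we commit to any processing step.
# Quick exploration to guide the processing decisions
data.info()                  # column types and non-null counts
print(data.describe())       # summary statistics for numeric columns
print(data.isna().sum())     # missing values per column
With that overview, here are a few common examples of data processing.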
The first data processing step we will look at is data cleaning. We clean the data to improve dataset quality. The example below drops rows with missing values and removes duplicates.
# Data Cleaning
data.dropna(inplace=True)
data.drop_duplicates(inplace=True)
Data transformation is also part of data processing, where the data is converted into other forms the data science project needs.
# Data Transformation
data['date'] = pd.to_datetime(data['date'])
# get_dummies returns one column per category, so expand the DataFrame rather than assigning a single column
data = pd.get_dummies(data, columns=['category'])
We can also transform the data into the scale we need.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data[['feature1', 'feature2']] = scaler.fit_transform(data[['feature1', 'feature2']])
Also, when we process data, we often create new features from existing ones. This process is called Feature Engineering.
# Feature Engineering
data['new_feature'] = data['feature1'] * data['feature2']
Lastly, we can split the data into training and test sets with the following code.
# Data Splitting
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=42)
There are many more things you can do in data processing, depending on your project. Now let’s get into data visualization.
Data Visualization
Data visualization might not be directly related to machine learning development, but it is essential to a data science project. With data visualization, we can better understand data insights and more easily communicate our results.
Here are some code examples to produce the data visualization with Python.
First, we have the correlation heatmap, which helps us understand how the features relate to each other.
import seaborn as sns
import matplotlib.pyplot as plt
corr_matrix = data.select_dtypes(include='number').corr()  # keep only numeric columns for correlation
sns.heatmap(corr_matrix, annot=True)
plt.title('Correlation Heatmap')
plt.show()
Next, we have the pair plot, which draws a two-dimensional plot for every pair of features and shows the distribution of each feature, colored by the target.
sns.pairplot(data, hue='target', diag_kind='kde')
plt.suptitle('Pair Plot', y=1.02)  # title the whole grid rather than a single subplot
plt.show()
Then, we can visualize the feature importance from the model. The model, preprocessor, and feature lists referenced here come from the Titanic pipeline example later in this article.
import numpy as np
importance = model.coef_[0]
features = np.array(list(numeric_features) + list(preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)))
plt.figure(figsize=(10, 8))
sns.barplot(x=importance, y=features)
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
Lastly, we could visualize the confusion matrix during the model evaluation step.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, model.predict(X_test))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()
Developing the Data Science Pipeline
Let’s combine what we have learned above into one data science pipeline incorporating data ingestion, processing, and visualization.
We will use the Titanic dataset to develop a classification model for this example.
First, let’s ingest the data using Pandas.
import pandas as pd
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df_titanic = pd.read_csv(url)
After that, we will perform the data processing. Let’s use the code below to clean the dataset and perform the data transformation.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
df_titanic = df_titanic[['Survived', 'Pclass', 'Sex', 'Parch', 'Fare', 'Age', 'Embarked']]
df_titanic = df_titanic.dropna(subset=['Survived'])
X = df_titanic.drop('Survived', axis=1)
y = df_titanic['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
numeric_features = ['Age', 'Parch', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
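As an optional check (not part of the original pipeline), we can inspect which columns come out of the preprocessor. The get_feature_names_out call on a ColumnTransformer that wraps imputers assumes a reasonably recent scikit-learn version.
# Optional sanity check: list the feature names produced by the preprocessor
# (get_feature_names_out on this ColumnTransformer requires a recent scikit-learn)
print(preprocessor.get_feature_names_out())
print(X_train.shape)  # rows x number of transformed features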
After the data processing, we will develop the machine learning model.
from sklearn.linear_model import LogisticRegression
import joblib
# Train the classifier on the preprocessed training data
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Save the model and the preprocessor so they can be reused later (e.g., for deployment)
joblib.dump(model, 'titanic_logistic_regression_model.joblib')
joblib.dump(preprocessor, 'titanic_preprocessor.joblib')
# Evaluate on the hold-out test set
accuracy = model.score(X_test, y_test)
print(accuracy)
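As a design note, the preprocessor and the model can also be chained into a single scikit-learn Pipeline object, so one artifact handles both transformation and prediction. The sketch below is optional and assumes we keep an untransformed copy of the training split (X_train_raw is a hypothetical name, not a variable defined above).
from sklearn.pipeline import Pipeline
# Optional sketch: bundle preprocessing and the classifier into one estimator
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))])
# This variant expects raw, untransformed features, i.e., the split before
# preprocessor.fit_transform was called above (X_train_raw is hypothetical):
# full_pipeline.fit(X_train_raw, y_train)
# joblib.dump(full_pipeline, 'titanic_pipeline.joblib')  # one artifact instead of two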
Lastly, we can visualize the model’s feature importance and present it to the audience with the following code.
import matplotlib.pyplot as plt
import numpy as np
importance = model.coef_[0]
features = np.array(numeric_features + list(preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)))
plt.figure(figsize=(10, 8))
plt.barh(features, importance)
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
That’s all for developing a simple end-to-end data science pipeline with data ingestion, processing, and visualization. Depending on your data science project, you can add more steps in between.
Conclusion
Standardizing the end-to-end data science pipeline is essential if we want to continuously provide value to the business. By understanding the details of each step, especially data ingestion, processing, and visualization, we can improve the quality of our project and provide the best result to solve the business problem.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.