Developing End-to-End Data Science Pipelines with Data Ingestion, Processing, and Visualization
Learn how to create a data science pipeline with a complete structure.
Data science projects are not something we develop once and never come back to. They involve a whole process, from acquiring the dataset to maintaining the model. This iterative process ensures that the model always provides value.
The crucial part of a data science project is not the model but what comes before it. If the data quality suffers or the preprocessing is done poorly, the downstream model will not produce valuable output. That is why maintaining a pipeline that handles the end-to-end data science process is so important.
In this article, we will focus on data ingestion, processing, and visualization as part of developing an end-to-end data science pipeline.
Standard End-to-End Data Science Project
When discussing end-to-end data science projects, we discuss both the technical and the business aspects. Data science projects exist to solve business problems, so every step needs to keep the business in mind.
In general, end-to-end data science projects more or less follow these steps:
- Business Understanding
- Data Collection and Preparation
- Building the Machine Learning Model
- Model Optimization
- Model Deployment
- Model Monitoring
Each step is essential and needs to be followed in sequence. If even one of them is missing, our data science project will not provide optimal value; a rough skeleton of how the stages fit together is sketched below.
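To make that structure concrete, here is a minimal sketch of how the stages could be laid out as functions in a single script. The function names are illustrative placeholders only, not the code we build later in this article.
# Illustrative skeleton only: each function is a placeholder for one stage
def ingest_data():
    """Data Collection: load raw data from files, SQL, or an API."""
def prepare_data(raw):
    """Data Preparation: clean, transform, and split the data."""
def train_and_optimize(train_set):
    """Build the machine learning model and tune it."""
def deploy_and_monitor(model):
    """Serve the model and track its performance over time."""
# A driver script would call these in order:
# raw = ingest_data()
# train_set, test_set = prepare_data(raw)
# model = train_and_optimize(train_set)
# deploy_and_monitor(model)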
Considering the above steps, this article will focus on the Data Collection and Preparation step, broken down into data ingestion, processing, and visualization. We will create a simple end-to-end data science pipeline while emphasizing that step.
Let’s start by exploring each of these activities and build the pipeline from there.
Data Ingestion
Data ingestion takes the previously collected dataset and loads it into the working environment for the processes that follow. How we ingest the data differs depending on the data source.
Let me show you some code examples. The easiest case is ingesting data from a CSV or Excel file, which we can do with the following code.
import pandas as pd
data = pd.read_csv('data.csv')
Then, we can ingest data from a data warehouse via SQL by creating a connection to the database.
import sqlalchemy
engine = sqlalchemy.create_engine('sqlite:///example.db')
data = pd.read_sql('SELECT * FROM table_name', engine)
Another popular way is to request the data from an API.
import requests
response = requests.get('https://api.example.com/data')
data = response.json()
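The JSON response above is still a plain Python object. Assuming the endpoint returns a list of records (an assumption here, since the URL is only a placeholder), we can load it straight into a DataFrame for the later steps.
# Assuming the API returns a list of records like [{"id": 1, ...}, ...]
data = pd.DataFrame(response.json())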
Depending on your needs, there are still many other ways to ingest data. For example, you can use web scraping.
import requests
from bs4 import BeautifulSoup
# Fetch the page and parse the HTML so we can extract elements from it
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')
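The soup object still needs to be turned into tabular data. As a minimal sketch, assuming the page contains an HTML table (and that the lxml or html5lib parser is installed), pandas can parse it directly.
# Assuming the page contains at least one HTML <table>;
# read_html returns a list of DataFrames, one per table found
tables = pd.read_html(response.text)
data = tables[0]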
That’s the basics of data ingestion. Let’s explore data processing next.
Data Processing
After we have the data, we must process it further to accommodate the business requirements and tasks. We should pay close attention to this step, as the project quality usually depends on the data processing.
Data exploration and processing are often tied together, so deciding how to process the data comes after exploring it.
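A quick exploration pass, like the minimal sketch below (assuming data is the DataFrame we ingested earlier), helps reveal column types, missing values, and distributions before we commit to any processing step.
# Quick exploration to guide the processing decisions
data.info()                  # column types and non-null counts
print(data.describe())       # summary statistics for numeric columns
print(data.isna().sum())     # missing values per column
With that overview, here are a few common examples of data processing.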
The first data processing step we will look at is data cleaning. We clean the data to improve dataset quality. The example below drops rows with missing values and removes duplicates.
# Data Cleaning
data.dropna(inplace=True)
data.drop_duplicates(inplace=True)
Data transformation is also part of data processing, where the data is converted into other forms the data science project needs.
# Data Transformation
data['date'] = pd.to_datetime(data['date'])
# get_dummies returns one column per category, so expand the DataFrame rather than assigning a single column
data = pd.get_dummies(data, columns=['category'])
We can also transform the data into the scale we need.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data[['feature1', 'feature2']] = scaler.fit_transform(data[['feature1', 'feature2']])
Also, when we process data, we often create new features from existing ones. This process is called Feature Engineering.
# Feature Engineering
data['new_feature'] = data['feature1'] * data['feature2']
Lastly, we can split the data into training and test sets with the following code.
# Data Splitting
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=42)
There are many more things you can do in data processing, depending on your project. Now let’s get into data visualization.
Data Visualization
Data visualization might not be directly related to machine learning development, but it is essential to a data science project. With data visualization, we can better understand data insights and more easily communicate our results.
Here are some code examples to produce the data visualization with Python.
First, we have the correlation heatmap, which helps us understand how the features relate to each other.
import seaborn as sns
import matplotlib.pyplot as plt
corr_matrix = data.select_dtypes(include='number').corr()  # keep only numeric columns for correlation
sns.heatmap(corr_matrix, annot=True)
plt.title('Correlation Heatmap')
plt.show()
Next, we have the pair plot, which draws a two-dimensional plot for every pair of features and shows the distribution of each feature, colored by the target.
sns.pairplot(data, hue='target', diag_kind='kde')
plt.suptitle('Pair Plot', y=1.02)  # title the whole grid rather than a single subplot
plt.show()
Then, we can visualize the feature importance from the model. The model, preprocessor, and feature lists referenced here come from the Titanic pipeline example later in this article.
import numpy as np
importance = model.coef_[0]
features = np.array(list(numeric_features) + list(preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)))
plt.figure(figsize=(10, 8))
sns.barplot(x=importance, y=features)
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
Lastly, we could visualize the confusion matrix during the model evaluation step.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, model.predict(X_test))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()
Developing the Data Science Pipeline
Let’s combine what we have learned above into one data science pipeline incorporating data ingestion, processing, and visualization.
We will use the Titanic dataset to develop a classification model for this example.
First, let’s ingest the data using Pandas.
import pandas as pd
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df_titanic = pd.read_csv(url)
After that, we will perform the data processing. Let’s use the code below to clean the dataset and perform the data transformation.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
df_titanic = df_titanic[['Survived', 'Pclass', 'Sex', 'Parch', 'Fare', 'Age', 'Embarked']]
df_titanic = df_titanic.dropna(subset=['Survived'])
X = df_titanic.drop('Survived', axis=1)
y = df_titanic['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
numeric_features = ['Age', 'Parch', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
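As an optional check (not part of the original pipeline), we can inspect which columns come out of the preprocessor. The get_feature_names_out call on a ColumnTransformer that wraps imputers assumes a reasonably recent scikit-learn version.
# Optional sanity check: list the feature names produced by the preprocessor
# (get_feature_names_out on this ColumnTransformer requires a recent scikit-learn)
print(preprocessor.get_feature_names_out())
print(X_train.shape)  # rows x number of transformed features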
After the data processing, we will develop the machine learning model.
from sklearn.linear_model import LogisticRegression
import joblib
# Train the classifier on the preprocessed training data
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Save the model and the preprocessor so they can be reused later (e.g., for deployment)
joblib.dump(model, 'titanic_logistic_regression_model.joblib')
joblib.dump(preprocessor, 'titanic_preprocessor.joblib')
# Evaluate on the hold-out test set
accuracy = model.score(X_test, y_test)
print(accuracy)
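As a design note, the preprocessor and the model can also be chained into a single scikit-learn Pipeline object, so one artifact handles both transformation and prediction. The sketch below is optional and assumes we keep an untransformed copy of the training split (X_train_raw is a hypothetical name, not a variable defined above).
from sklearn.pipeline import Pipeline
# Optional sketch: bundle preprocessing and the classifier into one estimator
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))])
# This variant expects raw, untransformed features, i.e., the split before
# preprocessor.fit_transform was called above (X_train_raw is hypothetical):
# full_pipeline.fit(X_train_raw, y_train)
# joblib.dump(full_pipeline, 'titanic_pipeline.joblib')  # one artifact instead of two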
Lastly, we can visualize the model’s feature importance and present it to the audience with the following code.
import matplotlib.pyplot as plt
import numpy as np
importance = model.coef_[0]
features = np.array(numeric_features + list(preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)))
plt.figure(figsize=(10, 8))
plt.barh(features, importance)
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
That’s all for developing a simple end-to-end data science pipeline with data ingestion, processing, and visualization. Depending on your data science project, you can add more steps in between.
Conclusion
Standardizing the end-to-end data science pipeline is essential if we want to continuously provide value to the business. By understanding the details of each step, especially data ingestion, processing, and visualization, we can improve the quality of our project and provide the best result to solve the business problem.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.