Build a Data Science App with Python in 10 Easy Steps

Learn how to build a data science app with Python, using Scikit-Learn and FastAPI, one step at a time.

By Bala Priya C, KDnuggets Contributing Editor & Technical Content Specialist on November 25, 2024 in Python

Looking to further your data science skills? Building a data science app is a great way to learn more.

Building a data science application involves multiple steps—from data collection and preprocessing to model training and serving predictions via an API. This step-by-step tutorial will guide you through the process of creating a simple data science app.

We'll use Python, scikit-learn, and FastAPI to train a machine learning model and build an API to serve its predictions. To keep things simple, we’ll use the built-in wine dataset from scikit-learn. Let’s get started!

▶️ You can find the code on GitHub.

Step 1: Setting Up the Environment

You should have a recent version of Python installed. Then, install the necessary libraries for building the machine learning model and the API to serve the predictions:

$ pip3 install fastapi uvicorn scikit-learn pandas

Note: Be sure to install the required libraries in a virtual environment for the project.
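To confirm that the libraries are visible from the active environment, a small check like the one below can help. It is only a sketch: the helper name `installed_versions` is made up for illustration, and it relies on the standard library's `importlib.metadata` (Python 3.8+).

```python
# Sketch: report installed versions of the packages this tutorial relies on.
from importlib.metadata import PackageNotFoundError, version

REQUIRED = ["fastapi", "uvicorn", "scikit-learn", "pandas"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(REQUIRED)
for name, ver in versions.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED usually means the virtual environment is not activated or the install step was run elsewhere.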

Step 2: Loading the Dataset

We will use scikit-learn's wine dataset. Let’s load the dataset and convert it into a pandas dataframe for easy manipulation:

# model_training.py
from sklearn.datasets import load_wine
import pandas as pd

def load_wine_data():
    wine_data = load_wine()
    df = pd.DataFrame(data=wine_data.data, columns=wine_data.feature_names)
    df['target'] = wine_data.target  # Adding the target (wine class: one of three cultivars)
    return df
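As an aside, scikit-learn can also hand back the data as a DataFrame directly. The sketch below uses the `as_frame=True` option of `load_wine`. It is an alternative to the function above, not what the tutorial's script does.

```python
# Sketch: let scikit-learn build the DataFrame (features plus 'target').
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)
df = wine.frame  # 13 feature columns and one 'target' column

print(df.shape)
print(list(wine.target_names))  # names of the three wine classes
```

Both routes give the same table, so the rest of the steps work with either.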

Step 3: Exploring the Dataset

Before we proceed, it’s good practice to explore the dataset a bit.

# model_training.py
if __name__ == "__main__":
    df = load_wine_data()
    print(df.head())
    print(df.describe())
    print(df['target'].value_counts())  # Distribution of the three wine classes

Here, we perform a preliminary exploration of the dataset by displaying the first few rows, generating summary statistics, and checking the distribution of the output classes.
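A couple of further checks are often worth running at this stage. The sketch below rebuilds the DataFrame the same way as in Step 2, then looks for missing values and compares two features across classes. The choice of `alcohol` and `proline` is arbitrary, for illustration only.

```python
# Sketch: extra exploration on the wine DataFrame.
import pandas as pd
from sklearn.datasets import load_wine

wine_data = load_wine()
df = pd.DataFrame(data=wine_data.data, columns=wine_data.feature_names)
df["target"] = wine_data.target

# Total number of missing cells across the whole table
missing_total = int(df.isna().sum().sum())

# Samples per class, ordered by class label
class_counts = df["target"].value_counts().sort_index()

# Average alcohol and proline per class (features picked arbitrarily)
mean_by_class = df.groupby("target")[["alcohol", "proline"]].mean()

print(missing_total)
print(class_counts)
print(mean_by_class)
```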

Step 4: Data Preprocessing

Next, we will preprocess the dataset. We split the dataset into training and test sets, and scale the features.

The preprocess_data function does just that:

# model_training.py
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_data(df):
    X = df.drop('target', axis=1)  # Features
    y = df['target']  # Target (wine class)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)

    # Feature scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled, y_train, y_test

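Feature scaling with StandardScaler standardizes each feature to zero mean and unit variance, so features measured on large scales (such as proline) do not dominate those on small scales (such as hue). The scaler is fit on the training set only and then applied to the test set, which avoids leaking test-set statistics into training.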
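To see what the scaler does to the numbers, here is a minimal sketch using the same split settings as `preprocess_data`. It works on NumPy arrays via `return_X_y=True` rather than on the DataFrame, purely to keep the example short.

```python
# Sketch: inspect feature spreads before and after StandardScaler.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=27
)

# Raw standard deviations differ by orders of magnitude across features
raw_std = X_train.std(axis=0)
print(raw_std.min(), raw_std.max())

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print(train_means.round(6), train_stds.round(6))
```

The test set is transformed with the training statistics, so its means and standard deviations will be close to, but not exactly, 0 and 1.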

Step 5: Training the Logistic Regression Model

Let’s now train a LogisticRegression model on the preprocessed data and save the model to a pickle file. The following function train_model does that:

# model_training.py
from sklearn.linear_model import LogisticRegression
import pickle

def train_model(X_train, y_train):
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)

    # Save the trained model using pickle
    with open('classifier.pkl', 'wb') as f:
        pickle.dump(model, f)

    return model
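A quick way to gain confidence in the saved file is a pickle round trip. The sketch below does it in memory with `pickle.dumps` and `pickle.loads` instead of writing `classifier.pkl`, and fits on the full scaled dataset just to have a model to serialize. Remember that pickle files should only be loaded from sources you trust.

```python
# Sketch: check that a pickled model predicts the same as the original.
import pickle

import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

model = LogisticRegression(random_state=42).fit(X_scaled, y)

# Serialize to bytes and restore, mimicking the save and load steps
restored = pickle.loads(pickle.dumps(model))

same_predictions = np.array_equal(
    model.predict(X_scaled), restored.predict(X_scaled)
)
print(same_predictions)
```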

Step 6: Evaluating the Model

Once the model is trained, we evaluate its performance by calculating the accuracy on the test set. To do so, let’s define the function evaluate_model like so:

# model_training.py
from sklearn.metrics import accuracy_score

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")

if __name__ == "__main__":
    df = load_wine_data()
    X_train_scaled, X_test_scaled, y_train, y_test = preprocess_data(df)
    model = train_model(X_train_scaled, y_train)
    evaluate_model(model, X_test_scaled, y_test)

When you run the script, the data is loaded and preprocessed, and the model is trained and evaluated. The evaluation step prints:

Accuracy: 0.98
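Accuracy alone hides which classes get confused with each other. As a sketch of a slightly richer evaluation, the snippet below repeats the same split, scaling, and model settings, then prints a confusion matrix and a per-class report using `sklearn.metrics`.

```python
# Sketch: per-class evaluation with a confusion matrix and report.
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=27
)

scaler = StandardScaler()
model = LogisticRegression(random_state=42)
model.fit(scaler.fit_transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

Off-diagonal entries in the matrix show exactly which classes were mistaken for which.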

Step 7: Setting Up FastAPI

Now, we’ll set up a basic FastAPI application that will serve predictions using our trained model.

# app.py
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"message": "A Simple Prediction API"}

In this step, we set up a basic FastAPI application and define a root endpoint. Served with Uvicorn, it acts as a simple web server that responds to HTTP requests.

You can run the FastAPI app with:

uvicorn app:app --reload

Go to http://127.0.0.1:8000 to see the message.

Step 8: Loading the Model in FastAPI

Next, let’s define a function that loads the pre-trained logistic regression model inside our FastAPI application so that it can be used to make predictions.

# app.py
import pickle

def load_model():
    with open('classifier.pkl', 'rb') as f:  # same path that train_model wrote to
        model = pickle.load(f)
    return model

Calling load_model() returns the trained classifier, ready to make predictions when requests are received.
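In Step 9 the loader is called inside the request handler, so the file is read on every request. One possible way to avoid that is to cache the loaded object. The sketch below uses `functools.lru_cache`. The name `load_model_cached` and the stand-in pickle file are hypothetical, used only so the example runs without a trained model.

```python
# Sketch: read the pickle once and reuse the loaded object afterwards.
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=1)
def load_model_cached(path):
    """Unpickle the object at `path`; repeat calls reuse the cached result."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for classifier.pkl (assumption: any picklable object will do here)
demo_path = str(Path(tempfile.mkdtemp()) / "demo.pkl")
with open(demo_path, "wb") as f:
    pickle.dump({"name": "stand-in model"}, f)

first = load_model_cached(demo_path)   # reads the file
second = load_model_cached(demo_path)  # served from the cache
print(first is second)
```

Loading the model once at module import time is another common option. Either way, the server needs a restart to pick up a retrained model.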

Step 9: Creating the Prediction Endpoint

We’ll define an endpoint to accept wine features as input and return the predicted wine class.

Define Input Data Model

The prediction endpoint accepts wine feature data in JSON format. The input data model, defined using Pydantic, validates the incoming data: each of the 13 features must be present and parseable as a float.

# app.py
from pydantic import BaseModel

class WineFeatures(BaseModel):
    alcohol: float
    malic_acid: float
    ash: float
    alcalinity_of_ash: float
    magnesium: float
    total_phenols: float
    flavanoids: float
    nonflavanoid_phenols: float
    proanthocyanins: float
    color_intensity: float
    hue: float
    od280_od315_of_diluted_wines: float
    proline: float
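To get a feel for what the validation does outside of FastAPI, the model can be exercised directly. The sketch below repeats the class definition so that it runs on its own. The sample values are the ones used later in Step 10, and the bad value `"strong"` is made up for illustration.

```python
# Sketch: how the Pydantic model accepts good input and rejects bad input.
from pydantic import BaseModel, ValidationError


class WineFeatures(BaseModel):
    alcohol: float
    malic_acid: float
    ash: float
    alcalinity_of_ash: float
    magnesium: float
    total_phenols: float
    flavanoids: float
    nonflavanoid_phenols: float
    proanthocyanins: float
    color_intensity: float
    hue: float
    od280_od315_of_diluted_wines: float
    proline: float


sample = {
    "alcohol": 13.0, "malic_acid": 2.14, "ash": 2.35,
    "alcalinity_of_ash": 20.0, "magnesium": 120, "total_phenols": 3.1,
    "flavanoids": 2.6, "nonflavanoid_phenols": 0.29,
    "proanthocyanins": 2.29, "color_intensity": 5.64, "hue": 1.04,
    "od280_od315_of_diluted_wines": 3.92, "proline": 1065,
}

features = WineFeatures(**sample)  # integers such as 120 are coerced to float

bad = dict(sample, alcohol="strong")  # not parseable as a number
try:
    WineFeatures(**bad)
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc)
```

Inside FastAPI, the same failure is turned into an HTTP 422 response, so the endpoint function never sees invalid data.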

Prediction Endpoint

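When a request is received, the API uses the loaded model to predict the wine class based on the provided features. Keep in mind that the model was trained on standardized features, while the endpoint below passes the raw values straight to the model. For reliable predictions, the scaler fitted in Step 4 would need to be saved and applied to the input as well.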

# app.py
@app.post("/predict")
def predict_wine(features: WineFeatures):
    model = load_model()
    input_data = [[
        features.alcohol, features.malic_acid, features.ash, features.alcalinity_of_ash,
        features.magnesium, features.total_phenols, features.flavanoids,
        features.nonflavanoid_phenols, features.proanthocyanins, features.color_intensity,
        features.hue, features.od280_od315_of_diluted_wines, features.proline
    ]]
    
    prediction = model.predict(input_data)
    return {"prediction": int(prediction[0])}
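Because the classifier in Step 5 was trained on scaled features, raw request values should go through the same scaling before prediction. One possible way to guarantee this, sketched below, is to bundle the scaler and the classifier in a scikit-learn pipeline and pickle that single object. This is a suggested variation, not the tutorial's original code. The sample row reuses the values from Step 10, in the dataset's column order.

```python
# Sketch: bundle scaling and classification so raw inputs can be served.
import pickle

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=27
)

# The pipeline fits the scaler on the training data, then the classifier
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"Accuracy: {test_accuracy:.2f}")

# In-memory round trip standing in for saving and loading a pickle file
served = pickle.loads(pickle.dumps(pipeline))

# Raw, unscaled feature values in the dataset's column order
raw_sample = [[13.0, 2.14, 2.35, 20.0, 120, 3.1, 2.6,
               0.29, 2.29, 5.64, 1.04, 3.92, 1065]]
prediction = int(served.predict(raw_sample)[0])
print(prediction)
```

With this approach the endpoint can keep calling `model.predict(input_data)` unchanged, since the loaded object applies the scaling itself.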

Step 10: Testing the Application Locally

If the server is not already running, start the app again with:

uvicorn app:app --reload

To test the application, send a POST request to the /predict endpoint with wine feature data:

curl -X POST "http://127.0.0.1:8000/predict" \
-H "Content-Type: application/json" \
-d '{
	"alcohol": 13.0,
	"malic_acid": 2.14,
	"ash": 2.35,
	"alcalinity_of_ash": 20.0,
	"magnesium": 120,
	"total_phenols": 3.1,
	"flavanoids": 2.6,
	"nonflavanoid_phenols": 0.29,
	"proanthocyanins": 2.29,
	"color_intensity": 5.64,
	"hue": 1.04,
	"od280_od315_of_diluted_wines": 3.92,
	"proline": 1065
}'

Testing locally helps confirm that the API works as intended before any deployment. For the sample wine features above, the endpoint returns the predicted class:

{"prediction":0}
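The same request can also be sent from Python. The sketch below uses only the standard library (`urllib.request`). The helper names `build_predict_request` and `predict` are hypothetical, and the URL assumes the default Uvicorn host and port used above.

```python
# Sketch: a Python equivalent of the curl command, standard library only.
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/predict"  # assumes the default uvicorn address

sample = {
    "alcohol": 13.0, "malic_acid": 2.14, "ash": 2.35,
    "alcalinity_of_ash": 20.0, "magnesium": 120, "total_phenols": 3.1,
    "flavanoids": 2.6, "nonflavanoid_phenols": 0.29,
    "proanthocyanins": 2.29, "color_intensity": 5.64, "hue": 1.04,
    "od280_od315_of_diluted_wines": 3.92, "proline": 1065,
}


def build_predict_request(payload, url=API_URL):
    """Build a JSON POST request for the /predict endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(payload, url=API_URL):
    """Send the request and decode the JSON reply (server must be running)."""
    with urllib.request.urlopen(build_predict_request(payload, url), timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_predict_request(sample)
print(request.get_method(), request.full_url)
# print(predict(sample))  # uncomment once the uvicorn server is up
```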

Wrapping Up

We’ve built a simple yet functional data science app.

After building a machine learning model with scikit-learn, we used FastAPI to create an API that accepts user input and returns predictions. You can try building more complex models, adding features, and much more.

As a next step, you can explore different datasets, models, or even deploy the application to production. Read A Practical Guide to Deploying Machine Learning Models to learn more.

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.