Docker for Python & Data Projects: A Beginner’s Guide

Managing dependencies for Python data projects can get messy fast. Docker helps you create consistent environments you can build, share, and deploy with ease.




 

Introduction

 
Python and data projects have a dependency problem. Between Python versions, virtual environments, system-level packages, and operating system differences, getting someone else's code to run on your machine can sometimes take longer than understanding the code itself.

Docker solves this by packaging your code and its entire environment — Python version, dependencies, system libraries — into a single artifact called an image. From that image you can start containers that run identically on your laptop, your teammate's machine, and a cloud server. You stop debugging environments and start shipping work.

In this article, you'll learn Docker through practical examples with a focus on data projects: containerizing a script, serving a machine learning model with FastAPI, wiring up a multi-service pipeline with Docker Compose, and scheduling a job with a cron container.

 

Prerequisites

 
Before working through the examples, you'll need:

  • Docker and Docker Compose installed for your operating system. Follow the official installation guide for your platform.
  • Familiarity with the command line and Python.
  • Basic familiarity with writing a Dockerfile, building an image, and running a container from that image.


You don't need deep Docker knowledge to follow along. Each example explains what's happening as it goes.

 

Containerizing a Python Script with Pinned Dependencies

 
Let's start with the most common use case: you have a Python script and a requirements.txt, and you want it to run reliably anywhere.

We'll build a data cleaning script that reads a raw sales CSV file, removes duplicates, fills in missing values, and writes a cleaned version to disk.

 

// Structuring the Project

The project is organized as follows:

data-cleaner/
├── Dockerfile
├── requirements.txt
├── clean_data.py
└── data/
    └── raw_sales.csv

 

// Writing the Script

Here's the data cleaning script that uses Pandas to do the heavy lifting:

# clean_data.py
import pandas as pd

INPUT_PATH = "data/raw_sales.csv"
OUTPUT_PATH = "data/cleaned_sales.csv"

print("Reading data...")
df = pd.read_csv(INPUT_PATH)
print(f"Rows before cleaning: {len(df)}")

# Drop duplicate rows
df = df.drop_duplicates()

# Fill missing numeric values with column median
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df[col].median())

# Fill missing text values with 'Unknown'
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna('Unknown')

print(f"Rows after cleaning: {len(df)}")
df.to_csv(OUTPUT_PATH, index=False)
print(f"Cleaned file saved to {OUTPUT_PATH}")
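To see the cleaning logic in isolation, here is a small in-memory run of the same rules on toy data (the values and columns are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame with one duplicate row and two missing values
df = pd.DataFrame({
    "region": ["North", "North", "South", None],
    "revenue": [100.0, 100.0, np.nan, 300.0],
})

df = df.drop_duplicates()  # removes the second "North" row

# Same fill rules as clean_data.py
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())  # median of 100 and 300 is 200
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna("Unknown")

print(df.to_dict(orient="list"))
```

The missing revenue becomes the column median (200.0) and the missing region becomes "Unknown", which is exactly what the container will do to your real CSV.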

 

// Pinning Dependencies

Pinning exact versions is important. Without it, pip install pandas might install different versions on different machines. Pinned versions guarantee everyone gets the same behavior. You can define the exact versions in the requirements.txt file like so:

pandas==2.2.0
openpyxl==3.1.2

 

// Defining the Dockerfile

This Dockerfile builds a minimal, cache-friendly image for the cleaning script:

# Use a slim Python 3.11 base image
FROM python:3.11-slim

# Set the working directory inside the container
WORKDIR /app

# Copy and install dependencies first (for layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the script into the container
COPY clean_data.py .

# Default command to run when the container starts
CMD ["python", "clean_data.py"]

 
There are a few things worth explaining here. We use python:3.11-slim instead of the full Python image because it's significantly smaller and strips out packages you don't need.

We copy requirements.txt before copying the rest of the code, and this is intentional. Docker builds images in layers and caches each one. If you change only clean_data.py, Docker won't reinstall your dependencies on the next build; it reuses the cached pip layer and jumps straight to copying your updated script. That small ordering decision can save you minutes of rebuild time.

 

// Building and Running

With the image built, you can run the container and mount your local data folder:

# Build the image and tag it
docker build -t data-cleaner .

# Run it, mounting your local data/ folder into the container
docker run --rm -v $(pwd)/data:/app/data data-cleaner

 
The -v $(pwd)/data:/app/data flag mounts your local data/ folder into the container at /app/data. This is how the script reads your CSV and how the cleaned output gets written back to your machine. Nothing is baked into the image, and the data stays on your filesystem.

The --rm flag automatically removes the container after it finishes. Since this is a one-off script, there's no reason to keep a stopped container lying around.

 

Serving a Machine Learning Model with FastAPI

 
You've trained a model and you want to make it available over HTTP so other services can send data and get predictions back. FastAPI works great for this: it's fast, lightweight, and handles input validation with Pydantic.

 

// Structuring the Project

The project separates the model artifact from the application code:

ml-api/
├── Dockerfile
├── requirements.txt
├── app.py
└── model.pkl

 

// Writing the App

The following app loads the model once at startup and exposes a /predict endpoint:

# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pickle

app = FastAPI(title="Sales Forecast API")

# Load the model once at startup
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictRequest(BaseModel):
    region: str
    month: int
    marketing_spend: float
    units_in_stock: int

class PredictResponse(BaseModel):
    region: str
    predicted_revenue: float

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    try:
        features = [[
            request.month,
            request.marketing_spend,
            request.units_in_stock
        ]]
        prediction = model.predict(features)
        return PredictResponse(
            region=request.region,
            predicted_revenue=round(float(prediction[0]), 2)
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

 
The PredictRequest class does the input validation for you. If someone sends a request with a missing field or a string where a number is expected, FastAPI rejects it with a clear error message before your model code even runs. The model is loaded once at startup — not on every request — which keeps response times fast.

The /health endpoint is a small but important addition: Docker, load balancers, and cloud platforms use it to check whether your service is actually up and ready.
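The app assumes a model.pkl already exists. If you just want to exercise the API locally, you could generate a throwaway artifact with scikit-learn (assuming scikit-learn is available and listed in requirements.txt; the training numbers below are fabricated purely so the file exists, not real sales data):

```python
# make_model.py -- create a throwaway model.pkl for local testing
import pickle
from sklearn.linear_model import LinearRegression

# Fabricated training rows: [month, marketing_spend, units_in_stock] -> revenue
X = [
    [1, 1000.0, 100],
    [2, 2000.0, 150],
    [3, 3000.0, 200],
    [4, 4000.0, 250],
]
y = [5000.0, 9000.0, 13000.0, 17000.0]

model = LinearRegression().fit(X, y)

# Serialize the fitted model the same way app.py expects to load it
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

print("Wrote model.pkl")
```

Run this once before building the image and the Dockerfile's `COPY model.pkl .` step will pick it up.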

 

// Defining the Dockerfile

This Dockerfile bakes the model directly into the image so the container is fully self-contained:

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model and the app together
COPY model.pkl .
COPY app.py .

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

 
The model.pkl is baked into the image at build time. This means the container is completely self-contained, and you don't need to mount anything when you run it. The --host 0.0.0.0 flag tells Uvicorn to listen on all network interfaces inside the container, not just localhost. Without this, you won't be able to reach the API from outside the container.
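The loopback-versus-all-interfaces distinction is easy to see with plain sockets. This is a toy illustration of the addressing concept only, nothing to do with Uvicorn's internals:

```python
import socket

# A socket bound to 127.0.0.1 accepts connections only from the same host...
loopback = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
loopback.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
loop_addr = loopback.getsockname()

# ...while 0.0.0.0 listens on every interface, which is what a container
# needs so traffic arriving through the published port can reach it.
all_ifaces = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
all_ifaces.bind(("0.0.0.0", 0))
all_addr = all_ifaces.getsockname()

print("loopback bound to:", loop_addr[0])
print("all interfaces bound to:", all_addr[0])

loopback.close()
all_ifaces.close()
```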

 

// Building and Running

Build the image and start the API server:

docker build -t ml-api .
docker run --rm -p 8000:8000 ml-api

 
Test it with curl:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"region": "North", "month": 3, "marketing_spend": 5000.0, "units_in_stock": 320}'

 

Building a Multi-Service Pipeline with Docker Compose

 
Real data projects rarely involve just one process. You might need a database, a script that loads data into it, and a dashboard that reads from it — all running together.

Docker Compose lets you define and run multiple containers as a single application. Each service has its own container, but they all share a private network so they can talk to each other.

 

// Structuring the Project

The pipeline splits each service into its own subdirectory:

pipeline/
├── docker-compose.yml
├── loader/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── load_data.py
│   └── sales_data.csv
└── dashboard/
    ├── Dockerfile
    ├── requirements.txt
    └── app.py

 

// Defining the Compose File

This Compose file declares all three services and wires them together with health checks and shared URL environment variables:

# docker-compose.yml

services:

  db:
    image: postgres:15
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: analytics
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin -d analytics"]
      interval: 5s
      retries: 5

  loader:
    build: ./loader
    depends_on:
      db:
        condition: service_healthy
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics

  dashboard:
    build: ./dashboard
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "8501:8501"
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics

volumes:
  pgdata:

 

// Writing the Loader Script

This script waits briefly for the database, then loads a CSV into the sales table using SQLAlchemy:

# loader/load_data.py
import pandas as pd
from sqlalchemy import create_engine
import os
import time

DATABASE_URL = os.environ["DATABASE_URL"]

# Give the DB a moment to be fully ready
time.sleep(3)

engine = create_engine(DATABASE_URL)

df = pd.read_csv("sales_data.csv")
df.to_sql("sales", engine, if_exists="replace", index=False)

print(f"Loaded {len(df)} rows into the sales table.")

 
Let’s take a closer look at the Compose file. Each service runs in its own container, but they're all on the same Docker-managed network, so they can reach each other using the service name as a hostname. The loader connects to db:5432 — and not localhost — because db is the service name, and Docker handles the DNS resolution automatically.

The healthcheck on the PostgreSQL service is important. depends_on alone only waits for the container to start, not for PostgreSQL to be ready to accept connections. The healthcheck uses pg_isready to confirm the database is actually up before the loader tries to connect. The pgdata volume persists the database between runs; stopping and restarting the pipeline won't wipe your data.
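You can see the service-name-as-hostname convention by parsing the connection string with the standard library (same URL as in the Compose file):

```python
from urllib.parse import urlparse

# The hostname is the Compose service name, not an IP or localhost;
# Docker's embedded DNS resolves "db" to the container's address.
url = urlparse("postgresql://admin:secret@db:5432/analytics")

print("host:", url.hostname)              # db
print("port:", url.port)                  # 5432
print("database:", url.path.lstrip("/"))  # analytics
```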

 

// Starting Everything

Bring up all services with a single command:

docker compose up --build

 
To stop everything, run:

docker compose down

 

Scheduling Jobs with a Cron Container

 
Sometimes you need a script to run on a schedule. Maybe it fetches data from an API every hour and writes it to a database or a file. You don't want to set up a full orchestration system like Airflow for something this simple. A cron container does the job cleanly.

 

// Structuring the Project

The project includes a crontab file alongside the script and Dockerfile:

data-fetcher/
├── Dockerfile
├── requirements.txt
├── fetch_data.py
└── crontab

 

// Writing the Fetch Script

This script uses Requests to hit an API endpoint and saves the results as a timestamped CSV:

# fetch_data.py
import requests
import pandas as pd
from datetime import datetime
import os

API_URL = "https://api.example.com/sales/latest"
OUTPUT_DIR = "/app/output"

os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"[{datetime.now()}] Fetching data...")

response = requests.get(API_URL, timeout=10)
response.raise_for_status()

data = response.json()
df = pd.DataFrame(data["records"])

timestamp = datetime.now().strftime("%Y%m%d_%H%M")
output_path = f"{OUTPUT_DIR}/sales_{timestamp}.csv"
df.to_csv(output_path, index=False)

print(f"[{datetime.now()}] Saved {len(df)} records to {output_path}")
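The strftime pattern in the script produces filenames that sort chronologically. On a fixed datetime (so the result is reproducible) it looks like this:

```python
from datetime import datetime

# Fixed datetime instead of datetime.now(), so the output is deterministic
ts = datetime(2024, 3, 1, 14, 30).strftime("%Y%m%d_%H%M")
filename = f"sales_{ts}.csv"

print(filename)  # sales_20240301_1430.csv
```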

 

// Defining the Crontab

The crontab schedules the script to run every hour and redirects all output to a log file:

# Run every hour, on the hour. Use the full interpreter path: cron runs
# jobs with a minimal default PATH that does not include /usr/local/bin,
# where the python:3.11-slim image installs Python.
0 * * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1

 
The >> /var/log/fetch.log 2>&1 part redirects both standard output and error output to a log file. This is how you inspect what happened after the fact.

 

// Defining the Dockerfile

This Dockerfile installs cron, registers the schedule, and keeps it running in the foreground:

FROM python:3.11-slim

# Install cron
RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY fetch_data.py .
COPY crontab /etc/cron.d/fetch-job

# Set correct permissions and register the crontab
RUN chmod 0644 /etc/cron.d/fetch-job && crontab /etc/cron.d/fetch-job

# cron -f runs cron in the foreground, which is required for Docker
CMD ["cron", "-f"]

 
The cron -f flag is important here. Docker keeps a container alive as long as its main process is running. If cron ran in the background (its default), the main process would exit immediately and Docker would stop the container. The -f flag keeps cron running in the foreground so the container stays alive.

 

// Building and Running

Build the image and start the container in detached mode:

docker build -t data-fetcher .
docker run -d --name fetcher -v $(pwd)/output:/app/output data-fetcher

 
Check the logs any time:

docker exec fetcher cat /var/log/fetch.log

 
The output folder is mounted from your local machine, so the CSV files land on your filesystem even though the script runs inside the container.

 

Wrapping Up

 
I hope you found this Docker article helpful. Docker doesn't have to be complicated. Start with the first example, swap in your own script and dependencies, and get comfortable with the build-run cycle. Once you've done that, the other patterns follow naturally. Docker is a good fit when:

  • You need reproducible environments across machines or team members
  • You're sharing scripts or models that have specific dependency requirements
  • You're building multi-service systems that need to run together reliably
  • You want to deploy anywhere without setup friction

That said, you don’t always need to use Docker for all of your Python work. It's probably overkill when:

  • You're doing quick, exploratory analysis only for yourself
  • Your script has no external dependencies beyond the standard library
  • You're early in a project and your requirements are changing rapidly

If you're interested in going further, check out 5 Simple Steps to Mastering Docker for Data Science.

Happy coding!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

