Docker for Python & Data Projects: A Beginner’s Guide
Managing dependencies for Python data projects can get messy fast. Docker helps you create consistent environments you can build, share, and deploy with ease.

# Introduction
Python and data projects have a dependency problem. Between Python versions, virtual environments, system-level packages, and operating system differences, getting someone else's code to run on your machine can sometimes take longer than understanding the code itself.
Docker solves this by packaging your code and its entire environment — Python version, dependencies, system libraries — into a single artifact called an image. From that image you can start containers that run identically on your laptop, your teammate's machine, and a cloud server. You stop debugging environments and start shipping work.
In this article, you'll learn Docker through practical examples with a focus on data projects: containerizing a script, serving a machine learning model with FastAPI, wiring up a multi-service pipeline with Docker Compose, and scheduling a job with a cron container.
# Prerequisites
Before working through the examples, you'll need:
- Docker and Docker Compose installed for your operating system. Follow the official installation guide for your platform.
- Familiarity with the command line and Python.
- Familiarity with writing a Dockerfile, building an image, and running a container from that image.
If you’d like a quick refresher, here are a couple of articles to get you up to speed:
- 10 Essential Docker Concepts Explained in Under 10 Minutes
- A Gentle Introduction to Docker for Python Developers
You don't need deep Docker knowledge to follow along. Each example explains what's happening as it goes.
# Containerizing a Python Script with Pinned Dependencies
Let's start with the most common use case: you have a Python script and a requirements.txt, and you want it to run reliably anywhere.
We'll build a data cleaning script that reads a raw sales CSV file, removes duplicates, fills in missing values, and writes a cleaned version to disk.
// Structuring the Project
The project is organized as follows:
data-cleaner/
├── Dockerfile
├── requirements.txt
├── clean_data.py
└── data/
    └── raw_sales.csv
// Writing the Script
Here's the data cleaning script that uses Pandas to do the heavy lifting:
# clean_data.py
import pandas as pd

INPUT_PATH = "data/raw_sales.csv"
OUTPUT_PATH = "data/cleaned_sales.csv"

print("Reading data...")
df = pd.read_csv(INPUT_PATH)
print(f"Rows before cleaning: {len(df)}")

# Drop duplicate rows
df = df.drop_duplicates()

# Fill missing numeric values with the column median
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df[col].median())

# Fill missing text values with 'Unknown'
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna('Unknown')

print(f"Rows after cleaning: {len(df)}")
df.to_csv(OUTPUT_PATH, index=False)
print(f"Cleaned file saved to {OUTPUT_PATH}")
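If you want to sanity-check the cleaning logic before containerizing it, the same steps work on a small in-memory DataFrame. Here's a quick sketch with made-up data (not part of the project files):

```python
import pandas as pd

# Tiny illustrative dataset with a duplicate row and missing values
df = pd.DataFrame({
    "units": [10, None, 10, 40],
    "region": ["North", None, "North", "South"],
})

# Same steps as clean_data.py: dedupe, then fill numeric and text gaps
df = df.drop_duplicates()
df["units"] = df["units"].fillna(df["units"].median())
df["region"] = df["region"].fillna("Unknown")

print(df)
```

The duplicate row is dropped, the missing `units` value becomes the median of the remaining values, and the missing `region` becomes `Unknown`.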
// Pinning Dependencies
Pinning exact versions is important. Without it, pip install pandas might install different versions on different machines; pinned versions guarantee everyone gets the same behavior. You can capture the exact versions from a working environment with pip freeze > requirements.txt, or define them by hand like so:
pandas==2.2.0
openpyxl==3.1.2
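As a quick sanity check, you can compare what's actually installed against your pins at runtime. This snippet is illustrative (not part of the project) and uses only the standard library's importlib.metadata:

```python
from importlib import metadata

# The versions pinned in requirements.txt
pinned = {"pandas": "2.2.0", "openpyxl": "3.1.2"}

report = {}
for package, expected in pinned.items():
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None  # package missing from this environment
    report[package] = installed
    print(f"{package}: pinned {expected}, installed {installed}")
```

Inside the container this should always agree with requirements.txt; on your host machine it may not, which is exactly the problem Docker removes.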
// Defining the Dockerfile
This Dockerfile builds a minimal, cache-friendly image for the cleaning script:
# Use a slim Python 3.11 base image
FROM python:3.11-slim
# Set the working directory inside the container
WORKDIR /app
# Copy and install dependencies first (for layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the script into the container
COPY clean_data.py .
# Default command to run when the container starts
CMD ["python", "clean_data.py"]
There are a few things worth explaining here. We use python:3.11-slim instead of the full Python image because it's significantly smaller and strips out packages you don't need.
We copy requirements.txt before copying the rest of the code, and this is intentional. Docker builds images in layers and caches each one. If you only change clean_data.py, Docker won't reinstall your dependencies on the next build: it reuses the cached pip layer and jumps straight to copying your updated script. That small ordering decision can save you minutes of rebuild time.
// Building and Running
Build the image, then run the container with your local data folder mounted:
# Build the image and tag it
docker build -t data-cleaner .
# Run it, mounting your local data/ folder into the container
docker run --rm -v $(pwd)/data:/app/data data-cleaner
The -v $(pwd)/data:/app/data flag mounts your local data/ folder into the container at /app/data. This is how the script reads your CSV and how the cleaned output gets written back to your machine. Nothing is baked into the image and the data stays on your filesystem.
The --rm flag automatically removes the container after it finishes. Since this is a one-off script, there's no reason to keep a stopped container lying around.
# Serving a Machine Learning Model with FastAPI
You've trained a model and you want to make it available over HTTP so other services can send data and get predictions back. FastAPI works great for this: it's fast, lightweight, and handles input validation with Pydantic.
// Structuring the Project
The project separates the model artifact from the application code:
ml-api/
├── Dockerfile
├── requirements.txt
├── app.py
└── model.pkl
// Writing the App
The following app loads the model once at startup and exposes a /predict endpoint:
# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pickle

app = FastAPI(title="Sales Forecast API")

# Load the model once at startup
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictRequest(BaseModel):
    region: str
    month: int
    marketing_spend: float
    units_in_stock: int

class PredictResponse(BaseModel):
    region: str
    predicted_revenue: float

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    try:
        features = [[
            request.month,
            request.marketing_spend,
            request.units_in_stock,
        ]]
        prediction = model.predict(features)
        return PredictResponse(
            region=request.region,
            predicted_revenue=round(float(prediction[0]), 2),
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
The PredictRequest class does the input validation for you. If someone sends a request with a missing field or a string where a number is expected, FastAPI rejects it with a clear error message before your model code even runs. The model is loaded once at startup — not on every request — which keeps response times fast.
The /health endpoint is a small but important addition: Docker, load balancers, and cloud platforms use it to check whether your service is actually up and ready.
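You can see the validation behavior in isolation, without running the server at all, since Pydantic models are plain Python classes. A short sketch reusing the same PredictRequest definition:

```python
from pydantic import BaseModel, ValidationError

class PredictRequest(BaseModel):
    region: str
    month: int
    marketing_spend: float
    units_in_stock: int

# A request with a non-numeric month is rejected before any model code runs
try:
    PredictRequest(region="North", month="March",
                   marketing_spend=5000.0, units_in_stock=320)
    accepted = True
except ValidationError:
    accepted = False

print("accepted:", accepted)
```

FastAPI does this same check on every incoming request body and turns the ValidationError into a structured 422 response for the client.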
// Defining the Dockerfile
This Dockerfile bakes the model directly into the image so the container is fully self-contained:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the model and the app together
COPY model.pkl .
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
The model.pkl is baked into the image at build time. This means the container is completely self-contained, and you don't need to mount anything when you run it. The --host 0.0.0.0 flag tells Uvicorn to listen on all network interfaces inside the container, not just localhost. Without this, you won't be able to reach the API from outside the container.
// Building and Running
Build the image and start the API server:
docker build -t ml-api .
docker run --rm -p 8000:8000 ml-api
Test it with curl:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"region": "North", "month": 3, "marketing_spend": 5000.0, "units_in_stock": 320}'
# Building a Multi-Service Pipeline with Docker Compose
Real data projects rarely involve just one process. You might need a database, a script that loads data into it, and a dashboard that reads from it — all running together.
Docker Compose lets you define and run multiple containers as a single application. Each service has its own container, but they all share a private network so they can talk to each other.
// Structuring the Project
The pipeline splits each service into its own subdirectory:
pipeline/
├── docker-compose.yml
├── loader/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── load_data.py
└── dashboard/
    ├── Dockerfile
    ├── requirements.txt
    └── app.py
// Defining the Compose File
This Compose file declares all three services and wires them together with health checks and shared URL environment variables:
# docker-compose.yml
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: analytics
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin -d analytics"]
      interval: 5s
      retries: 5

  loader:
    build: ./loader
    depends_on:
      db:
        condition: service_healthy
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics

  dashboard:
    build: ./dashboard
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "8501:8501"
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics

volumes:
  pgdata:
// Writing the Loader Script
This script waits briefly for the database, then loads a CSV into the sales table using SQLAlchemy:
# loader/load_data.py
import pandas as pd
from sqlalchemy import create_engine
import os
import time
DATABASE_URL = os.environ["DATABASE_URL"]
# Give the DB a moment to be fully ready
time.sleep(3)
engine = create_engine(DATABASE_URL)
df = pd.read_csv("sales_data.csv")
df.to_sql("sales", engine, if_exists="replace", index=False)
print(f"Loaded {len(df)} rows into the sales table.")
Let’s take a closer look at the Compose file. Each service runs in its own container, but they're all on the same Docker-managed network, so they can reach each other using the service name as a hostname. The loader connects to db:5432 — and not localhost — because db is the service name, and Docker handles the DNS resolution automatically.
The healthcheck on the PostgreSQL service is important. depends_on alone only waits for the container to start, not for PostgreSQL to be ready to accept connections. The healthcheck uses pg_isready to confirm the database is actually up before the loader tries to connect. The pgdata volume persists the database between runs; stopping and restarting the pipeline won't wipe your data.
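The time.sleep(3) in the loader is a blunt safeguard on top of the health check. A retry loop is more robust: keep attempting the connection until it succeeds or you run out of attempts. Here's a sketch using a hypothetical wait_for helper that retries any callable:

```python
import time

def wait_for(check, retries=10, delay=2):
    """Call check() until it succeeds, sleeping between failed attempts."""
    for attempt in range(retries):
        try:
            return check()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the real error
            time.sleep(delay)

# In the loader you might use it like this (sketch):
#   engine = create_engine(DATABASE_URL)
#   wait_for(lambda: engine.connect().close())
```

This pattern works for any dependency, not just databases, and fails loudly with the underlying exception instead of silently proceeding against a service that never came up.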
// Starting Everything
Bring up all services with a single command:
docker compose up --build
To stop everything, run:
docker compose down
# Scheduling Jobs with a Cron Container
Sometimes you need a script to run on a schedule. Maybe it fetches data from an API every hour and writes it to a database or a file. You don't want to set up a full orchestration system like Airflow for something this simple. A cron container does the job cleanly.
// Structuring the Project
The project includes a crontab file alongside the script and Dockerfile:
data-fetcher/
├── Dockerfile
├── requirements.txt
├── fetch_data.py
└── crontab
// Writing the Fetch Script
This script uses Requests to hit an API endpoint and saves the results as a timestamped CSV:
# fetch_data.py
import requests
import pandas as pd
from datetime import datetime
import os
API_URL = "https://api.example.com/sales/latest"
OUTPUT_DIR = "/app/output"
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"[{datetime.now()}] Fetching data...")
response = requests.get(API_URL, timeout=10)
response.raise_for_status()
data = response.json()
df = pd.DataFrame(data["records"])
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
output_path = f"{OUTPUT_DIR}/sales_{timestamp}.csv"
df.to_csv(output_path, index=False)
print(f"[{datetime.now()}] Saved {len(df)} records to {output_path}")
// Defining the Crontab
The crontab schedules the script to run every hour and redirects all output to a log file:
# Run every hour, on the hour
0 * * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
Note the full path to the Python interpreter: cron runs jobs with a minimal PATH, so a bare python may not be found. The >> /var/log/fetch.log 2>&1 part redirects both standard output and error output to a log file. This is how you inspect what happened after the fact.
// Defining the Dockerfile
This Dockerfile installs cron, registers the schedule, and keeps it running in the foreground:
FROM python:3.11-slim
# Install cron
RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY fetch_data.py .
# Copy and register the crontab
COPY crontab /app/crontab
RUN crontab /app/crontab
# cron -f runs cron in the foreground, which is required for Docker
CMD ["cron", "-f"]
The cron -f flag is important here. Docker keeps a container alive as long as its main process is running. If cron ran in the background (its default), the main process would exit immediately and Docker would stop the container. The -f flag keeps cron running in the foreground so the container stays alive.
// Building and Running
Build the image and start the container in detached mode:
docker build -t data-fetcher .
docker run -d --name fetcher -v $(pwd)/output:/app/output data-fetcher
Check the logs any time:
docker exec fetcher cat /var/log/fetch.log
The output folder is mounted from your local machine, so the CSV files land on your filesystem even though the script runs inside the container.
# Wrapping Up
I hope you found this Docker article helpful. Docker doesn't have to be complicated. Start with the first example, swap in your own script and dependencies, and get comfortable with the build-run cycle. Once you've done that, the other patterns follow naturally. Docker is a good fit when:
- You need reproducible environments across machines or team members
- You're sharing scripts or models that have specific dependency requirements
- You're building multi-service systems that need to run together reliably
- You want to deploy anywhere without setup friction
That said, you don’t always need to use Docker for all of your Python work. It's probably overkill when:
- You're doing quick, exploratory analysis only for yourself
- Your script has no external dependencies beyond the standard library
- You're early in a project and your requirements are changing rapidly
If you're interested in going further, check out 5 Simple Steps to Mastering Docker for Data Science.
Happy coding!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.