10 Essential Docker Commands for Data Engineering

Tired of 'it works on my machine' problems? Learn the top 10 Docker commands every data engineer needs to build, deploy, and scale projects like a pro!



Image by Author | Canva

 

Docker is a tool that helps data engineers package, distribute, and run applications in a consistent environment. Instead of manually installing everything (and praying it works everywhere), you wrap your entire project (code, tools, and dependencies) into lightweight, portable, self-sufficient environments called containers. These containers can run your code anywhere: on your laptop, a server, or the cloud. For example, if your project needs Python, Spark, and a bunch of specific libraries, instead of installing them on every machine you can just spin up a Docker container with everything pre-configured. Share it with your team, and they'll have the exact same setup running in no time.

Before we discuss the essential commands, let's go over some key Docker terminology to make sure we're all on the same page.

  • Docker Image: A snapshot of an environment with all dependencies installed.
  • Docker Container: A running instance of a Docker image.
  • Dockerfile: A script that defines how a Docker image should be built.
  • Docker Hub: A public registry where you can find and share Docker images.

Before using Docker, you'll need to install:

  • Docker Desktop: Download and install it from Docker's official website. You can check that it is installed correctly by running the following command (a fuller smoke test follows this list):

docker --version

  • Visual Studio Code: Install it from the official Visual Studio Code website and add the Docker extension for easier container and image management.
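
If you want a fuller end-to-end check than just the version number, Docker's own hello-world image is a common smoke test: it pulls a tiny image and runs a container that prints a confirmation message.

docker run hello-world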

Here are the essential Docker commands that every data engineer should know:

 

1. docker run

 
What It Does: Creates and starts a container from an image.

docker run -d --name postgres_db -e POSTGRES_PASSWORD=secret -p 5432:5432 -v pgdata:/var/lib/postgresql/data postgres:15

 
Why It’s Important: Data engineers frequently launch databases, processing engines, or API services. The docker run command’s flags are critical:

  • -d: Runs the container in the background (so your terminal isn't locked).
  • --name: Names your container. Stop guessing which random ID is your Postgres instance.
  • -e: Sets environment variables (like passwords or configs).
  • -p: Maps ports (e.g., exposing PostgreSQL's port 5432 to the host).
  • -v: Mounts volumes to persist data beyond the container's lifecycle.

Without volumes, database data would vanish when the container stops—a disaster for production pipelines.
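
As a quick follow-up sketch (assuming the postgres_db container from the example above), you can confirm the container is running and see that the named volume keeps your data across restarts:

docker ps --filter name=postgres_db   # Confirm the container is up and check its port mapping
docker stop postgres_db && docker start postgres_db   # Data in the pgdata volume survives the restart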

 

2. docker build

 
What It Does: Turns your Dockerfile into a reusable image.

# Dockerfile
FROM python:3.9-slim
RUN pip install pandas numpy apache-airflow

 

docker build -t custom_airflow:latest .

 
Why It’s Important: Data engineers often need custom images preloaded with tools like Airflow, PySpark, or machine learning libraries. The docker build command ensures teams use identical environments, eliminating "works on my machine" issues.
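
As a quick, throwaway check that the build produced what you expect (using the custom_airflow tag from the example above), you can run a one-off container from the new image:

docker run --rm custom_airflow:latest python -c "import pandas, numpy; print(pandas.__version__)"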

 

3. docker exec

 
What It Does: Executes a command inside a running container.

docker exec -it postgres_db psql -U postgres  # Access PostgreSQL shell

 
Why It's Important: Data engineers use this to inspect databases, run ad-hoc queries, or test scripts without restarting containers. The -it flags let you type commands interactively (without them, the command runs non-interactively and you can't type into the container's shell).
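
docker exec is also handy for one-off, non-interactive commands, for example running a single SQL statement against the same container:

docker exec postgres_db psql -U postgres -c "SELECT version();"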

 

4. docker logs

 
What It Does: Displays logs from a container.

docker logs --tail 100 -f airflow_scheduler  # Stream the last 100 log lines and follow new output

 
Why It’s Important: Debugging failed tasks (e.g., Airflow DAGs or Spark jobs) relies on logs. The -f flag streams logs in real-time, helping diagnose runtime issues.
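
When a container has been running for days, you rarely want its full history. docker logs can also filter by time and add timestamps:

docker logs --since 30m --timestamps airflow_scheduler  # Only logs from the last 30 minutes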

 

5. docker stats

 
What It Does: Live dashboard for CPU, memory, and network usage of containers.

docker stats postgres_db spark_master

 
Why It’s Important: Efficient resource monitoring is important for maintaining optimal performance in data pipelines. For example, if a data pipeline experiences slow processing, checking docker stats can reveal whether PostgreSQL is overutilizing CPU resources or if a Spark worker is consuming excessive memory, allowing for timely optimization.
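
For scripts or quick snapshots (rather than a live dashboard), --no-stream prints a single reading, and --format keeps only the columns you care about:

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" postgres_db spark_master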

 

6. docker-compose up

 
What It Does: Starts multi-container applications using a docker-compose.yml file.

# docker-compose.yml
services:
  airflow:
    image: apache/airflow:2.6.0
    ports:
      - "8080:8080"
  postgres:
    image: postgres:14
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:   # Named volumes must be declared at the top level

 

docker-compose up -d

 
Why It's Important: Data pipelines often involve interconnected services (e.g., Airflow + PostgreSQL + Redis). Compose simplifies defining and managing these dependencies in a single declarative file, so you don't have to run 10 commands manually. The -d flag runs the containers in the background (detached mode). On newer Docker installations, the same command is also available as docker compose up (with a space instead of a hyphen).
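
The companion command tears the whole stack back down; adding -v also removes the named volumes if you want a completely clean slate:

docker-compose down        # Stop and remove the services defined in docker-compose.yml
docker-compose down -v     # Same, but also removes named volumes (this deletes the Postgres data!)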

 

7. docker volume

 
What It Does: Manages persistent storage for containers.

docker volume create etl_data
docker run -v etl_data:/data -d my_etl_tool

 
Why It’s Important: Volumes preserve critical data (e.g., CSV files, database tables) even if containers crash. They’re also used to share data between containers (e.g., Spark and Hadoop).
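
Two related subcommands are useful for checking what you actually have on disk:

docker volume ls                 # List all volumes
docker volume inspect etl_data   # Show the volume's mountpoint on the host and other metadata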

 

8. docker pull

 
What It Does: Downloads an image from Docker Hub (or another registry).

docker pull apache/spark:3.4.1  # Pre-built Spark image

 
Why It’s Important: Pre-built images save hours of setup time. Official images for tools like Spark, Kafka, or Jupyter are regularly updated and optimized.
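
After pulling, you can confirm which images and tags are available locally; pinning an explicit tag such as 3.4.1 (rather than relying on latest) also keeps pipelines reproducible:

docker images apache/spark   # List local apache/spark images and their tags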

 

9. docker stop / docker rm

 
What It Does: Stops and removes containers.

docker stop airflow_worker && docker rm airflow_worker  # Cleanup

 
Why It’s Important: Data engineers test pipelines iteratively. Stopping and removing old containers prevents resource leaks and keeps environments clean.
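
Two shortcuts worth knowing: docker rm -f stops and removes a container in one step, and docker container prune clears out every stopped container at once:

docker rm -f airflow_worker    # Force-stop and remove in a single command
docker container prune -f      # Remove all stopped containers without a confirmation prompt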

 

10. docker system prune

 
What It Does: Cleans up unused containers, images, and volumes to free resources.

docker system prune -a --volumes

 
Why It's Important: Over time, Docker environments accumulate unused images, stopped containers, and dangling volumes (volumes no longer associated with any container), which eat disk space and slow down performance. This command can reclaim gigabytes after weeks of testing (see the docker system df check after the flag list to preview what is using the space).

  • -a: Removes all unused images.
  • --volumes: Deletes volumes too (careful: this can delete data!).
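
Before pruning, it's worth checking where the space is actually going:

docker system df   # Disk usage broken down by images, containers, local volumes, and build cache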

Mastering these Docker commands empowers data engineers to deploy reproducible pipelines, streamline collaboration, and troubleshoot effectively. Do you have a favorite Docker command that you use in your daily workflow? Let us know in the comments!
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.

