10 Essential Docker Commands for Data Engineering

Tired of 'it works on my machine' problems? Learn the top 10 Docker commands every data engineer needs to build, deploy, and scale projects like a pro!



Image by Author | Canva

 

Docker is a tool that helps data engineers package, distribute, and run applications in a consistent environment. Instead of manually installing everything (and praying it works everywhere), you wrap your entire project (code, tools, and dependencies) into lightweight, portable, self-sufficient environments called containers. These containers can run your code anywhere: on your laptop, a server, or the cloud. For example, if your project needs Python, Spark, and a bunch of specific libraries, instead of installing them on every machine you can just spin up a Docker container with everything pre-configured. Share it with your team, and they'll have the exact same setup running in no time.

Before we discuss the essential commands, let's go over some key Docker terminology to make sure we're all on the same page.

  • Docker Image: A snapshot of an environment with all dependencies installed.
  • Docker Container: A running instance of a Docker image.
  • Dockerfile: A script that defines how a Docker image should be built.
  • Docker Hub: A public registry where you can find and share Docker images.

Before using Docker, you'll need to install:

  • Docker Desktop: Download and install it from Docker's official website. You can check that it is installed correctly by running the following command (a fuller smoke test follows this list):

docker --version

  • Visual Studio Code: Install it from the official Visual Studio Code website and add the Docker extension for easier container and image management.
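
If you want a fuller end-to-end check than just the version number, Docker's own hello-world image is a common smoke test: it pulls a tiny image and runs a container that prints a confirmation message.

docker run hello-world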

Here are the essential Docker commands that every data engineer should know:

 

1. docker run

 
What It Does: Creates and starts a container from an image.

docker run -d --name postgres_db -e POSTGRES_PASSWORD=secret -p 5432:5432 -v pgdata:/var/lib/postgresql/data postgres:15

 
Why It’s Important: Data engineers frequently launch databases, processing engines, or API services. The docker run command’s flags are critical:

  • -d: Runs the container in the background (so your terminal isn't locked).
  • --name: Names your container. Stop guessing which random ID is your Postgres instance.
  • -e: Sets environment variables (like passwords or configs).
  • -p: Maps ports (e.g., exposing PostgreSQL's port 5432 to the host).
  • -v: Mounts volumes to persist data beyond the container's lifecycle.

Without volumes, database data would vanish when the container stops—a disaster for production pipelines.
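
As a quick follow-up sketch (assuming the postgres_db container from the example above), you can confirm the container is running and see that the named volume keeps your data across restarts:

docker ps --filter name=postgres_db   # Confirm the container is up and check its port mapping
docker stop postgres_db && docker start postgres_db   # Data in the pgdata volume survives the restart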

 

2. docker build

 
What It Does: Turns your Dockerfile into a reusable image.

# Dockerfile
FROM python:3.9-slim
RUN pip install pandas numpy apache-airflow

 

docker build -t custom_airflow:latest .

 
Why It’s Important: Data engineers often need custom images preloaded with tools like Airflow, PySpark, or machine learning libraries. The docker build command ensures teams use identical environments, eliminating "works on my machine" issues.
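
As a quick, throwaway check that the build produced what you expect (using the custom_airflow tag from the example above), you can run a one-off container from the new image:

docker run --rm custom_airflow:latest python -c "import pandas, numpy; print(pandas.__version__)"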

 

3. docker exec

 
What It Does: Executes a command inside a running container.

docker exec -it postgres_db psql -U postgres  # Access PostgreSQL shell

 
Why It's Important: Data engineers use this to inspect databases, run ad-hoc queries, or test scripts without restarting containers. The -it flags let you type commands interactively (without them, the command runs non-interactively and you can't type into the container's shell).
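
docker exec is also handy for one-off, non-interactive commands, for example running a single SQL statement against the same container:

docker exec postgres_db psql -U postgres -c "SELECT version();"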

 

4. docker logs

 
What It Does: Displays logs from a container.

docker logs --tail 100 -f airflow_scheduler  # Stream the last 100 log lines and follow new output

 
Why It’s Important: Debugging failed tasks (e.g., Airflow DAGs or Spark jobs) relies on logs. The -f flag streams logs in real-time, helping diagnose runtime issues.
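
When a container has been running for days, you rarely want its full history. docker logs can also filter by time and add timestamps:

docker logs --since 30m --timestamps airflow_scheduler  # Only logs from the last 30 minutes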

 

5. docker stats

 
What It Does: Live dashboard for CPU, memory, and network usage of containers.

docker stats postgres_db spark_master

 
Why It’s Important: Efficient resource monitoring is important for maintaining optimal performance in data pipelines. For example, if a data pipeline experiences slow processing, checking docker stats can reveal whether PostgreSQL is overutilizing CPU resources or if a Spark worker is consuming excessive memory, allowing for timely optimization.
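
For scripts or quick snapshots (rather than a live dashboard), --no-stream prints a single reading, and --format keeps only the columns you care about:

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" postgres_db spark_master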

 

6. docker-compose up

 
What It Does: Starts multi-container applications using a docker-compose.yml file.

# docker-compose.yml
services:
  airflow:
    image: apache/airflow:2.6.0
    ports:
      - "8080:8080"
  postgres:
    image: postgres:14
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:   # Named volumes must be declared at the top level

 

docker-compose up -d

 
Why It's Important: Data pipelines often involve interconnected services (e.g., Airflow + PostgreSQL + Redis). Compose simplifies defining and managing these dependencies in a single declarative file, so you don't have to run 10 commands manually. The -d flag runs the containers in the background (detached mode). On newer Docker installations, the same command is also available as docker compose up (with a space instead of a hyphen).
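
The companion command tears the whole stack back down; adding -v also removes the named volumes if you want a completely clean slate:

docker-compose down        # Stop and remove the services defined in docker-compose.yml
docker-compose down -v     # Same, but also removes named volumes (this deletes the Postgres data!)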

 

7. docker volume

 
What It Does: Manages persistent storage for containers.

docker volume create etl_data
docker run -v etl_data:/data -d my_etl_tool

 
Why It’s Important: Volumes preserve critical data (e.g., CSV files, database tables) even if containers crash. They’re also used to share data between containers (e.g., Spark and Hadoop).
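
Two related subcommands are useful for checking what you actually have on disk:

docker volume ls                 # List all volumes
docker volume inspect etl_data   # Show the volume's mountpoint on the host and other metadata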

 

8. docker pull

 
What It Does: Downloads an image from Docker Hub (or another registry).

docker pull apache/spark:3.4.1  # Pre-built Spark image

 
Why It’s Important: Pre-built images save hours of setup time. Official images for tools like Spark, Kafka, or Jupyter are regularly updated and optimized.
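
After pulling, you can confirm which images and tags are available locally; pinning an explicit tag such as 3.4.1 (rather than relying on latest) also keeps pipelines reproducible:

docker images apache/spark   # List local apache/spark images and their tags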

 

9. docker stop / docker rm

 
What It Does: Stops and removes containers.

docker stop airflow_worker && docker rm airflow_worker  # Cleanup

 
Why It’s Important: Data engineers test pipelines iteratively. Stopping and removing old containers prevents resource leaks and keeps environments clean.
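
Two shortcuts worth knowing: docker rm -f stops and removes a container in one step, and docker container prune clears out every stopped container at once:

docker rm -f airflow_worker    # Force-stop and remove in a single command
docker container prune -f      # Remove all stopped containers without a confirmation prompt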

 

10. docker system prune

 
What It Does: Cleans up unused containers, images, and volumes to free resources.

docker system prune -a --volumes

 
Why It's Important: Over time, Docker environments accumulate unused images, stopped containers, and dangling volumes (volumes no longer associated with any container), which eat disk space and slow down performance. This command can reclaim gigabytes after weeks of testing (see the docker system df check after the flag list to preview what is using the space).

  • -a: Removes all unused images.
  • --volumes: Deletes volumes too (careful: this can delete data!).
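
Before pruning, it's worth checking where the space is actually going:

docker system df   # Disk usage broken down by images, containers, local volumes, and build cache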

Mastering these Docker commands empowers data engineers to deploy reproducible pipelines, streamline collaboration, and troubleshoot effectively. Do you have a favorite Docker command that you use in your daily workflow? Let us know in the comments!
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.

