Creating A Simple Docker Data Science Image

This concise primer walks through setting up a Python data science environment with Docker, covering how to create a Dockerfile, build an image, run a container, share and deploy images, and push to Docker Hub.



Image created by Author with Midjourney

 

Why Docker for Data Science?

 

As a data scientist, having a standardized and portable environment for analysis and modeling is crucial. Docker provides an excellent way to create reusable and sharable data science environments. In this article, we'll walk through the steps to set up a basic data science environment using Docker.

Why consider Docker in the first place? Docker allows data scientists to create isolated and reproducible environments for their work. Some key advantages of using Docker include:

  • Consistency - The same environment can be replicated across different machines. No more "it works on my machine" issues.
  • Portability - Docker environments can easily be shared and deployed across multiple platforms.
  • Isolation - Containers isolate dependencies and libraries needed for different projects. No more conflicts!
  • Scalability - It's easy to scale an application built inside Docker by spinning up more containers.
  • Collaboration - Docker enables collaboration by allowing teams to share development environments.

 

Step 1: Creating a Dockerfile

 

The starting point for any Docker environment is the Dockerfile. This text file contains instructions for building the Docker image.

Let's create a basic Dockerfile for a Python data science environment and save it as 'Dockerfile' without an extension.

# Use the official slim Python image
FROM python:3.9-slim-buster

# Send Python output straight to the terminal without buffering
ENV PYTHONUNBUFFERED=1

# Install Python data science libraries
RUN pip install --no-cache-dir numpy pandas matplotlib scikit-learn jupyterlab

# Jupyter Lab listens on this port
EXPOSE 8888

# Run Jupyter Lab by default
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--no-browser"]

 

This Dockerfile starts from the official Python image and installs popular data science libraries on top of it. The final CMD line sets the container's default command: launching Jupyter Lab, bound to all network interfaces so it is reachable from outside the container.
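For tighter reproducibility, library versions can be pinned in a requirements.txt file (which should include jupyterlab alongside your pinned libraries) and copied into the image. A minimal sketch of that variant; the file name and layout are conventional, not required:

```dockerfile
# Use the official slim Python image
FROM python:3.9-slim-buster

ENV PYTHONUNBUFFERED=1

# Copy a pinned dependency list into the image and install from it,
# so rebuilding always yields the same library versions
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

CMD ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--no-browser"]
```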

 

Step 2: Building the Docker Image

 

Now we can build the image using the docker build command:

docker build -t ds-python .

 

This will create an image tagged ds-python based on our Dockerfile.

Building the image may take a few minutes while the dependencies are installed. Once complete, we can list our local Docker images with the docker images command.
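Putting the build and listing commands together, a sketch of the full sequence (the version tag 0.1 is illustrative):

```shell
# Name of the image; adjust to suit your project
IMAGE=ds-python

# Build from the Dockerfile in the current directory, tagging the
# result both with an explicit version and as "latest"
docker build -t "$IMAGE:0.1" -t "$IMAGE:latest" .

# List the locally stored images matching that name
docker images "$IMAGE"
```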

 

Step 3: Running a Container

 

With the image built, we can now launch a container:

docker run -p 8888:8888 ds-python

 

This will start a Jupyter Lab instance and map port 8888 on the host to 8888 in the container.

We can now navigate to localhost:8888 in a browser to access Jupyter Lab and start running notebooks. Note that Jupyter generates an access token at startup; it is printed in the container's output, and you will need it to log in.
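Two practical notes on running the container, sketched below with an illustrative container name (ds-lab) and mount point (/work): the Jupyter login URL, including its access token, goes to the container's logs, and notebooks created inside the container vanish when it is removed unless a host directory is mounted.

```shell
# Run detached with a name, publish port 8888, and mount the current
# host directory into the container so notebooks persist
docker run -d --name ds-lab -p 8888:8888 -v "$(pwd)":/work ds-python

# The login URL (with its access token) appears in the container logs
docker logs ds-lab
```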

 

Step 4: Sharing and Deploying the Image

 

A key benefit of Docker is the ability to share and deploy images across environments.

To save an image to a tar archive, run:

docker save -o ds-python.tar ds-python

 

This tarball can then be loaded on any other system with Docker installed via:

docker load -i ds-python.tar

 

We can also push images to a Docker registry like Docker Hub to share with others publicly or privately within an organization.

To push the image to Docker Hub:

  1. Create a Docker Hub account if you don't already have one
  2. Log in to Docker Hub from the command line using docker login
  3. Tag the image with your Docker Hub username: docker tag ds-python yourusername/ds-python
  4. Push the image: docker push yourusername/ds-python

The ds-python image is now hosted on Docker Hub. Other users can pull the image by running:

docker pull yourusername/ds-python
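The push steps above can be collected into a short script. DOCKERHUB_USER is a placeholder; substitute your own account name:

```shell
# Your Docker Hub username (placeholder value)
DOCKERHUB_USER=yourusername

# Authenticate, retag the local image under your namespace, and push it
docker login
docker tag ds-python "$DOCKERHUB_USER/ds-python"
docker push "$DOCKERHUB_USER/ds-python"
```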

 

For private repositories, you can create an organization and add users. This allows you to share Docker images securely within teams.

 

Step 5: Loading and Running the Image

 

To load and run the Docker image on another system:

  1. Copy over the ds-python.tar file to the new system
  2. Load the image using docker load -i ds-python.tar
  3. Start a container using docker run -p 8888:8888 ds-python
  4. Access Jupyter Lab at localhost:8888

That's it! The ds-python image is now ready to use on the new system.
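On the receiving machine, the steps above amount to two commands once the tarball has been copied over:

```shell
# Load the image from the transferred tarball
docker load -i ds-python.tar

# Start a container, publishing Jupyter's port to the host;
# Jupyter Lab is then reachable at http://localhost:8888
docker run -p 8888:8888 ds-python
```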

 

Final Thoughts

 

This gives you a quick primer on setting up a reproducible data science environment with Docker. Some additional best practices to consider:

  • Use smaller base images like Python slim to optimize image size
  • Leverage Docker volumes for data persistence and sharing
  • Follow security principles like avoiding running containers as root
  • Use Docker Compose for defining and running multi-container applications

I hope you find this intro helpful. Docker enables tons of possibilities for streamlining and scaling data science workflows.

 
 
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.