Docker for Data Science
Docker is a tool that simplifies the installation process for software engineers. Coming from a statistics background, I used to care very little about how to install software, and would occasionally spend a few days trying to resolve system configuration issues. Enter the god-send Docker almighty.
Think of Docker as a lightweight virtual machine (I apologise to the Docker gurus for using that term). Generally, someone writes a *Dockerfile* that builds a *Docker image*, which contains most of the tools and libraries that you need for a project. You can use this as a base and add any other dependencies that your project requires. Its underlying philosophy is: if it works on my machine, it will work on yours.
What’s in it for Data Scientists?
- Time: The amount of time that you save on not installing packages in itself makes this framework worth it.
- Reproducible Research: I think of Docker as akin to setting the random number seed in a report. The same dependencies and library versions that were used on your machine are used on the other person's machine. This ensures that the analysis you are generating will run on any other analyst's machine.
- Distribution: Not only are you distributing your code, but you are also distributing the environment in which that code was run.
How Does it Work?
Docker employs the concept of (reusable) layers: each line that you write inside the Dockerfile is considered a layer. For example, you would usually start from a base image and then install python3 on top of it (as another layer). Essentially, for each project you write all of your pip install etc. commands into the Dockerfile instead of executing them locally.
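As a sketch, a minimal Dockerfile of this kind might look like the following (the ubuntu base image tag and the particular packages are assumptions for illustration, not a prescription):

```dockerfile
# Base layer: an official Ubuntu image pulled from Docker Hub
FROM ubuntu:20.04

# Each instruction below becomes its own cached, reusable layer
RUN apt-get update && apt-get install -y python3 python3-pip

# Project dependencies go into the image instead of your local machine
RUN pip3 install numpy pandas scikit-learn
```

Because layers are cached, rebuilding the image after changing only the last line reuses the earlier layers, which keeps iteration fast.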
I recommend reading the tutorial at https://docs.docker.com/get-started/ to get started with Docker. The learning curve is minimal (two days' work at most) and the gains are enormous.
Lastly, Docker Hub deserves a special mention. Personally, I think Docker Hub is what makes Docker truly powerful. It is to Docker what GitHub is to git: an open platform to share your Docker images. You can always build a Docker image locally using `docker build ...`, but it is always good to `push` that image to Docker Hub so that the next person simply has to `pull` it for personal use.
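The build–push–pull workflow described above looks roughly like this in practice (the `yourname/myproject` repository name is hypothetical, invented for illustration):

```shell
# Build an image from the Dockerfile in the current directory
docker build -t yourname/myproject:latest .

# Share it on Docker Hub (requires `docker login` first)
docker push yourname/myproject:latest

# On another machine, a collaborator just pulls and runs it
docker pull yourname/myproject:latest
docker run -it yourname/myproject:latest
```

The tag after the colon (here `latest`) lets you version your environment the same way you version your code.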
Personally, I have started including a Dockerfile in most if not all of my GitHub repos, especially since it means I never have to deal with installation issues.
Docker is one of the tools that every software engineer (and now every data scientist/analyst) should have in their repertoire, held in almost the same regard and respect as git. For a long time, statisticians and data scientists have ignored the software aspect of data analysis. Considering how simple and intuitive Docker has become to use, there really is no excuse for not making it part of your software development pipeline.
If you are after a more substantial tutorial than the quick tips provided above, see this video (jump to around 4:30):
Edit 2 (a quick note on virtualenv for Python, packrat for R, etc.):
Personally, I have not used these other environment-management tools; however, it should be noted that Docker is independent of Python and R, and goes beyond managing environments for specific programming languages.
If you are enjoying my tutorials/blog posts, consider supporting me on https://www.patreon.com/deepschoolio or by subscribing to my YouTube channel https://www.youtube.com/user/sachinabey (or both!). Oh, and clap! :)
Original. Reposted with permission.
- DeepSchool.io: Deep Learning Learning
- Data Science Deployments With Docker
- Jupyter+Spark+Mesos: An “Opinionated” Docker Image