How Docker Can Help You Become A More Effective Data Scientist
I wrote this quick primer so you don’t have to parse all the information out there and instead can learn the things you need to know to quickly get started.
Build your Docker image
Phew, that was a lot of information about Dockerfiles. Don’t worry, everything else is fairly straightforward from here. Now that we have created our recipe in the form of a Dockerfile, it’s time to build an image. You can accomplish this with the following command:
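A minimal build command looks like this (the image name `hamelsmu/tutorial:v1` is just an example; substitute your own):

```shell
# Build an image from the Dockerfile in the current directory.
# -t tags the image with a name and an optional :tag.
docker build -t hamelsmu/tutorial:v1 .
```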
This will build a Docker image (not a container; revisit the terminology at the beginning of this post if you don’t remember the difference!), which you can then run at a later time.
Create and run a container from your Docker image
Now you are ready to put all this magic to work. We can bring up this environment by executing the following command:
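A run command along these lines will start the container (the flags and names shown are illustrative; the port should match whatever your Dockerfile exposes for Jupyter):

```shell
# Create and start a container named "container1" from the image.
# -p maps port 7745 in the container to port 7745 on the host,
# so the Jupyter server is reachable at http://localhost:7745/.
docker run -d --name container1 -p 7745:7745 hamelsmu/tutorial:v1
```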
After you run this, your container will be up and running! The Jupyter server starts automatically because of the CMD instruction at the end of the Dockerfile. Now you should be able to access your Jupyter notebook on the port it is serving on — in this example it should be accessible from http://localhost:7745/ with the password tutorial. If you are running this Docker container remotely, you will have to set up local port forwarding so that you can access the Jupyter server from your browser.
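A typical way to set up that port forwarding is an SSH tunnel (the username and hostname below are placeholders):

```shell
# Forward local port 7745 to port 7745 on the remote machine,
# then browse to http://localhost:7745/ as usual.
ssh -N -L 7745:localhost:7745 username@remote-host
```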
Interacting With Your Container
Once your container is up and running, these commands will come in handy:
- Attach a new terminal session to a container. This is useful if you need to install some software or use the shell.
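For example, to get a bash shell inside a running container (the container name is an example):

```shell
# Start an interactive bash session in the running container "container1"
docker exec -it container1 /bin/bash
```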
- Save the state of your container as a new image. Even though you started out with a Dockerfile with all the libraries you wanted to install, over time you may significantly change the state of the container by adding more libraries and packages interactively. It is useful to save the state of your container as an image that you can later share or layer on top of. You can accomplish this by using the docker commit CLI command:
For example, if I wanted to save the state of the container called container1 as an image called hamelsmu/tutorial:v2, I would simply run this command:
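Based on that description, the command is:

```shell
# Save the current state of container1 as the image hamelsmu/tutorial:v2
docker commit container1 hamelsmu/tutorial:v2
```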
You might be wondering why hamelsmu/ is in front of the image name — this just makes it easier to push this container to DockerHub later on, as hamelsmu is my DockerHub username (more about this later). If you are using Docker at work, it is likely that there is an internal private Docker repo that you can push your Docker images to instead.
- List running containers. I often use this when I have forgotten the name of the container that is currently running.
If you run the above command without the status=running flag, then you will see a list of all the containers (even if they are no longer running) on your system. This can be useful for tracking down an old container.
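The corresponding commands (the `--filter` flag carries the `status=running` flag mentioned above):

```shell
# List only running containers
docker ps -a --filter status=running

# List all containers, including stopped ones
docker ps -a
```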
- List all images that you have saved locally.
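This is simply:

```shell
# List all images stored locally
docker images
```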
- Push your image to DockerHub (or another registry). This is useful if you want to share your work with others, or conveniently save an image in the cloud. Be careful that you do not share any private information while doing this (there are private repos available on DockerHub, too).
First create a DockerHub repository and name your image appropriately, as described here. This will involve running the docker login command to connect to your account on DockerHub or another registry. For example, to push an image to this repository, I first have to name my local image hamelsmu/tutorial (I can choose any tag name). Then the CLI command:
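Based on the naming above, the push looks like:

```shell
# Authenticate with DockerHub (or another registry)
docker login

# Push the local image hamelsmu/tutorial:v2 to the registry
docker push hamelsmu/tutorial:v2
```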
Pushes the aforementioned Docker image to this repository with the tag v2. It should be noted that if you make your image publicly available, others can simply layer on top of your image just like we added layers to the ubuntu image in this tutorial. This is quite useful for other people seeking to reproduce or extend your research.
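For instance, a colleague could extend the published image with a Dockerfile of their own (the added package below is just a placeholder):

```dockerfile
# Build on top of the published image instead of starting from ubuntu
FROM hamelsmu/tutorial:v2

# Add whatever extra dependencies the new project needs (placeholder)
RUN pip install pandas
```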
Now You Have Superpowers
Now that you know how to operate Docker, you can perform the following tasks:
- Share reproducible research with colleagues and friends.
- Win Kaggle competitions without going broke, by moving your code to larger compute environments temporarily as needed.
- Prototype locally inside a Docker container on your laptop, then seamlessly move the same computation to a server, taking many of the things you love about your local environment with you (your aliases, vim plugins, bash scripts, customized prompts, etc.).
- Quickly instantiate all the dependencies required to run TensorFlow, PyTorch, or other deep learning libraries on a GPU computer using Nvidia-Docker (which can be painful if you are doing this from scratch). See the bonus section below for more info.
- Publish your models as applications, for example as a rest api that serves predictions from a docker container. When your application is Dockerized, it can be trivially replicated as many times as needed.
We only scratched the surface of Docker, and there is so much more you can do. I focused on the areas of Docker that I think you will encounter most often as a Data Scientist and hopefully gave you enough confidence to start using it. Below are some resources that helped me on my Docker journey:
- Helpful Docker commands
- More helpful Docker commands
- Dockerfile reference
- How to create and push to a repository on DockerHub
The original motivation for me to learn Docker in the first place was to prototype deep learning models on a single GPU and move computation to AWS once I needed more horsepower. I was also taking the excellent course Fast.AI by Jeremy Howard and wanted to share prototypes with others.
However, to properly encapsulate all the dependencies, such as drivers for Nvidia GPUs, you will need to use Nvidia-Docker instead of plain Docker. This requires a little more work than vanilla Docker but is straightforward once you understand Docker.
I have placed my Nvidia-Docker setup in this repo, and will leave this as an exercise for the reader.
Get In Touch!
Bio: Hamel Husain is a Data Scientist with over 10 years of experience who specializes in Machine Learning. Hamel has worked at Airbnb, DataRobot, AlixPartners and Accenture, and has successfully built predictive models at scale at many companies. His favorite data science tools include Keras, Sklearn, Vowpal Wabbit, H2O and DataRobot. He currently lives in San Francisco, CA and can be found on Linkedin (here) and Twitter (here).
Original. Reposted with permission.