Torus for Docker-First Data Science
To help data science teams adopt Docker and apply DevOps best practices to streamline machine learning delivery pipelines, we open-sourced a toolkit based on the popular cookiecutter project structure.
By Alexander Ng, Manifold.ai
As interest in Artificial Intelligence (AI), and specifically Machine Learning (ML), grows and more engineers enter this popular field, the lack of de facto standards and frameworks for how work should be done is becoming more apparent. A new focus on optimizing the ML delivery pipeline is starting to gain momentum.
Data scientists are becoming more involved in the delivery pipeline of products, and ensuring that their work survives the delivery process is a non-trivial task. Of course, this isn’t a new problem: in the past, traditional software development teams would throw their work “over the wall” to the operations team to serve in production with little to no context. A community effort to solve the inevitable mess resulted in what we now think of as DevOps, removing the wall between development and operations to drive increased efficiency and improve product quality. New tools and processes for implementing streamlined delivery pipelines now help teams guarantee development/production parity.
Now, the same problem has reared its head in the ML space, and it is only getting worse as demand for AI products continues to grow. There is a new wall that is killing productivity. How does DevOps change with the rise of data science teams in engineering organizations? The pain points we are seeing in the community today feel familiar, but some aspects of them are unique to ML development.
Rise of the Machine Learning Engineer
Simply put, solutions in the DevOps space provide tools for people working at the intersection of development and operations. Similarly, there needs to be a toolkit for the person working at the intersection of data science and software engineering. We call that person a Machine Learning Engineer (MLE). At a high level, MLEs have the same set of challenges as any software engineer working in a product development team:
- Standardized local development environments
- Development vs. production environment parity
- Standardized packaging and deployment pipelines
In addition, certain aspects of the ML development workflow provide a different set of challenges to MLEs:
- Easily sharing development environments and intermediate results for conducting reproducible experiments
- Coordinating isolated project environments running multiple notebook servers
- Easily allowing for vertical and horizontal scaling to handle large datasets or leverage additional compute resources (e.g., for deep learning, optimization, etc.)
By looking through an ML development lens with the DevOps mentality, we can identify several new areas along the delivery path that need improving. There is an opportunity to build new tools and best practices that specifically empower the MLE community to deliver more robust solutions in a shorter amount of time.
Whatever the MLE toolkit ends up including, one thing we are very confident about is: Docker will play a major role in the ML development lifecycle standard.
Docker-First Data Science
By moving to a Docker-first workflow, MLEs gain significant downstream advantages in the development lifecycle: easy vertical and horizontal scaling for running workloads on large datasets, and simpler deployment and delivery of models and prediction engines. Containers launched from Docker images provide an easy way to guarantee a consistent runtime environment across developer laptops, remote compute clusters, and production environments.
While this same consistency can be achieved with careful use of virtual environments and disciplined system-level configuration management, containers still provide a significant advantage in terms of spin up/down time for new environments and developer productivity. However, what we have heard repeatedly from the data science community is: “I know Docker will make this easier, but I don’t have the time or resources to set it up and figure it all out.”
At Manifold, we developed internal tools for easily spinning up Docker-based development environments for machine learning projects. In order to help other data science teams adopt Docker and apply DevOps best practices to streamline machine learning delivery pipelines, we open-sourced our evolving toolkit. We wanted to make it dead simple for teams to spin up new ready-to-go development environments and move to a Docker-first workflow.
How Does Torus Work?
The Torus 1.0 package contains a Dockerized Cookiecutter for Data Science (a fork of the popular cookiecutter-data-science) and an ML Development Base Docker Image. Using the project cookiecutter and Docker image together, you can go from a cold start to a new project working in a Jupyter Notebook with all of the common libraries available in under five minutes (without having to pip install anything).
After instantiating a new project with the cookiecutter template and running a single start command, your local development setup will look like this:
[Figure: Fully configured out-of-the-box Dockerized local development setup for data science projects.]
Let’s dive a little deeper into what’s happening here:
1. The ML base development image is pulled down to your local machine from Docker Hub. It comes with many of the commonly used data science and ML libraries pre-installed, along with a Jupyter Notebook server with useful extensions installed and configured.
2. A container is launched with the base image, and is configured to mount your top-level project directory as a shared volume on the container. This lets you use your preferred IDE on your host machine to modify code and see changes reflected immediately in the runtime environment.
3. Port forwarding is set up so you can use a browser on your host machine to work with the notebook server running inside the container. An appropriate host port to forward is dynamically chosen, so no worries about port conflicts (e.g., other notebook servers, databases, or anything else running on your laptop).
4. The project is scaffolded with its own Dockerfile, so you can install any project-specific packages or libraries and share your environment with the team via source control.
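Steps 2 and 3 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual Torus start script: the helper names `find_free_port` and `build_run_command`, the image name `example/ml-dev-base:latest`, and the `/mnt` mount point are all hypothetical placeholders. The sketch asks the operating system for an unused host port, then assembles a `docker run` command that mounts the project directory as a volume and forwards that port to the notebook server (8888 is Jupyter's default port).

```python
import socket
from pathlib import Path


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # The OS assigned a free port; read it back before closing the socket.
        return sock.getsockname()[1]


def build_run_command(project_dir, image, host_port,
                      container_port=8888, mount_point="/mnt"):
    """Assemble (but do not execute) a `docker run` command as an argv list.

    `image` and `mount_point` are placeholders chosen for this sketch.
    """
    project_path = Path(project_dir).resolve()
    return [
        "docker", "run", "--rm", "-d",
        # Share the project directory so host-side edits are visible inside.
        "-v", f"{project_path}:{mount_point}",
        # Forward the chosen host port to the notebook server's port.
        "-p", f"{host_port}:{container_port}",
        image,
    ]


if __name__ == "__main__":
    port = find_free_port()
    command = build_run_command(".", "example/ml-dev-base:latest", port)
    print(" ".join(command))
```

One caveat with this approach: the port is released before Docker binds it, so another process could in principle claim it in between. A real start script would need to retry on failure or let Docker choose the host port itself.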
You can use your favorite browser and IDE locally as you normally would to do your work, while your runtime environment is 100% consistent across your team. If you are working on multiple projects on your machine, rest assured that each project is running in its own cleanly isolated container.
Lay the Foundation
There is a lot of exciting activity going on in the MLE toolkit space, and it’s easy to forget that, before even considering a higher-order platform or framework, you need to make sure your team is set up for success. The ML delivery pipeline needs what the DevOps movement did for software engineering. Moving to a Docker-first development workflow is a great first step in making life easier for everyone involved with the delivery pipeline, and that includes your customers.
Bio: Alexander Ng is a Senior Data Engineer at Manifold, an AI product studio. His previous work includes a stint as engineer and technical lead doing DevOps at Kyruus, as well as engineering work for the Navy. He holds a BS degree from Boston University in Electrical Engineering.