Taming Complexity in MLOps

A greatly expanded v2.0 of the open-source Orbyter toolkit helps data science teams continue to streamline machine learning delivery pipelines, with an emphasis on seamless deployment to production.

By Sourav Dey, CTO at Manifold.ai.

Almost two years ago now, my colleague Alexander Ng shared our thinking around why the rise of Machine Learning Engineers (MLEs) meant we needed a DevOps approach to machine learning (ML), and why our team created an open-source tool to make life easier for other teams doing AI and data science. The 1.0 release of Orbyter (called Torus at the time) focused on helping to streamline ML pipelines by making it dead simple to spin up a fully configured, out-of-the-box, Dockerized local development setup for data science projects.

For the just-released version 2.0, we focused on reducing incidental complexity as we continued to add features that make your life easier.


Making Best Practices Turnkey


Orbyter 2.0 is designed to reduce incidental complexity. If you're not already familiar with the distinction between inherent and incidental complexity in software, David Hayes offers a great explanation:

All complexity in software can be accurately thought of as falling into one of the two sides of this dichotomy of inherent and incidental. Take everything that sucks about your software, subtract out anything that sucks because it's a hard problem, and you're left with the (molehill or mountain, I'll let you choose) of incidental complexity...

Data science and AI are full of difficult problems, so writing ML software often comes with significant inherent complexity. We can't really address that in a tool; hard problems are going to be hard. However, we can build the reduction of incidental complexity into the tool by being prescriptive about software engineering best practices, such as:

  • formatting code to the PEP 8 standard using Black
  • easy unit testing using pytest
  • centralized logging setup
  • adding hooks for continuous integration in GitHub
  • conforming to the data pipeline abstraction

For example, Orbyter 2.0 now lets you automatically format your code to the Python PEP 8 standard using included local scripts. It also has ready-to-go continuous integration with GitHub Actions, so every push to GitHub automatically triggers a lint check and the unit tests. A YAML file provides a single place for you to configure robust logging. In addition, the code is scaffolded to conform to an isolated input/output architecture that allows rapid iteration in a data pipeline. We have also prescribed some clearly defined endpoints (train, predict, evaluate) as Click command line scripts that can be called as a Docker entry point.
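As a rough sketch of how such endpoints might look, the following assumes the `click` package is installed and uses a toy "model" (the mean of the training values). The argument names, file formats, and the `LOGGING_CONFIG` dict are hypothetical stand-ins, not Orbyter's actual scaffold; in a real project the logging settings would be loaded from the YAML file instead of being defined inline. Each command reads its inputs from files and writes its outputs to files, which is one way to keep pipeline steps isolated from each other.

```python
import json
import logging
import logging.config
from pathlib import Path

import click

# Hypothetical logging settings. In a real project this dict would be the
# parsed content of the YAML logging file rather than an inline constant.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "%(asctime)s %(name)s %(levelname)s %(message)s"}
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "simple"}
    },
    "root": {"level": "INFO", "handlers": ["console"]},
}

logger = logging.getLogger(__name__)


@click.group()
def cli():
    """Shared entry point: configure logging once for every command."""
    logging.config.dictConfig(LOGGING_CONFIG)


@cli.command()
@click.argument("data_path", type=click.Path(exists=True, dir_okay=False))
@click.argument("model_path", type=click.Path(dir_okay=False))
def train(data_path, model_path):
    """Fit a toy model (the mean of the values) and save it as JSON."""
    values = json.loads(Path(data_path).read_text())
    model = {"mean": sum(values) / len(values)}
    Path(model_path).write_text(json.dumps(model))
    logger.info("Trained on %d rows; model saved to %s", len(values), model_path)


@cli.command()
@click.argument("model_path", type=click.Path(exists=True, dir_okay=False))
@click.argument("data_path", type=click.Path(exists=True, dir_okay=False))
@click.argument("output_path", type=click.Path(dir_okay=False))
def predict(model_path, data_path, output_path):
    """Write one prediction per input row to OUTPUT_PATH."""
    model = json.loads(Path(model_path).read_text())
    rows = json.loads(Path(data_path).read_text())
    predictions = [model["mean"]] * len(rows)
    Path(output_path).write_text(json.dumps(predictions))
    logger.info("Wrote %d predictions to %s", len(predictions), output_path)


@cli.command()
@click.argument("predictions_path", type=click.Path(exists=True, dir_okay=False))
@click.argument("data_path", type=click.Path(exists=True, dir_okay=False))
def evaluate(predictions_path, data_path):
    """Print the mean absolute error of the predictions."""
    predictions = json.loads(Path(predictions_path).read_text())
    actual = json.loads(Path(data_path).read_text())
    mae = sum(abs(p - a) for p, a in zip(predictions, actual)) / len(actual)
    click.echo(f"mae={mae:.4f}")


if __name__ == "__main__":
    cli()
```

Saved as a hypothetical `cli.py`, the steps could be chained as `python cli.py train data.json model.json`, then `python cli.py predict model.json data.json preds.json`, then `python cli.py evaluate preds.json data.json`. The same commands could serve as a Docker entry point.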

Sample project architecture.

Essentially, we have tried to make good software engineering discipline as turnkey as possible. Your future self will thank you.


Tracking Your Experiments


While many organizations now have robust data science teams that do important experiments and modeling R&D, those same companies often struggle to move the results of that work from the lab to the factory. No one wants to see their hard work languish in a silo; the thrill is seeing your models make an impact in the real world. But it takes a lot of engineering work to make that transition happen, and many organizations are still ill-equipped for production deployment.

To address that disconnect, we've added a new container with MLflow. Orbyter 1.0 was a fork of the popular Cookiecutter Data Science and offered a single container, which was great for easily setting up a Dockerized workflow. Orbyter 2.0 is still fundamentally Dockerized data science, but now when you start it up, you get three different containers: the first offers Jupyter, as before; the second offers Bash, for running scripts; and the third offers MLflow, for experiment tracking.

We believe experiment tracking is a vital part of applied ML, and we are big fans of using MLflow for that step. MLflow can also help with packaging code for reproducible runs and sending models to deployment.


Continuing to Lower the Bar to Entry


We started building Orbyter in the first place because we believe that Docker and containerization are the way forward. Alex explained it well in his initial post:

While this same consistency can be achieved with careful use of virtual environments and disciplined system-level configuration management, containers still provide a significant advantage in terms of spin up/down time for new environments and developer productivity. However, what we have heard repeatedly from the data science community is: I know Docker will make this easier, but I don't have the time or resources to set it up and figure it all out.

Orbyter was born to lower the bar for getting onto a Docker-first workflow. With the additional changes in Orbyter 2.0, making that move is simpler than ever, and it opens up a variety of cloud tools, including SageMaker, Kubernetes, and ECS. For example:

  • We used AWS Batch to build an experimentation-at-scale platform on AWS for a pharmaceutical customer. We built it quickly because everything already ran in Docker containers, and we tracked the experiments in MLflow. We would run jobs overnight and review the results in the morning.
  • We have developed sophisticated CI/CD pipelines for several customers that automatically bake production Docker images after a Git tag and push them to ECR, where they can easily be put into production.

Now that Orbyter itself is becoming more feature-rich, we're committed to keeping the onboarding as simple and straightforward as possible. We've created a demo repo that shows you how to use it, making publicly available a demo that we've presented, road-tested, and refined over the last year at the Strata Data Conference. You can find it here.

Keep an eye out for further updates as we continue to build out Orbyter and streamline the development process.