Adventures in MLOps with Github Actions, Label Studio and NBDEV

This article documents the authors' experience building their custom MLOps approach.

By Aaron Soellinger & Will Kunz

When designing the MLOps stack for our project, we needed a solution that allowed for a high degree of customization and the flexibility to evolve as our experimentation dictated. We considered large platforms that encompassed many functions, but found them limiting in some key areas. Ultimately, we decided on an approach where separate specialized tools were implemented for labeling, data versioning, and continuous integration. This article documents our experience building this custom MLOps approach.

Photo by Finding Dan | Dan Grinwis on Unsplash




NBDEV


The classic problem with using Jupyter for development is that moving from prototype to production requires copying and pasting code from a notebook into a Python module. NBDEV automates the transition between notebook and module, thus enabling the Jupyter notebook to be an official part of a production pipeline. NBDEV allows the developer to state which module a notebook should create, which notebook cells to push to the module and which notebook cells are tests. A key capability of NBDEV is its approach to testing within the notebooks, and the NBDEV template even provides a base Github Action to implement testing in the CI/CD framework. The resulting Python module requires no editing by the developer, and can easily be integrated into other notebooks or the project at large using built-in Python import functionality.


DVC/CML




The files used in machine learning pipelines are often large archives of binary or compressed files, which are impractical or cost-prohibitive to store in existing version control solutions like git. DVC solves data versioning by representing each large dataset as a hash of its file contents, which enables DVC to track changes. It works similarly to git (e.g. dvc add, dvc push). When you run dvc add on your dataset, the dataset is added to .gitignore and DVC tracks it for changes. CML is a project that provides functionality for publishing model artifacts from Github Actions workflows into comments attached to Github Issues, Pull Requests, etc. That is important because it helps us start to fill in the gaps in Pull Requests by accounting for training data changes and the resulting model accuracy and effectiveness.
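
To make the idea of tracking a dataset by a hash of its contents concrete, here is a minimal sketch in plain Python. It is not DVC's implementation; the file name is hypothetical, and MD5 is used only because it is a common choice for content fingerprints.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Tuple


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, recorded_hash: str) -> bool:
    """True when the file's current content hash differs from the recorded one."""
    return file_md5(path) != recorded_hash


def demo() -> Tuple[bool, bool]:
    """Record a hash, then check for changes before and after editing the file."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "labels.json"  # hypothetical dataset file
        data.write_bytes(b'{"label": "cat"}')
        recorded = file_md5(data)  # the small value that would live in version control
        before_edit = has_changed(data, recorded)
        data.write_bytes(b'{"label": "dog"}')
        after_edit = has_changed(data, recorded)
    return before_edit, after_edit
```

Only the short hash needs to be committed to git; the large file itself can live in separate storage, which is what makes this approach practical for big datasets.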


Github Actions




We want automated code testing, including building models in the automated testing pipeline. Github Actions competes with CircleCI, Travis and Jenkins as a way to automate testing around code pushes, commits, pull requests, etc. Since we’re already using Github to host our repos, we avoid another third-party app by using Actions. In this project we need to use Github self-hosted runners to run jobs on an on-prem GPU cluster.


Label Studio




We did a deep dive into how we’re using Label Studio in a separate article. Label Studio is a solution for labeling data. It works well, and is flexible enough to run in a variety of environments.


Why use them together?

The setup is designed to deploy models faster. That means more data scientists working harmoniously in parallel, transparency in the repository and faster onboarding for new people. The goal is to standardize the types of activities that data scientists need to do in a project and to provide clear instructions for them.

The following is a list of tasks we want to streamline with this system design:

  1. Automate the ingest from Label Studio and provide a single point for ingesting that data into the model training and evaluation activities.
  2. Automate testing of the data pipeline code, that is, unit testing and re-deployment of the containers used by the process.
  3. Automate testing of the model code, that is, unit testing and re-deployment of the containers used by the process.
  4. Enable automated testing to include model re-training and evaluation criteria. When the model code changes, train a model with the new code and compare it to the incumbent model.
  5. Trigger model retraining when training data changes.

Below is a description of the pipeline for each task.


Traditional CI/CD Pipeline

This pipeline implements automated testing feedback for each pull request, including evaluation of syntax, unit, regression and integration tests. The outcome of this process is a functionally tested Docker image pushed to our private repository. This process maximizes the likelihood that the latest and best code is in a fully tested image available in the repository for downstream tasks. Here’s how the developer lifecycle works in the context of a new feature:

Here we show how the workflow functions while editing the code. Using NBDEV enables us to work directly from the Jupyter notebooks, including writing the tests directly in the notebook. NBDEV requires that all the cells in the notebooks run without exception (unless the cell is flagged not to run). (Image by Author)
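
As a rough illustration of what writing tests in the notebook looks like, the cells below are sketched as one Python listing. The comment flags follow the style used by the first version of nbdev as we understand it (newer releases spell the flags differently), and the module name and function are hypothetical.

```python
# Cell 1: declare which module this notebook generates (module name is hypothetical).
# default_exp metrics

# Cell 2: the export flag marks a cell whose code is written to the module.
#export
def accuracy(preds, targets):
    """Fraction of predictions that match their targets."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(p == t for p, t in zip(preds, targets)) / len(preds)

# Cell 3: no export flag, so this cell stays in the notebook and acts as a test
# whenever the notebook is executed, locally or in CI.
assert accuracy([1, 0, 1, 1], [1, 1, 1, 1]) == 0.75
assert accuracy([], []) == 0.0
```

Because every unflagged cell must run without raising, a failing assertion in a cell like the third one is enough to fail the notebook test step.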


Data pipeline

Label Studio currently lacks event hooks that would enable updates whenever the stored label data changes. So we take a cron-triggered approach, updating the dataset every hour. Additionally, while the Label Studio training dataset is small enough, the updates can be done as part of the training pipeline as well. We can also trigger the data pipeline refresh on demand using the Github Actions interface.
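
A scheduled refresh of this kind can be sketched in a few lines of Python. This is an illustration rather than our pipeline code: the export endpoint and token header follow the Label Studio REST API as we understand it, and the function names, base URL and output path are assumptions.

```python
import json
import urllib.request
from pathlib import Path


def build_export_request(base_url: str, proj_id: int, token: str) -> urllib.request.Request:
    """Build the HTTP request for a JSON export of one Label Studio project."""
    url = f"{base_url.rstrip('/')}/api/projects/{proj_id}/export?exportType=JSON"
    return urllib.request.Request(url, headers={"Authorization": f"Token {token}"})


def refresh_labels(base_url: str, proj_id: int, token: str, out_fp: Path) -> int:
    """Download the current labels, write them to out_fp, and return the task count."""
    req = build_export_request(base_url, proj_id, token)
    with urllib.request.urlopen(req, timeout=60) as resp:
        tasks = json.load(resp)
    out_fp.parent.mkdir(parents=True, exist_ok=True)
    out_fp.write_text(json.dumps(tasks, indent=2))
    return len(tasks)
```

A cron-triggered job would call refresh_labels once per run and then hand the written file to the data versioning step.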

The data pipeline feeds from Label Studio, and persists every version of the dataset and relevant inputs to the DVC cache stored in AWS S3. (Image by Author)


Model Pipeline

The modeling pipeline integrates model training into the CI/CD pipeline for the repository. This enables each pull request not only to evaluate the syntax, unit, integration and regression tests configured on the codebase, but also to provide feedback that includes an evaluation of the newly trained model.

In this case, the workflow runs the model training experiment specified in the configuration file (model_params.yaml) and updates the model artifact (best-model.pth). (Image by Author)
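
The comparison between a newly trained model and the incumbent can be sketched as follows. The metric name, tolerance and report layout are hypothetical choices for illustration; the resulting markdown is the kind of text a tool such as CML could attach to a pull request.

```python
from typing import Dict


def compare_to_incumbent(new: Dict[str, float], incumbent: Dict[str, float],
                         metric: str = "accuracy", tolerance: float = 0.005) -> str:
    """Classify the candidate as better, worse or same relative to the incumbent."""
    delta = new[metric] - incumbent[metric]
    if delta > tolerance:
        return "better"
    if delta < -tolerance:
        return "worse"
    return "same"


def markdown_report(new: Dict[str, float], incumbent: Dict[str, float],
                    metric: str = "accuracy") -> str:
    """Render a small markdown table plus a verdict line for a pull request comment."""
    verdict = compare_to_incumbent(new, incumbent, metric)
    lines = [
        f"| model | {metric} |",
        "| --- | --- |",
        f"| incumbent | {incumbent[metric]:.4f} |",
        f"| candidate | {new[metric]:.4f} |",
        "",
        f"Verdict: {verdict}",
    ]
    return "\n".join(lines)
```

The tolerance keeps small run-to-run noise from being reported as a real improvement or regression.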


Benchmark Evaluation Pipeline

The benchmarking pipeline forms an “official submission” process to ensure all modeling activities are measured against the metrics of the project.

The newly trained model in best-model.pth is evaluated against the benchmark dataset and the results are tagged with the latest commit hash and persisted in AWS S3. (Image by Author)
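
One way to tag benchmark results with the commit that produced them is sketched below. The key layout, the bucket name and the use of boto3 for the upload are assumptions for illustration, not a description of our exact setup.

```python
import json
import subprocess
from datetime import datetime, timezone


def current_commit_hash() -> str:
    """Return the hash of the commit checked out in the current git repository."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def results_key(commit_hash: str, prefix: str = "benchmarks") -> str:
    """Build an object key that groups benchmark results by commit (layout is hypothetical)."""
    return f"{prefix}/{commit_hash}/metrics.json"


def results_payload(metrics: dict, commit_hash: str) -> str:
    """Serialize the metrics together with the commit hash and an evaluation timestamp."""
    record = {
        "commit": commit_hash,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
    }
    return json.dumps(record, indent=2, sort_keys=True)


# The upload is left as a comment because it needs AWS credentials and boto3:
#   import boto3
#   commit = current_commit_hash()
#   boto3.client("s3").put_object(
#       Bucket="my-benchmark-bucket",  # hypothetical bucket name
#       Key=results_key(commit),
#       Body=results_payload(metrics, commit),
#   )
```

Keying results by commit hash means every benchmark submission can be traced back to the exact code that produced it.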



Here is the DAG definition file that is used by DVC. It captures the workflow steps and their inputs, and allows for reproducibility across users and machines.

    cmd: python pipelines/ --config_fp pipelines/traditional_pipeline.yaml
      --ls_token *** --proj_root "."
    - pipelines/traditional_pipeline.yaml:
      - src.out_fp
      - src.proj_id
    cmd: python pipelines/ --config_fp pipelines/create_traditional.yaml
      --proj_root "."
    - data/raw_labels/traditional.json
    - pipelines/create_traditional.yaml:
      - dataset.bmdata_fp
      - dataset.labels_map
      - dataset.out_fp
      - dataset.rawdata_dir
    cmd: python pipelines/ --config_fp pipelines/model_params.yaml
      --proj_root "."
    - data/traditional_labeling
    - pipelines/model_params.yaml:
      - dataloader.size
      - dataloader.train_fp
      - dataloader.valid_fp
      - learner.backbone
      - learner.data_dir
      - learner.in_checkpoint
      - learner.metrics
      - learner.n_out
      - learner.wandb_project_name
      - train.cycles
    cmd: python pipelines/ --config_fp pipelines/benchmark_pipeline.yaml
      --ls_token *** --proj_root "."
    - pipelines/benchmark_pipeline.yaml:
      - src.out_fp
      - src.proj_id
    cmd: python pipelines/ --config_fp pipelines/create_benchmark.yaml
      --proj_root "."
    - data/raw_labels/benchmark.json
    - pipelines/create_benchmark.yaml:
      - dataset.bmdata_fp
      - dataset.labels_map
      - dataset.out_fp
      - dataset.rawdata_dir
    cmd: python pipelines/ --config_fp pipelines/bench_eval.yaml --proj_root
    - data/models/best-model.pth
    - pipelines/bench_eval.yaml:
      - eval.bench_fp
      - eval.label_config
      - eval.metrics_fp
      - eval.model_conf
      - eval.overlay_dir
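
Entries such as learner.backbone or train.cycles appear to be dotted paths into the named YAML file, so a stage only needs to re-run when one of its tracked values changes. The sketch below illustrates that lookup on plain dictionaries; the configuration values are made up, and this is not DVC's own code.

```python
from typing import Any, Dict, List


def get_param(config: Dict[str, Any], dotted_key: str) -> Any:
    """Resolve a dotted key such as 'learner.backbone' inside a nested config."""
    node: Any = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


def changed_params(old: Dict[str, Any], new: Dict[str, Any], tracked: List[str]) -> List[str]:
    """Return the tracked keys whose values differ between two versions of a config."""
    return [key for key in tracked if get_param(old, key) != get_param(new, key)]


# Hypothetical contents of two versions of a parameters file.
old_config = {"learner": {"backbone": "resnet34", "n_out": 2}, "train": {"cycles": 5}}
new_config = {"learner": {"backbone": "resnet34", "n_out": 2}, "train": {"cycles": 10}}

# Only train.cycles changed, so only stages that track it would need to re-run.
stale = changed_params(old_config, new_config, ["learner.backbone", "train.cycles"])
```

Tracking individual keys rather than the whole file means an edit to an unrelated parameter does not invalidate a stage.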





  1. The Github Actions workflow cron trigger is not especially reliable. It does not guarantee timing.
  2. DVC does not work cleanly inside a Github Actions workflow that is triggered on push. It alters the tracking files that are under source control, and when that change is committed it triggers another Github Actions run.
  3. Using Github Actions as the mechanism to run model training requires a self-hosted runner in order to use a GPU. This means connecting to a GPU instance in the cloud or on-prem, and this presents issues with access control. For example, we can’t open source the exact repo without removing that self-hosted runner configuration from it, or else random people would be able to run workloads on our training server by pushing code to the project.
  4. The NBDEV built-in workflow tests the code in the wrong place: it tests the notebook instead of the compiled package. On the one hand, it’s nice to be able to say that the “tests can be written right into the notebook”. On the other hand, testing the notebooks directly leaves open the possibility that the code package created by NBDEV fails even though the notebook ran. What we need is the ability to test the NBDEV-compiled package directly.
  5. NBDEV doesn’t interoperate with “traditional” Python development in the sense that NBDEV is a one-way street. It simply allows the project to be developed in the interactive Jupyter notebook style, and it makes it impossible to develop the Python modules directly. If at any point the project needs to be converted to “traditional” Python development, testing would need to be accomplished another way.
  6. In the beginning, we were using Weights & Biases as our experiment tracking dashboard; however, there were issues deploying it into a Github Action. What we can say is that the user experience for implementing wandb hit its first hiccup in the Action workflow. Removing Weights & Biases resolved the problem straight away. Before that, wandb stood out as the best user experience in MLOps.



Ultimately, it took one week to complete the implementation of these tools for managing our code with Github Actions, the DVC and CML tools, and NBDEV. This provides us with the following capabilities:

  1. Work from Jupyter notebooks as the system of record for the code. We like Jupyter. Its main use for us is that we can work directly on any hardware we can SSH into by hosting a Jupyter server there and forwarding it to a desktop. To be clear, we would be doing this even if we were not using NBDEV, because the alternative is using Vim or some such tool that we don’t like as much. Past experiments to connect to remote servers with VS Code or PyCharm failed. So it’s Jupyter.
  2. Testing the code, and testing the model it creates. Now, as part of the CI/CD pipeline, we can evaluate whether the changes to the repo make the resulting model better, worse or the same. This is all available in the pull request before it is merged into main.
  3. Using the Github Actions server as an orchestrator for training runs begins to allow multiple data scientists to work simultaneously in a more transparent manner. Going forward, we will see the limitations of this setup for orchestrating the collaborative data science process.

Aaron Soellinger has formerly worked as a data scientist and software engineer solving problems in finance, predictive maintenance and sports. He currently works as a machine learning systems consultant with Hoplabs working on a multi-camera computer vision application.

Will Kunz is a back end software developer, bringing a can-do attitude and dogged determination to challenges. It doesn't matter if it's tracking down an elusive bug or adapting quickly to a new technology. If there's a solution, Will wants to find it.

Original. Reposted with permission.