Nitpicking Machine Learning Technical Debt

Technical Debt in software development is pervasive. With machine learning engineering maturing, this classic trouble is unsurprisingly rearing its ugly head. These 25 best practices, first described in 2015 and promptly overshadowed by shiny new ML techniques, are updated for 2020 and ready for you to follow -- and lead the way to better ML code and processes in your organization.

By Matthew McAteer, researcher turned Probabilistic Machine Learning Engineer.

I recently revisited the paper Hidden Technical Debt in Machine Learning Systems (Sculley et al. 2015) (which I’ll refer to as the Tech Debt Paper throughout this post for the sake of brevity and clarity). This was a paper shown at NeurIPS 2015, but it sort of fell to the background because at the time, everyone was swooning over projects based on this new “Generative Adversarial Networks” technique from Ian Goodfellow.

Now the Tech Debt Paper is making a comeback. At the time of writing this, there have been 25 papers citing this in the last 75 days. This is understandable, as machine learning has gotten to the point where we need to worry about technical debt. However, if a lot of people are going to be citing this paper (even if only to cite every paper that has the phrase “machine learning technical debt” in it), we should at least be aware of which parts have and have not stood the test of time. With that in mind, I figured it would save a lot of time and trouble for everyone involved to write up which parts are outdated, and point out the novel methods that have superseded them. Having worked at companies ranging from fast-growing startups to large companies like Google (the company of the Tech Debt Paper authors), and seeing the same machine learning technical debt mistakes being made everywhere, I felt qualified to comment on this.

Notice that ML Code is the tiny and insignificant black box. One of the good parts of the Tech Debt Paper, though sadly many reimplementations of this image don't have the different sizes of the boxes, and thus miss the point.

This post covers some of the relevant points of the Tech Debt Paper, while also giving additional advice on top that’s not 5 years out of date. Some of this advice is in the form of tools that didn’t exist back then…and then some are in the form of tools/techniques that definitely did exist that the authors missed a huge opportunity by not bringing up.




Tech debt is an analogy for the long-term buildup of costs when engineers make design choices for speed of deployment over everything else. Fixing technical debt can take a lot of work. It’s the stuff that turns “Move fast and break things” into “Oh no, we went too fast and gotta clean some of this up.”

Less catchy, so I can understand why Mark Zuckerberg was less forthcoming about the second slogan.

Okay, we know technical debt in software is bad, but the authors of this paper assert that technical debt for ML systems specifically is even worse. The Tech Debt Paper proposes a few types of tech debt in ML and for some of them a few solutions (like how there are different recycling bins, different types of garbage code need different approaches 🚮). Given that the Tech Debt Paper was an opinion piece that was originally meant to get people’s attention, it’s important to note several pieces of advice from this work that may no longer be relevant, or may have better solutions in the modern day.


Part 1: ML tech debt is worse than you thought


You’re all probably familiar by now with technical debt. The Tech Debt Paper starts with a clarification that by technical debt, we’re not referring to adding new capabilities to existing code. This is the less glamorous task of writing unit tests, improving readability, adding documentation, getting rid of unused sections, and other such tasks for the sake of making future development easier. Well, since standard software engineering is a subset of the skills needed in machine learning engineering, more familiar software engineering tech debt is just a subset of the space of possible ML tech debt.

I'm paraphrasing, but this is pretty much what Sculley et al. 2015 are saying.


Part 2: The Nebulous Nature of Machine Learning


The Tech Debt Paper section after the intro goes into detail about how the nebulous nature of machine learning models makes dealing with tech debt harder. A big part of avoiding or correcting technical debt is making sure the code is properly organized and segregated. The fact is we often use machine learning in cases where precise rules or needs are super hard to specify in real code. Instead of hardcoding the rules to turn data into outputs, more often than not, we’re trying to give an algorithm the data and the outputs (and sometimes not even that) to output the rules. We don’t even know what the rules that need segregation and organizing are.

Best Practice #1: Use interpretability/explainability tools.

This is where the problem of entanglement comes in. Basically, if you change anything about a model, you risk changing the performance of the whole system. For example, taking a 100-feature model on health records for individuals and adding a 101st feature (like, you’re suddenly listing whether or not they smoked weed). Everything’s connected. It’s almost like dealing with a chaotic system (ironically enough, a few mathematicians have tried to describe neural networks as chaotic attractors as though they were double pendulums or weather systems).

This, but there are hundreds of strings, and the blinds can go OUT the window too.
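To make the entanglement point concrete, here is a minimal sketch using a two-feature least-squares fit instead of a 100-feature model. The data and the helper names (`fit_one`, `fit_two`) are my own illustration, not anything from the Tech Debt Paper. The only point is that adding one correlated feature changes the weight the model had already assigned to the existing one.

```python
# Toy illustration of entanglement: adding one correlated feature
# changes the weight the model assigns to an existing feature.

def fit_one(x, y):
    """Least-squares weight for y ~ w * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def fit_two(x1, x2, y):
    """Least-squares weights for y ~ w1 * x1 + w2 * x2 via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12  # zero only if x1 and x2 are perfectly collinear
    w1 = (s22 * s1y - s12 * s2y) / det
    w2 = (s11 * s2y - s12 * s1y) / det
    return w1, w2

# Hypothetical data: x2 is correlated with x1, and y depends on both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 1.0, 2.0, 2.0, 3.0]
y = [a + 2.0 * b for a, b in zip(x1, x2)]  # generated with weights 1 and 2

w_before = fit_one(x1, y)                # model that only sees x1
w1_after, w2_after = fit_two(x1, x2, y)  # model after adding x2

print(f"weight on x1 before adding x2: {w_before:.3f}")
print(f"weight on x1 after adding x2:  {w1_after:.3f}")
```

Before the second feature arrives, the model overloads `x1` with credit that really belongs to `x2`. Anything downstream that was tuned against the old weight is now tuned against the wrong thing.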

The authors suggest a few possible fixes like ensembling models or high-dimensional visualization tools, but even these fall short if any of the ensembled model outputs are correlated, or if the data is too high-dimensional. A lot of the recommendations for interpretable ML are a bit vague. With that in mind, I recommend checking out Facebook’s high-dimensional visualization tool, as well as reading by far the best resource I’ve seen on interpretable machine learning: “Interpretable Machine Learning” by Christoph Molnar (available online here)

Sometimes using more explainable model types, like decision trees, can help with this entanglement problem, but the jury’s still out on best practices for solving this for neural networks.

Best Practice #2: Use explainable model types if possible.

Correction Cascades are what happens when some of the inputs to your nebulous machine learning model are themselves nebulous machine learning models. It’s just setting up this big domino rally of errors. It is extremely tempting to set up sequences of models like this, for example, when applying a pre-existing model to a new domain (or a “startup pivot” as so many insist on calling it). You might have an unsupervised dimensionality reduction step right before your random forest, but changing the t-SNE parameters suddenly tanks the performance of the rest of the model. In the worst case scenario, it’s impossible to improve any of the subcomponents without detracting from the performance of the entire system. Your machine learning pipeline goes from being positive sum to zero sum (that’s not a term from the Tech Debt Paper, I just felt like not adding it in was a missed opportunity).

Courtesy Randall Munroe of XKCD.

As far as preventing this, one of the better techniques is a variant of greedy unsupervised layer-wise pretraining (or GULP). There’s still some disagreement on the mathematical reasons WHY this works so well, but basically you train the early models or early parts of your ensembles, freeze them, and then work your way up the rest of the sequence (again, not mentioning this in the Tech Debt Paper was another missed opportunity, especially since the technique has existed at least since 2007).

Best Practice #3: Always re-train downstream models in order.
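As a rough sketch of what "in order" means in practice: fit the first stage, treat it as frozen, and only then fit the next stage on its outputs. The two toy stages below (`MeanCenter`, `Scale`) and the `retrain_in_order` helper are hypothetical stand-ins for real models. This is not an implementation of layer-wise pretraining itself, just the ordering discipline.

```python
class MeanCenter:
    """Toy upstream stage: learns the mean and subtracts it."""

    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [x - self.mean for x in xs]


class Scale:
    """Toy downstream stage: learns the largest absolute value and divides by it."""

    def fit(self, xs):
        self.scale = max(abs(x) for x in xs) or 1.0  # avoid dividing by zero
        return self

    def transform(self, xs):
        return [x / self.scale for x in xs]


def retrain_in_order(stages, data):
    """Fit each stage on the output of the already-fitted stages before it."""
    current = data
    for stage in stages:
        stage.fit(current)                  # train this stage...
        current = stage.transform(current)  # ...then treat it as frozen for the next one
    return current


pipeline = [MeanCenter(), Scale()]
outputs = retrain_in_order(pipeline, [2.0, 4.0, 6.0])
print(outputs)
```

If the upstream stage changes, you rerun `retrain_in_order` from that stage onward rather than patching the downstream stage in isolation.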

Another inconvenient feature of machine learning models: more consumers might be relying on the outputs than you realize, beyond just other machine learning models. This is what the authors refer to as Undeclared Consumers. The issue here isn’t that the output data is unstructured or not formatted right, it’s that nobody’s taking stock of just how many systems depend on the outputs. For example, there are plenty of custom datasets on sites like Kaggle, many of which are themselves machine learning model outputs. A lot of projects and startups will often use datasets like this to build and train their initial machine learning models in lieu of having internal datasets of their own. Scripts and tasks that are dependent on these can find their data sources changing with little notice. The problem is compounded for APIs that don’t require any kind of sign-in to access data. Unless you have some kind of barrier to entry for accessing the model outputs, like access keys or service-level agreements, this is a pretty tricky one to handle. You may be just saving your model outputs to a file, and then someone else on the team may decide to use those outputs for a model of their own because, hey, why not, the data’s in the shared directory. Even if it’s experimental code, you should be careful about who’s accessing model outputs that aren’t verified yet. This tends to be a big problem with toolkits like JupyterLab (if I could go back in time and add any kind of warning to the Tech Debt Paper, it would be a warning about JupyterLab).

Basically fixing this type of technical debt involves cooperation between machine learning engineers and security engineers.

Best Practice #4: Set up access keys, directory permissions, and service-level-agreements.
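Here is a minimal sketch of the access-key idea, assuming a single in-process registry. The `OutputRegistry` class and its method names are hypothetical. A real setup would lean on whatever your security engineers already use for keys and permissions. The point is only that every consumer has to declare itself before it can read anything.

```python
import secrets


class OutputRegistry:
    """Hands out model outputs only to consumers that registered for a key."""

    def __init__(self):
        self._consumers = {}  # access key -> consumer name
        self._reads = {}      # consumer name -> number of reads so far

    def register(self, consumer_name):
        """Declare a consumer and return its access key."""
        key = secrets.token_hex(16)
        self._consumers[key] = consumer_name
        self._reads[consumer_name] = 0
        return key

    def read(self, key, outputs):
        """Return a copy of the outputs, but only for a known key."""
        if key not in self._consumers:
            raise PermissionError("unknown access key")
        self._reads[self._consumers[key]] += 1
        return list(outputs)

    def declared_consumers(self):
        """Who depends on these outputs, and how often they read them."""
        return dict(self._reads)


registry = OutputRegistry()
dashboard_key = registry.register("churn-dashboard")  # hypothetical consumer
registry.read(dashboard_key, [0.12, 0.87])
print(registry.declared_consumers())
```

Now when you want to change the model, `declared_consumers()` tells you who to warn first.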

You wish your downstream consumer graph was as simple or as comprehensive as this.


Part 3: Data Dependencies (on top of regular dependencies)


The third section goes a bit deeper with data dependency issues. More bad news: in addition to the regular code dependencies of software engineering, machine learning systems will also depend on large data sources that are probably more unstable than the developers realize.

For example, your input data might take the form of a lookup table that’s changing underneath you, or a continuous data stream, or you might be using data from an API you don’t even own. Imagine if the host of the MolNet dataset decided to update it with more accurate numbers (ignoring for a moment how they would do this). While the data may reflect reality more accurately, countless models have been built against the old data, and many of the makers will suddenly find that their accuracy is tanking when they re-run a notebook that definitely worked just last week.

One of the proposals by the authors is to use data dependency tracking tools like Photon for versioning. That being said, in 2020, we also have newer tools like DVC, which literally just stands for “Data Version Control”, that make Photon obsolete for the most part. It behaves much the same way as git, and saves a DAG keeping track of the changes in a dataset/database. Two other great tools to be used together for versioning are Streamlit (for keeping track of experiments and prototypes) and Netflix’s Metaflow. How much version control you do will come down to a tradeoff between extra memory and avoiding some giant gap in the training process. Still, insufficient or inappropriate versioning will lead to enormous survivorship bias (and thus wasted potential) when it comes to model training.

This is the DVC page. Go download it. Seriously, go download it now!

Best Practice #5: Use a data versioning tool.
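If you want a feel for what a versioning tool buys you before installing one, here is a minimal sketch that fingerprints a dataset with a content hash. To be clear, this is not DVC's API or how it works internally. The `dataset_fingerprint` helper and the example rows are made up for illustration.

```python
import hashlib


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest over the rows, in order."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")  # row separator so adjacent rows cannot blur together
    return digest.hexdigest()


# Fingerprint recorded when the model was trained (hypothetical rows).
pinned = dataset_fingerprint([("aspirin", 180.16), ("caffeine", 194.19)])

# Fingerprint of what the data source serves today.
current = dataset_fingerprint([("aspirin", 180.16), ("caffeine", 194.2)])

if current != pinned:
    print("Dataset changed since training; re-validate before trusting old metrics.")
```

Storing the fingerprint next to the trained model is the cheap version. A proper tool also keeps the old data around so you can actually go back to it.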

The data dependency horror show goes on. Compared to the unstable data dependencies, the underutilized ones might not seem as bad, but that’s how they get you! Basically, you need to keep a lookout for data that’s unused, data that was once used but is considered legacy now, and data that’s redundant because it’s heavily correlated with something else. If you’re managing a data pipeline where it turns out entire gigabytes are redundant, that will incur development costs on its own just as well.

The correlated data is especially tricky because you need to figure out which variable is the correlated one and which is the causative one. This is a big problem in biological data. Tools like ANCOVA are increasingly outdated, and they’re unfortunately being used in scenarios where some of the ANCOVA assumptions definitely don’t apply. A few groups have tried proposing alternatives like ONION and Domain Aware Neural Networks, but many of these are improving upon fairly unimpressive standard approaches. Some companies like Microsoft and QuantumBlack have come up with packages for causal disentanglement (DoWhy and CausalNex, respectively). I’m particularly fond of DeepMind’s work on Bayesian Causal Reasoning. Most of these were not around at the time of the Tech Debt Paper’s writing, and many of these packages have their own usability debt, but it’s important to make it known that ANCOVA is not a one-size-fits-all solution to this.

Best Practice #6: Drop unused files, extraneous correlated features, and maybe use a causal inference toolkit.
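Spotting the redundant features is the easy half, and it can be sketched in a few lines. The feature names, the data, and the 0.95 threshold below are all assumptions for illustration. Note that this only flags features that move together. It says nothing about which one is causal, which is where the causal inference toolkits come in.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(features, threshold=0.95):
    """List feature pairs whose absolute correlation meets the threshold."""
    names = sorted(features)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(features[first], features[second])) >= threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical feature table: the same height recorded in two units, plus age.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [40.0, 22.0, 35.0, 28.0],
}
print(redundant_pairs(features))
```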

Anyway, the authors were a bit less pessimistic about the fixes for these. They suggested a static analysis of data dependencies, giving the one used by Google in their click-through predictions as an example. Since the Tech Debt Paper was published, the pool of options for addressing this has grown a lot. For example, there are tools like Snorkel, which lets you track which slices of data are being used for which experiments. Cloud services like AWS and Azure have their own data dependency tracking services for DevOps, and there are also tools like Red Gate SQL dependency tracker. So, yeah, looks like the authors were justified in being optimistic about that one.

Best Practice #7: Use any of the countless DevOps tools that track data dependencies.


Part 4: Frustratingly undefinable feedback loops


Now we had a bit of a hope spot in the previous section, but the bad news doesn’t just stop at data dependencies. Section 4 of the paper goes into how unchecked feedback loops can influence the machine learning development cycle. This can refer either to direct feedback loops, like in semi-supervised learning or reinforcement learning, or to indirect loops, like engineers basing their design choices off of another machine learning output. This is one of the least defined issues in the Tech Debt Paper, but countless other organizations are working on this feedback loop problem, including what seems like the entirety of OpenAI (at least that’s what the “Long Term Safety” section of their charter suggested, before all that “Capped Profit” hubbub). What I’m trying to say is that if you’re going to be doing research on direct or indirect feedback loops, you’ve got much better and more specific options than this paper.

This one goes back on track with solutions that seem a bit more hopeless than the last section. They give examples of bandit algorithms as being resistant to the direct feedback loops, but not only do those not scale, technical debt accumulates the most when you’re trying to build systems at scale. Useless. The indirect feedback fixes aren’t much better. In fact, the systems in the indirect feedback loop might not even be part of the same organization. This could be something like trading algorithms from different firms, each trying to meta-game each other, but instead causing a flash crash. Or a more relevant example in biotech, suppose you have a model that’s predicting the error likelihood for a variety of pieces of lab equipment. As time goes on, the actual error rate could go down because people have become more practiced with it, or possibly up because the scientists are using the equipment more frequently, but the calibrations haven’t increased in frequency to compensate. Ultimately, fixing this comes down to high-level design decisions, and making sure you check as many assumptions behind your model’s data (especially the independence assumption) as possible.

This is also an area where many principles and practices from security engineering become very useful (e.g., tracking the flow of data throughout a system, searching for ways the system can be abused before bad actors can make use of them).

These are just a few examples of direct feedback loops for one model. In reality, some of the blocks in this diagram may themselves be ML models. This doesn't scratch the surface of indirect interactions.

Best Practice #8: Check independence assumptions behind models (and work closely with security engineers).
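One cheap way to start checking the independence assumption is to ask whether consecutive observations are correlated with each other. The sketch below computes a lag-1 autocorrelation on a hypothetical series of weekly equipment error rates. It is a rough screen under my own assumptions, not a formal statistical test, and the numbers are invented.

```python
def lag1_autocorrelation(values):
    """Correlation between each value and the one right after it."""
    n = len(values)
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)
    numerator = sum(
        (values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1)
    )
    return numerator / denominator


# Hypothetical weekly error rates for one piece of lab equipment.
drifting = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08]

score = lag1_autocorrelation(drifting)
if abs(score) > 0.5:  # threshold is an arbitrary choice for this sketch
    print(f"lag-1 autocorrelation {score:.2f}: samples do not look independent")
```

A value near zero is what you would hope for if each week really were an independent draw. A steadily drifting series like this one is a hint that something (practice, usage, calibration schedule) is feeding back into the data.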

By now, especially after the ANCOVA comments, you’re probably sensing a theme about testing assumptions. I wish this was something the authors devoted at least an entire section to.


Part 5: Common no-no patterns in your ML code


The “Anti-patterns” section of the Tech Debt Paper was a little more actionable than the last one. This part went into higher-level patterns that are much easier to spot than indirect-feedback loops.

(This is actually a table from the Tech Debt Paper, but with hyperlinks to actionable advice on how to fix them. This table was possibly redundant, as the authors discuss unique code smells and anti-patterns in ML, but these are all regular software engineering anti-patterns you should address in your code first.)

Design Defect               Type
Feature Envy                Code Smell
Data Class                  Code Smell
Long Method                 Code Smell
Long Parameter List         Code Smell
Large Class                 Code Smell
God Class                   Antipattern
Swiss Army Knife            Antipattern
Functional Decomposition    Antipattern
Spaghetti Code              Antipattern

The majority of these patterns revolve around the 90% or more of ML code that’s just maintaining the model. This is the plumbing that most people in a Kaggle competition might think doesn’t exist. Solving cell segmentation is a lot easier when you’re not spending most of your time digging through the code connecting the Tecan Evo camera to your model input.

Best Practice #9: Use regular code-reviews (and/or use automatic code-sniffing tools).

The first ML-anti-pattern introduced is called “glue code”. This is all the code you write when you’re trying to fit data or tools from a general-purpose package into a super-specific model that you have. Anyone that’s ever tried doing something with packages like RDKit knows what I’m talking about. Basically, most of the stuff you shove into the file can count as this (everyone does it). These (hopefully) should be fixable by repackaging these dependencies as more specific API endpoints.

We all have files we're not proud of.

Best Practice #10: Repackage general-purpose dependencies into specific APIs.
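As a sketch of what repackaging can look like, here the standard library's `csv` module stands in for the general-purpose package, and one narrow function is the only thing the model code gets to call. The function name `read_plate_readings` and the column names are hypothetical.

```python
import csv
import io


def read_plate_readings(csv_text):
    """The one narrow entry point the model code is allowed to call.

    Returns (well, absorbance) pairs, so callers never touch csv directly.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["well"], float(row["absorbance"])) for row in reader]


raw = "well,absorbance\nA1,0.5\nB2,1.25\n"
print(read_plate_readings(raw))
```

If the instrument export format changes, or you swap the parsing library, there is exactly one function to fix instead of a dozen scattered snippets.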

“Pipeline jungles” are a little bit trickier, as this is where a lot of glue code accumulates. This is where all the transformations that you add for every little new data source pile up into an ugly amalgam. Unlike with Glue Code, the authors pretty much recommend letting go and redesigning codebases like this from scratch. I want to say this is something that has more options nowadays, but when glue code turns into pipeline jungles, even tools like Uber’s Michelangelo can become part of the problem.

Of course, the advantage of the authors’ advice is that you can make this replacement code seem like an exciting new project with a cool name that’s also an obligatory Tolkien reference, like “Balrog” (and yes, ignoring unfortunate implications of your project name isn’t just Palantir’s domain. You’re free to do that as well).

This, but in software form.

Best Practice #11: Get rid of Pipeline jungles with top-down redesign/reimplementation.

On the subject of letting go: experimental code. Yes, you thought you could just save that experimental code for later. You thought you could just put it in an unused function or unreferenced file, and it would be all fine. Unfortunately, stuff like this is part of why maintaining backward compatibility can be such a pain in the neck. Anyone that’s taken a deep dive into the TensorFlow framework can see the remains of frameworks that were only partially absorbed, experimental code, or even incomplete “TODO” code that was left for some other engineer to take care of at a later date. You probably first came across these while trying to debug your mysteriously failing TensorFlow code. This certainly puts all the compatibility hiccups between TensorFlow 1.X and 2.X in a new light. Do yourself a favor, and don’t put off pruning your codebase for 5 years. Keep doing experiments, but set some criteria for when to quarantine an experiment away from the rest of the code.
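One possible criterion is simply a date. The sketch below is a hypothetical `experimental` decorator (my own invention, not from the Tech Debt Paper or any library) that warns while an experiment is live and refuses to run once its expiry date has passed, which forces someone to either promote the code or delete it.

```python
import datetime
import functools
import warnings


def experimental(expires, today=None):
    """Mark a function as experimental until `expires` (a datetime.date).

    `today` exists only so the check can be exercised with a fixed date.
    """
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            now = today or datetime.date.today()
            if now > expires:
                raise RuntimeError(
                    f"{func.__name__} expired on {expires}; promote it or delete it"
                )
            warnings.warn(
                f"{func.__name__} is experimental until {expires}", stacklevel=2
            )
            return func(*args, **kwargs)

        wrapper.is_experimental = True  # lets tooling list what is still experimental
        return wrapper

    return decorate


# Hypothetical experiment, with a fixed "today" so the example is reproducible.
@experimental(expires=datetime.date(2020, 6, 30), today=datetime.date(2020, 5, 1))
def new_normalizer(values):
    top = max(values)
    return [v / top for v in values]


print(new_normalizer([1.0, 2.0, 4.0]))
```

In real use you would drop the `today` argument and let the wall clock decide. The expiry date is a blunt instrument, but blunt is the point: nobody gets to forget about the experiment for 5 years.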