Nitpicking Machine Learning Technical Debt
Technical Debt in software development is pervasive. With machine learning engineering maturing, this classic trouble is unsurprisingly rearing its ugly head. These 25 best practices, first described in 2015 and promptly overshadowed by shiny new ML techniques, are updated for 2020 and ready for you to follow -- and lead the way to better ML code and processes in your organization.
Google Research's mess of a GitHub repo (or the top 5th, at least), and this is just the stuff that's public. And no, the README isn't equally detailed. Quite the opposite, in fact.
Best Practice #12: Set regular checks and criteria for removing code, or put the code in a directory or on a disk far-removed from the business-critical stuff.
Speaking of old code, you know what software engineering has had for a while now? Really great abstractions! Everything from the concept of relational databases to views in web pages. There are entire branches of applied category theory devoted to figuring out the best ways to organize code like this. Do you know what applied category theory hasn’t quite caught up to yet? That’s right, machine learning code organization. Software engineering has had decades of throwing abstraction spaghetti at the wall and seeing what sticks. Machine learning? Aside from MapReduce (which is, like, not as impressive as relational databases), async parameter servers (which nobody can agree on how to implement), or sync AllReduce (which just sucks for most use-cases), we don’t have much to show.
High-level overview of the previously mentioned Sync AllReduce server architecture (or a seriously oversimplified version).
High-level overview of the previously mentioned Async parameter server architecture (or at least one version of it).
In fact, between groups doing research on random networks and PyTorch advertising how fluid the nodes in their neural networks are, machine learning has been throwing the abstraction spaghetti clean out the window! I don’t think the authors realized that this problem was going to get MUCH worse as time went on. My recommendation? Read more of the literature on the popular high-level abstractions, and maybe don’t use PyTorch for production code.
Best Practice #13: Stay up-to-date on abstractions that are becoming more solidified with time.
I’ve met plenty of senior machine learning engineers who have pet frameworks that they like to use for most problems. I’ve also seen many of the same engineers watch their favorite framework fall to pieces when applied to a new context, or get replaced by another framework that’s functionally indistinguishable. This is especially prevalent in teams doing anything with distributed machine learning. I want to make something absolutely clear:
Aside from MapReduce, you should avoid getting too attached to any single framework. If your “senior” machine learning engineer believes with all their being that Michelangelo is the bee’s knees and will solve everything, they’re probably not all that senior. While ML engineering has matured, it’s still relatively new. An actually “senior” senior ML engineer will probably focus on making workflows that are framework agnostic, since they know most of those frameworks are not long for this world. ⚰️
Now, the previous sections mentioned a bunch of distinct scenarios and qualities of technical debt in ML, but they also provide examples of higher-level anti-patterns for ML development.
Most of you reading this have probably heard the phrase “code smell” going around. You’ve probably used tools like good-smell or PEP 8 auto-checking (or even the hot new Black auto-formatter that everyone is using on their production Python code). Truth be told, I don’t like this term “code smell”. “Smell” always seems to imply something subtle, but the patterns described in the next section are pretty blatant. Nonetheless, the authors list a few types of code smells that indicate a high level of debt (beyond the usual types of code smells). For some reason, they only started listing the code smells halfway into the section on code smells.
The “Plain data” smell
You may have code that’s dealing with a lot of data in the form of numpy floats. There may be little information preserved about the nature of the data, such as whether your RNA read counts represent samples from a Bernoulli distribution, or whether your float is the log of a number. They don’t mention this in the Tech Debt Paper, but this is one area where using typing in Python can help out. Avoiding unnecessary use of floats, or floats with too much precision, will go a long way. Again, using the built-in Decimal or Typing packages will help a lot, and not just for code navigation but also for speedups on CPUs (a quick sketch follows below).
Best Practice #14: Use packages like Typing and Decimal, and don’t use ‘float32’ for all data objects.
A majority of hackathon code, and unfortunately a lot of VC-funded "AI" startup code, looks like this under the hood.
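To make Best Practice #14 concrete, here’s a minimal sketch of what “typed” data handling can look like. The domain names (RnaSample, read counts, and so on) are hypothetical stand-ins, not anything from the Tech Debt Paper:

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import NewType

# Hypothetical domain types: the names are illustrative, not from the paper.
ReadCount = NewType("ReadCount", int)            # raw RNA read count
LogExpression = NewType("LogExpression", float)  # log-transformed expression value


@dataclass(frozen=True)
class RnaSample:
    """Carries the meaning of each number instead of a bare float32 array."""
    sample_id: str
    read_count: ReadCount
    log_expression: LogExpression
    concentration_ng_ul: Decimal  # exact decimal, e.g. as reported by a lab instrument


sample = RnaSample(
    sample_id="S-001",
    read_count=ReadCount(1523),
    log_expression=LogExpression(3.18),
    concentration_ng_ul=Decimal("12.5"),
)
```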
The “Prototyping” smell
Anyone who’s been in a hackathon knows that code slapped together in under 24 hours has a certain look to it. This ties back into the unused experimental code mentioned earlier. Yes, you might be all excited to try out the new PHATE dimensionality reduction tool for biological data, but either clean up your code or throw it out.
Best Practice #15: Don’t leave all works-in-progress in the same directory. Clean it up or toss it out.
The “Multi-language” smell
Speaking of language typing, multi-language codebases act almost like a multiplier for technical debt and make it pile up much faster. Sure, these languages all have their benefits. Python is great for building ideas fast. JavaScript is great for interfaces. C++ is great for graphics and making computations go fast. PHP…uhhh…okay, maybe not that one. Golang is useful if you’re working with Kubernetes (and you work at Google). But if you’re making these languages talk to each other, there will be a lot of spots for things to go wrong, whether it be broken endpoints or memory leaks. At least in machine learning, there are a few toolkits like Spark and TensorFlow that have similar semantics between languages. If you absolutely must use multiple languages, at least we now have that going for us post-2015.
C++ programmer switching to Python. Now picture an entire repo written by this person.
Best Practice #16: Make sure endpoints are accounted for, and use frameworks that have similar abstractions between languages.
(Calling this a code smell was a weird choice, as this is a pretty blatant pattern even by the standards of usual code smells.)
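If you do end up with, say, a Python model service feeding a JavaScript front end, one cheap mitigation is a contract test on the shared endpoint. Here’s a sketch using only the standard library; the endpoint URL and field names are made up for illustration:

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint and schema that both the Python and JavaScript sides agree on.
PREDICT_ENDPOINT = "http://localhost:8000/predict"
EXPECTED_FIELDS = {"model_version": str, "label": str, "score": float}


def check_contract(url: str = PREDICT_ENDPOINT) -> None:
    """Fail loudly if the response drifts from the agreed cross-language schema."""
    with urlopen(url) as resp:
        payload = json.loads(resp.read())
    for field, expected_type in EXPECTED_FIELDS.items():
        assert field in payload, f"missing field: {field}"
        assert isinstance(payload[field], expected_type), f"wrong type for {field}"
```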
Part 6: Configuration Debt (boring but easy to fix)
The “Configuration debt” section of the Tech Debt Paper is probably the least exciting one, but the problem it describes is the easiest to fix. Basically, this is just the practice of making sure all the tunable and configurable information about your machine learning pipeline is in one place, and that you don’t have to go searching through multiple directories just to figure out how many units your second LSTM layer had. Even if you’ve gotten into the habit of creating config files, the packages and technologies haven’t all caught up with you. Aside from some general principles, this part of the Tech Debt Paper doesn’t go into too much detail. I suspect that the authors of the Tech Debt Paper were more used to packages like Caffe (in which case yes, setting up configs with Caffe protobufs was objectively buggy and terrible).
Personally, I would suggest using a framework like tf.Keras or Chainer, if you’re going to be setting up configuration files. Most cloud services have some version of configuration management, but outside of that, you should at least be prepared to use a config.json file or parameter flags in your code.
Best Practice #17: Make it so you can set your file paths, hyperparameters, layer type and layer order, and other settings from one location.
A FANTASTIC example of a config file. Really wish the Tech Debt Paper had gone into more examples, or even pointed to some of the existing configuration file guides at the time.
If you’re going to be tuning these settings with a command line, try to use a package like Click instead of Argparse.
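As a minimal sketch of Best Practice #17 (the config.json layout and option names here are hypothetical), one file holds every tunable setting and Click only provides overrides:

```python
import json
from typing import Optional

import click

# Hypothetical config.json, kept in one place:
# {"data_path": "data/train.csv", "learning_rate": 0.001, "lstm_units": [128, 64], "epochs": 20}


@click.command()
@click.option("--config", "config_path", default="config.json", help="Path to the single config file.")
@click.option("--epochs", type=int, default=None, help="Optional override of the config value.")
def train(config_path: str, epochs: Optional[int]) -> None:
    """Load all paths, hyperparameters, and layer settings from one location."""
    with open(config_path) as f:
        cfg = json.load(f)
    if epochs is not None:
        cfg["epochs"] = epochs
    click.echo(f"Training with settings: {cfg}")
    # ... build layers from cfg["lstm_units"], read data from cfg["data_path"], etc.


if __name__ == "__main__":
    train()
```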
Part 7: The real world dashing your dreams of solving this
Section 7 acknowledges that a lot of managing tech debt is preparing for the fact that you’re dealing with a constantly changing real world. For example, you might have a model with some kind of decision threshold for converting a model output into a classification or a True/False Boolean. Any group or company that works with biological or health data is familiar with how quickly diagnosis criteria can change. You shouldn’t assume the thresholds you work with will last forever, especially if you’re doing anything with Bayesian machine learning.
Your decision boundaries in your online-learning algorithm... 12 minutes after deployment.
Best Practice #18: Monitor the models’ real-world performance and decision boundaries constantly.
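As a sketch of what “constantly” could look like in code, you can track the positive-decision rate around your threshold over a rolling window and flag drift. The window size, expected rate, and tolerance below are placeholder assumptions, not recommendations from the Tech Debt Paper:

```python
from collections import deque


class ThresholdMonitor:
    """Tracks the positive-decision rate of a thresholded model over a rolling window."""

    def __init__(self, threshold=0.5, window=1000, expected_rate=0.1, tolerance=0.05):
        self.threshold = threshold
        self.expected_rate = expected_rate
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Apply the decision threshold and remember the outcome."""
        decision = score >= self.threshold
        self.recent.append(decision)
        return decision

    def drifted(self) -> bool:
        """True if the recent positive rate has wandered outside the expected band."""
        if not self.recent:
            return False
        rate = sum(self.recent) / len(self.recent)
        return abs(rate - self.expected_rate) > self.tolerance
```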
The section stresses the importance of real-time monitoring; I can definitely get behind this. As for which things to monitor, the paper isn’t a comprehensive guide, but the authors give a few examples. One is to compare the summary statistics of your predicted labels with the summary statistics of the observed labels. It’s not foolproof, but it’s like checking a small animal’s weight: if something’s very wrong there, it can alert you to a separate problem very quickly.
Far too many tools to count. Everyone and their mother is making a monitoring startup these days. Here are SageMaker and Weights & Biases as examples, since they're slightly less buggy than the others.
Best Practice #19: Make sure the distribution of predicted labels is similar to the distribution of observed labels.
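Here’s one hedged way to implement that check with plain NumPy. The tolerance is an assumption you’d tune per problem, and a proper statistical test could replace the simple frequency comparison:

```python
import numpy as np


def label_distributions_match(predicted, observed, tolerance=0.05) -> bool:
    """Rough check that predicted and observed class frequencies roughly agree."""
    predicted = np.asarray(predicted)
    observed = np.asarray(observed)
    classes = np.union1d(predicted, observed)
    pred_freq = np.array([(predicted == c).mean() for c in classes])
    obs_freq = np.array([(observed == c).mean() for c in classes])
    return bool(np.all(np.abs(pred_freq - obs_freq) <= tolerance))
```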
If your system is making any kind of real-world decisions, you probably want to put some kind of rate limiter on it. Even if your system is NOT being trusted with millions of dollars for bidding on stocks, even if it’s just to alert you that something’s not right with the cell culture incubators, you will regret not setting some kind of action limit per unit of time.
Some of your earliest headaches with production ML will be with systems with no rate limiters.
Best Practice #20: Put limits on real-world decisions that can be made by machine learning systems.
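A minimal sketch of such a limiter, assuming a simple sliding window; the limits are placeholders you’d adapt to your own system:

```python
import time
from collections import deque


class ActionLimiter:
    """Caps how many real-world actions a model can trigger per time window."""

    def __init__(self, max_actions=10, window_seconds=60.0):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def allow(self) -> bool:
        """Return True if another action fits in the budget, else refuse it."""
        now = time.monotonic()
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_actions:
            return False  # over budget: escalate to a human instead of acting
        self.timestamps.append(now)
        return True
```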
You also want to be mindful of any changes with upstream producers of the data your ML pipeline is consuming. For example, any company running machine learning on human blood or DNA samples obviously wants to make sure those samples are all collected with a standardized procedure. If a bunch of samples all come from a certain demographic, the company should make sure that won’t skew their analysis. If you’re doing some kind of single-cell sequencing on cultured human cells, you want to make sure you’re not confusing cancer cells dying because a drug is working with, say, an intern accidentally letting the cell culture dehydrate. The authors say you ideally want a system that can respond to these changes (e.g., logging, turning itself off, changing decision thresholds, or alerting a technician or whoever does repairs) even when humans aren’t available.
You laugh now, but you might be more at risk than you realize for making this kind of mistake.
Best Practice #21: Check assumptions behind input data.
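In code, that can be as blunt as a handful of assertions run on every incoming batch. The column names and bounds below are purely illustrative stand-ins for whatever your collection protocol actually guarantees:

```python
import pandas as pd


def check_input_assumptions(batch: pd.DataFrame) -> None:
    """Assert the upstream assumptions this (hypothetical) pipeline relies on."""
    assert not batch.empty, "empty batch from upstream producer"
    assert batch["read_count"].ge(0).all(), "negative read counts"
    assert batch["collection_site"].isin({"site_a", "site_b"}).all(), "unknown collection site"
    # Guard against one demographic silently dominating a batch.
    top_share = batch["demographic"].value_counts(normalize=True).iloc[0]
    assert top_share < 0.8, f"single demographic makes up {top_share:.0%} of the batch"
```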
Part 8: The weirdly meta section
The penultimate section of the Tech Debt Paper goes on to mention other areas. The authors previously mentioned failure of abstraction as a type of technical debt, and apparently, that extends to the authors not being able to fit all these technical debt types into the first 7 sections of the paper.
Sanity Checks
Moving on, it’s critically important to have sanity checks on the data. If you’re training a new model, you want to make sure it’s at least capable of overfitting to a single category in the data. If it’s not converging on anything, you might want to check that the data isn’t random noise before tuning those hyperparameters. The authors weren’t that specific, but I figured that was a good test to mention.
Best Practice #22: Make sure your data isn’t all noise and no signal by making sure your model is at least capable of overfitting.
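A quick sketch of that sanity check with tf.Keras. The tiny arrays are stand-ins for a small slice of your real data, and the architecture, learning rate, and accuracy threshold are arbitrary assumptions:

```python
import numpy as np
import tensorflow as tf

# Stand-in for a small slice of real data: 32 examples, 20 features, binary labels.
x_small = np.random.rand(32, 20).astype("float32")
y_small = np.random.randint(0, 2, size=(32,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
              loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(x_small, y_small, epochs=300, verbose=0)

# If the model can't even memorize 32 examples, suspect the data or the pipeline
# before reaching for the hyperparameter knobs.
assert history.history["accuracy"][-1] > 0.95, "model cannot overfit a tiny batch"
```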
Reproducibility
Reproducibility. I’m sure many of you on the research team have had a lot of encounters with this one. You’ve probably seen code without seed numbers, notebooks written out of order, and repositories without package versions. Since the Tech Debt Paper was written, a few groups have tried making reproducibility checklists. Here’s a pretty good one that was featured on Hacker News about 4 months ago.
This is a pretty great one that I use, and encourage teams I work with to use.
Best Practice #23: Use reproducibility checklists when releasing research code.
Industry people reconstructing code from academia know what I'm talking about.
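Beyond checklists, the cheapest habits are pinning seeds and recording versions. A minimal sketch, assuming a TensorFlow-based stack (swap in whichever frameworks you actually use):

```python
import random
import sys

import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary; the point is that it is recorded, not what it is


def set_seeds(seed: int = SEED) -> None:
    """Pin every RNG the pipeline touches so runs can be reproduced."""
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)


def log_environment() -> dict:
    """Record the versions a reader would need to rerun the code."""
    return {
        "python": sys.version,
        "numpy": np.__version__,
        "tensorflow": tf.__version__,
        "seed": SEED,
    }
```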
Process Management
Most of the types of technical debt discussed so far have referred to single machine learning models, but process management debt is what happens when you’re running tons of models at the same time and you don’t have any plan for stopping all of them from waiting around for the one laggard to finish. It’s important not to ignore the system-level smells, and this is where checking the runtimes of your models becomes extremely important. Machine learning engineering has at least gotten better at thinking about high-level system design since the Tech Debt Paper’s writing.
An example of the types of charts used in bottleneck identification.
Best Practice #24: Make a habit of checking and comparing runtimes for machine learning models.
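A small sketch of that habit: time each model in a batch job with a context manager and compare the results. The “models” here are placeholders for your actual training or scoring calls:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(model_name, log):
    """Record wall-clock runtime per model so the laggard is easy to spot."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log[model_name] = time.perf_counter() - start


# Placeholder "models"; in practice these would be your training or scoring jobs.
models = {"model_a": lambda: time.sleep(0.1), "model_b": lambda: time.sleep(0.3)}

runtimes = {}
for name, run in models.items():
    with timed(name, runtimes):
        run()

slowest = max(runtimes, key=runtimes.get)
print(f"Slowest model: {slowest} ({runtimes[slowest]:.2f}s)")
```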
Cultural Debt
Cultural debt is the really tricky type of debt. The authors point out that sometimes there’s a divide between research and engineering, and that it’s easier to encourage debt-correcting behavior in heterogeneous teams.
Personally, I’m not exactly a fan of that last part. I’ve witnessed many teams with individuals who end up reporting to both the engineering directors and the research director. Making a subset of the engineers report to two different branches without the authority to make needed changes is not a solution for technical debt. It’s a solution only insofar as a small subset of engineers takes the brunt of the technical debt. The end result is that such engineers usually end up with No Authority Gauntlet Syndrome (NAGS), burn out, and get fired by whichever manager had the least of their objectives fulfilled by the engineer, all while the most sympathetic managers are out at Burning Man. If heterogeneity helps, then it needs to be across the entire team.
Plus, I think the authors make some of the same mistakes many do when talking about team or company culture. Specifically, confusing culture with values. It’s really easy to list a few aspirational rules for a company or team and call them a culture. You don’t need an MBA to do that, but these are more values than actual culture. Culture is what people end up doing when they’re in situations that demand they choose between two competing values. This is what got Uber in so much trouble. Both competitiveness and honesty were part of their corporate values, but in the end, their culture demanded they emphasize competitiveness over everything else, even if that meant HR violating laws to keep absolute creeps at the company.
A vicious cycle.
The issue with tech debt is that it comes up in a similar situation. Yes, it’s easy to talk about how much you want maintainable code. But, if everyone’s racing for a deadline and writing documentation keeps getting shifted down in priority on the JIRA board, that debt is going to pile up despite your best efforts.
The various reasons for technical debt at the highest level.
Best Practice #25: Set aside regular, non-negotiable time for dealing with technical debt (whatever form it might take).
Part 9: A technical debt litmus test
It’s important to remember that the ‘debt’ part is just a metaphor. As much as the authors try to make it seem like something with more rigor, that’s all it is. Unlike most debts, machine learning technical debt is hard to measure. How fast your team is moving at any given time is usually a poor indicator of how much debt you have (despite what many fresh-out-of-college product managers seem to insist). Rather than a metric, the authors suggest 5 questions to ask yourself (paraphrased for clarity here):
- How long would it take to get an algorithm from an arbitrary NeurIPS paper running on your biggest data source?
- Which data dependencies touch the most (or fewest) parts of your code?
- How much can you predict the outcome of changing one part of your system?
- Is your ML model improvement system zero-sum or positive-sum?
- Do you even have documentation? Is there a lot of hand-holding through the ramping up process for new people?
Of course, since 2015, other articles and papers have tried coming up with more precise scoring mechanisms (like scoring rubrics). Some of these have the benefit of offering an at-a-glance score that, even if imprecise, will help you track technical debt over time. Also, there have been a ton of advancements in the interpretable ML tools that were extolled as a solution to some types of technical debt. With that in mind, I’m going to recommend “Interpretable Machine Learning” by Christoph Molnar (available online here) again.
The 25 Best Practices in one place
Here are all the Best Practices I mentioned throughout, in one spot. There are likely many more than this, but tools for fixing technical debt follow the Pareto Principle: 20% of the technical debt remedies can fix 80% of your problems.
- Use interpretability tools like SHAP values.
- Use explainable model types if possible.
- Always re-train downstream models.
- Set up access keys, directory permissions, and service-level-agreements.
- Use a data versioning tool.
- Drop unused files, extraneous correlated features, and maybe use a causal inference toolkit.
- Use any of the countless DevOps tools that track data dependencies.
- Check independence assumptions behind models (and work closely with security engineers).
- Use regular code-reviews (and/or use automatic code-sniffing tools).
- Repackage general-purpose dependencies into specific APIs.
- Get rid of Pipeline jungles with top-down redesign/reimplementation.
- Set regular checks and criteria for removing code, or put the code in a directory or on a disk far-removed from the business-critical stuff.
- Stay up-to-date on abstractions that are becoming more solidified with time.
- Use packages like Typing and Decimal, and don’t use ‘float32’ for all data objects.
- Don’t leave all works-in-progress in the same directory. Clean it up or toss it out.
- Make sure endpoints are accounted for, and use frameworks that have similar abstractions between languages.
- Make it so you can set your file paths, hyperparameters, layer type and layer order, and other settings from one location.
- Monitor the models’ real-world performance and decision boundaries constantly.
- Make sure the distribution of predicted labels is similar to the distribution of observed labels.
- Put limits on real-world decisions that can be made by machine learning systems.
- Check assumptions behind input data.
- Make sure your data isn’t all noise and no signal by making sure your model is at least capable of overfitting.
- Use reproducibility checklists when releasing research code.
- Make a habit of checking and comparing runtimes for machine learning models.
- Set aside regular, non-negotiable time for dealing with technical debt (whatever form it might take).