90% of Today’s Code is Written to Prevent Failure, and That’s a Problem
Trying to anticipate and defend against every possible failure is the constant uphill battle that today’s engineers are up against. But it doesn’t have to be.
No data scientist or engineer wants to learn that their code failed. Your data may have arrived malformed. The database might have gone down. Or the computer running the code might have failed. There are countless reasons why code can fail to achieve its intended objective. Trying to anticipate and defend against these failures is the constant uphill battle that today’s engineers are up against. But it doesn’t have to be, if we change our perspective and our frameworks to say, “Failures will happen, and that’s okay.”
Positive vs. Negative Engineering
Productive code written to achieve an intended outcome can be categorized as positive engineering. Companies hire engineers with the aim of writing positive code and building solutions to meet business-critical and analytical objectives. Conversely, negative engineering is the defensive code written to provide preemptive guardrails against anticipated failures and ensure that the positive code actually runs.
Trying to anticipate the multitude of ways things can go wrong makes writing negative code both time-consuming and frustrating, with engineers reporting that they typically spend 90% of their time on negative or defensive issues alone. Unfortunately, this time detracts from overall productivity and from the primary objective, leaving just 10% for the positive solutions that engineers were hired to build.
Engineers may think that such a defensive approach is necessary to the success of their positive code and is helping to reduce the risks of failure. But as someone who has spent their entire career in risk management, I can say that effective risk mitigation doesn’t attempt to predict and hedge every possible result. Instead, one develops a framework with useful tools specific enough to tackle the task at hand, but generalized enough to handle the unknown. It’s about enabling a negative engineering framework where users can work with failure, rather than against it.
Insurance for Positive Engineering
I liken a negative engineering framework to home insurance. Purchasing home insurance won’t prevent your house from burning down or flooding, but it can significantly reduce the financial burden of the ramifications. Similarly, implementing a negative engineering framework won’t prevent failures from occurring, but it can reduce the time, headache and talent required to identify and correct the problem.
An effective framework encompasses proper instrumentation, observability, and orchestration of code. As insurance for your positive code, it lets you go beyond merely scheduling a process to knowing whether that process failed, so that you can ensure your solutions’ outcomes. Failed code may need only a simple retry, or it may require engineers to go in and quickly modify the code to mitigate failure cases. The problem may be, and often is, a seemingly trivial issue, but one that can cause a butterfly effect of problems if left unchecked. For instance, failures in mission-critical analytics pipelines may result in companies making strategic decisions from bad data.
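To make the "simple retry" case concrete, here is a minimal Python sketch of a wrapper that retries a task and records what happened, so that a failure becomes visible rather than silent. It is an illustration only: the names RunRecord, run_with_retries, and flaky_load are hypothetical and not taken from any particular tool.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RunRecord:
    """What this hypothetical framework remembers about one task run."""
    name: str
    state: str = "PENDING"  # becomes SUCCESS or FAILED
    attempts: int = 0
    errors: List[str] = field(default_factory=list)
    result: Optional[object] = None


def run_with_retries(name: str, fn: Callable[[], object],
                     max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> RunRecord:
    """Run fn, retrying on any exception, and record every attempt."""
    record = RunRecord(name=name)
    for attempt in range(1, max_attempts + 1):
        record.attempts = attempt
        try:
            record.result = fn()
            record.state = "SUCCESS"
            return record
        except Exception as exc:
            # Keep the error for later inspection instead of losing it.
            record.errors.append(f"attempt {attempt}: {exc!r}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    record.state = "FAILED"
    return record


# Assumed example task: fails twice, then succeeds on the third call.
calls = {"n": 0}


def flaky_load() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database unavailable")
    return "rows loaded"


record = run_with_retries("load", flaky_load)
print(record.state, record.attempts, len(record.errors))  # SUCCESS 3 2
```

The positive code (flaky_load) contains no defensive logic at all. The retry policy and the record of what went wrong live in the wrapper, which is the division of labor the insurance analogy describes.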
It’s important to note that a negative engineering framework takes more than the ability to capture error messages. Users need to set clear expectations of what successful execution of the positive code looks like. For example, if a process running on a computer suddenly crashes, the machine may not even have the opportunity to send an error message saying that the failure occurred. But by creating logic that lets the framework infer that a process failed when the expectation wasn’t met, engineers can derive defensive insights and take action on any deviation from the set objectives, even in the face of the unexpected.
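One way to picture this kind of inference is a deadline check: the framework records when a task started and how long it is allowed to take, and treats a missing completion signal as a failure. The Python sketch below is a simplified illustration under that assumption; Expectation, infer_state, and the state names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Expectation:
    """What success looks like: the task reports completion before a deadline."""
    task: str
    started_at: float                    # seconds on some shared clock
    deadline_seconds: float              # how long the task is allowed to run
    finished_at: Optional[float] = None  # set only if the task reports completion


def infer_state(exp: Expectation, now: float) -> str:
    """Infer a task's state without relying on the task to report its own failure."""
    if exp.finished_at is not None:
        return "SUCCESS"
    if now - exp.started_at > exp.deadline_seconds:
        # No error message ever arrived, but the expectation was not met.
        return "FAILED_INFERRED"
    return "RUNNING"


# A task that crashed silently at some point after t=0 and never reported back.
crashed = Expectation(task="nightly_report", started_at=0.0, deadline_seconds=60.0)
print(infer_state(crashed, now=30.0))   # RUNNING
print(infer_state(crashed, now=120.0))  # FAILED_INFERRED
```

The crashed process never sends anything. The failure is detected purely because the stated expectation went unmet.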
Task Failed Successfully
The fact is, no matter how much negative code you write or how many safeguards you try to put in place, failures are an inevitable part of the life of an engineer. Moreover, the catalyst for a failure may be outside of your control. Take, for instance, a real-life situation that happened with a data team at a high-growth startup. The team of five engineers was managing an analytics stack when their entire infrastructure failed, leaving them with reports full of errors.
Starting from their broken dashboard and working backward, the team discovered one error after another at nearly every layer of their stack. The team eventually found that each stage was passing its failure on to the subsequent stages, resulting in a chain reaction of failures at each step. After three days of investigative work, the team uncovered the source of their failures, and it was trivial: the credit card on file with a SaaS vendor had expired. Because the vendor’s API was integrated with an early stage of the team’s pipeline, the billing error created a domino effect of failures all the way down to the end-user dashboard.
Once they found the problem, the team was fortunately able to resolve it. All they needed was awareness of the root cause, and the solution was simple. Defensive, negative code wouldn’t have been able to prevent such a failure, given that the catalyst was external, but an insurance framework would have been able to mitigate its impact. A good workflow tool could identify the root failure and prevent the downstream tasks from executing at all, on the expectation that running them would only produce a wave of further failures. This visibility would have helped the five-member team whittle three days of investigation down to a matter of minutes and raised productivity dramatically.
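The idea of halting downstream work and pointing at the root failure can be sketched in a few lines of Python. This is a toy illustration, not how any specific workflow tool is implemented: run_pipeline, the task names, and the simulated vendor error are all assumptions made for the example, and tasks are assumed to be listed with upstream tasks first.

```python
from typing import Callable, Dict, List, Tuple


def run_pipeline(tasks: Dict[str, Callable[[], None]],
                 upstream: Dict[str, List[str]]) -> Dict[str, Tuple[str, str]]:
    """Run tasks in order, skipping any task whose upstream did not succeed."""
    results: Dict[str, Tuple[str, str]] = {}
    root: Dict[str, str] = {}  # task -> name of the task that originally failed
    for name, fn in tasks.items():
        blocked_by = [u for u in upstream.get(name, [])
                      if results[u][0] != "SUCCESS"]
        if blocked_by:
            # Do not run; carry the root cause forward instead of a new error.
            root[name] = root[blocked_by[0]]
            results[name] = ("SKIPPED", f"root failure: {root[name]}")
            continue
        try:
            fn()
            results[name] = ("SUCCESS", "")
        except Exception as exc:
            root[name] = name
            results[name] = ("FAILED", str(exc))
    return results


def fetch_vendor_data() -> None:
    # Simulated external failure, standing in for the expired credit card.
    raise PermissionError("vendor API rejected request: billing issue")


def transform() -> None:
    pass


def build_dashboard() -> None:
    pass


results = run_pipeline(
    tasks={"fetch_vendor_data": fetch_vendor_data,
           "transform": transform,
           "build_dashboard": build_dashboard},
    upstream={"transform": ["fetch_vendor_data"],
              "build_dashboard": ["transform"]},
)
for task, (state, detail) in results.items():
    print(task, state, detail)
```

Instead of three layers of unrelated-looking errors, the dashboard task reports that it was skipped because fetch_vendor_data failed, which is where the investigation should start.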
Productivity gains are perhaps the greatest benefit of a negative engineering framework, in which engineers come to accept that failures occur and have the tools to mitigate their impact. Rather than spending 90% of their time trying to prevent failures, they can dedicate more time to writing positive code that drives business outcomes. Even bringing the time spent on negative code down to 80% would raise positive engineering time from 10% to 20%, effectively doubling productive time. That’s an extraordinary gain from just a shift in mindset. Engineers can then be confident that their positive code runs as designed while they manage the unknown and all the risks it brings.