How a Level System can Help Forecast AI Costs
To forecast costs for AI systems, it can be useful to talk about their “level” just like SAE has levels for self-driving cars. Adopting a level system can help organizations plan and prepare for AI systems that scale in complexity over time.
Designing and building AI systems is difficult. Unlike traditional software, where the majority of the costs are in the development process before the systems are deployed, most of the costs of AI systems occur after deployment. The behavior of AI systems is learned and can change after initial deployment. Machine learning models degrade over time without ongoing investment in data and hyperparameter tuning. And design decisions directly affect the ability to scale AI systems. A core part of this design difficulty is understanding how these systems change (or don’t change!) over time.
When building AI systems, it can be useful to talk about their “level”, just like SAE has levels for self-driving cars. Levels provide core breakpoints for how different AI systems can behave. Employing levels, and making trade-offs between them, gives teams a shorthand for how systems differ after deployment.
It’s critical to understand what kind of behavioral changes the system might undergo, and to factor that into the design of the system. The leveling framework below outlines core differences in how systems will change over time: we can use these when both designing a system and operating it. Different components can be at different levels; having an intuition of how they are different can help inform planning and execution.
System complexity is defined by the scope of its (a) inputs, (b) outputs, and (c) objectives.
In general, value increases as you move up levels (e.g. one goal might be to move a system operating at Level 1 up to Level 2), but the complexity and cost of building the system also increase. It can make a lot of sense to start a novel feature at a “low” level, where the system behavior is well understood, and progressively increase the level, because understanding the failure cases of the system becomes more difficult as the level increases.
The focus should be on learning about the problem and the solution space. Lower levels are more consistent and can be much better avenues to explore possible solutions than higher levels, whose cost and variability in performance can be large hindrances.
Levels of AI Systems provide breakpoints that dramatically affect system cost, in a progression as we move from traditional software (Level 0) up to fully Intelligent software (Level 4). Systems at Level 4 essentially maintain and improve on their own - they require negligible work from in-house development teams.
Moving up a level has trade-offs. For example, moving from Level 1 to Level 2 reduces ongoing data requirements and customization work, but introduces a self-reinforcing bias problem. Choosing to move up a level requires recognizing the new challenges and the actions needed to address them in the design of the AI system.
There are significant benefits in scalability (and typically performance/robustness/etc) in moving up levels. We should recognize the benefits, and costs; when we work on a project at level N, we should consider the work to get to N+1. We should target the level appropriate for what we are trying to achieve, and recognize when an existing AI system needs to be rebuilt to change levels.
Level 0: Deterministic
No required training data, no required testing data
Algorithms that involve no learning (e.g. adapting parameters to data) are at level zero.
The great benefit of level 0 (traditional algorithms in computer science) is that they are very reliable and, if you solve the problem, can be shown to be the optimal solution. If you can solve a problem at level 0, it’s hard to beat. In some respects, all algorithms - even classic ones like sorting or binary search - are "adaptive" to the data, yet we do not generally consider them to be "learning". Learning involves memory: the system changes how it behaves in the future based on what it has learned in the past.
However, some problems defy a pre-specified algorithmic solution. The downside of level 0 is that it is hard to perform well on problems too complex for a human to fully specify, whether because a single case is too intricate or because there are too many cases to enumerate (e.g. speech to text, translation, image recognition, utterance suggestion, etc.).
Examples:
- Luhn algorithm for credit card validation
- Regex-based systems (e.g. simple redaction systems for credit card numbers).
- Information retrieval algorithms like TF-IDF retrieval or BM25.
- Dictionary-based spell correction.
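To make the first example concrete, here is a minimal Python sketch of the Luhn checksum. The function name `luhn_is_valid` and the choice to ignore spaces and hyphens are illustrative assumptions. The point is that nothing here is fitted to data: the same input always yields the same answer, with no training or test set involved.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Spaces and hyphens are ignored (an assumption made for convenience).
    """
    cleaned = number.replace(" ", "").replace("-", "")
    if not cleaned or not (cleaned.isascii() and cleaned.isdigit()):
        return False

    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_is_valid("79927398713"))  # True
print(luhn_is_valid("79927398710"))  # False
```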
Note: In some cases, there can be a small number of parameters to tune. For example, Elasticsearch provides the ability to modify BM25 parameters. We can regard these as set-and-forget tuning parameters rather than learned ones, although the line between tuning and learning is blurry.
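As a rough illustration of what those tuning parameters do, the sketch below scores a single query term with one common variant of the BM25 formula. The function name and argument names are hypothetical. The defaults `k1=1.2` and `b=0.75` are widely used values, and other BM25 variants define the IDF term slightly differently.

```python
import math


def bm25_term_score(term_freq, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one query term against one document (one common BM25 variant).

    k1 controls how quickly repeated occurrences of a term saturate.
    b controls how strongly long documents are penalized (0 disables it).
    """
    # Rarer terms get a larger inverse document frequency weight.
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Documents longer than average get a larger normalization factor.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * term_freq * (k1 + 1) / (term_freq + k1 * length_norm)


# Same term frequency, but the longer document scores lower when b > 0.
short_doc = bm25_term_score(2, doc_len=50, avg_doc_len=100, num_docs=1000, docs_with_term=10)
long_doc = bm25_term_score(2, doc_len=500, avg_doc_len=100, num_docs=1000, docs_with_term=10)
print(short_doc > long_doc)  # True
```

Both parameters are chosen by a person and then left alone, which is why a system like this still sits at level 0.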
Level 1: Learned
Static training data, static testing data
Systems where you train the model in an offline setting and deploy to production with “frozen” weights. There may be an updating cadence to the model (e.g. adding more annotated data), but the environment the model operates in does not affect the model.
The benefit of level 1 is that you can learn and deploy any function at the modest cost of some training data. This is a great place to experiment with different types of solutions. And, for problems with common elements (e.g. speech recognition) you can benefit from diminishing marginal costs.
The downside is that the cost of customization is linear in the number of use cases: you need to curate training data for each one. Use cases can also change over time, so you need to keep adding annotations to preserve performance. This cost can be hard to bear.
Examples:
- Custom text classification models
- Speech to text (acoustic model)
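The "train offline, deploy frozen" pattern can be sketched in a few lines of Python. Everything here is a toy assumption: the word-count classifier, the names `FrozenModel` and `train_offline`, and the sample data are all hypothetical. The thing to notice is that `predict` only reads the model; production traffic never changes it.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from types import MappingProxyType
from typing import Iterable, Mapping, Tuple


@dataclass(frozen=True)
class FrozenModel:
    """Read-only word counts per label, fixed at training time."""
    word_counts: Mapping[str, Mapping[str, int]]  # label -> word -> count
    default_label: str

    def predict(self, text: str) -> str:
        words = text.lower().split()
        scores = {
            label: sum(counts.get(word, 0) for word in words)
            for label, counts in self.word_counts.items()
        }
        best_label = max(scores, key=scores.get)
        # Fall back to the most common training label if no word is known.
        return best_label if scores[best_label] > 0 else self.default_label


def train_offline(examples: Iterable[Tuple[str, str]]) -> FrozenModel:
    """Fit on annotated (text, label) pairs and return an immutable model."""
    counts = defaultdict(Counter)
    labels = Counter()
    for text, label in examples:
        labels[label] += 1
        counts[label].update(text.lower().split())
    if not labels:
        raise ValueError("need at least one training example")
    frozen = MappingProxyType(
        {label: MappingProxyType(dict(c)) for label, c in counts.items()}
    )
    return FrozenModel(frozen, labels.most_common(1)[0][0])


TRAINING_DATA = [
    ("reset my password", "account"),
    ("forgot password login", "account"),
    ("refund my order", "billing"),
    ("charge on my bill", "billing"),
]
model = train_offline(TRAINING_DATA)
print(model.predict("password help"))  # account
print(model.predict("refund please"))  # billing
```

Improving this model means collecting more annotations and calling `train_offline` again, which is exactly the recurring cost described above.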
Level 2: Self-learning
Dynamic + static training data, static testing data
Systems that use training data generated from the system for the model to improve. In some cases, the data generation is independent of the model (so we expect increasing model performance over time as more data is added); in other cases, the model intervening can reinforce model biases and performance can get worse over time. To guard against reinforcing biases, we need to evaluate new models on static (potentially annotated) data sets.
Level 2 is great because performance seems to improve over time for free. The downside is that, left unattended, the system can get worse - it may not be consistent in getting better with more data. The other limitation is that some systems at level two might have limited capacity to improve as they essentially feed on themselves (generating their own training data); addressing this bias can be challenging.
Examples:
- Naive spam filters
- Common speech to text models (language model)
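One way to picture the static evaluation step is as a promotion gate: a model retrained on system-generated data only replaces the current one if it does at least as well on a fixed test set. The sketch below is a minimal, hypothetical version of that idea. The names `choose_model` and `min_gain`, the toy models, and the test set are all assumptions for illustration.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]
Example = Tuple[str, str]  # (text, true_label)


def accuracy(model: Model, examples: Sequence[Example]) -> float:
    """Fraction of examples the model labels correctly."""
    if not examples:
        raise ValueError("static test set must not be empty")
    correct = sum(1 for text, label in examples if model(text) == label)
    return correct / len(examples)


def choose_model(current: Model, candidate: Model,
                 static_test_set: Sequence[Example],
                 min_gain: float = 0.0) -> Model:
    """Promote the candidate only if it is not worse on the static test set."""
    current_score = accuracy(current, static_test_set)
    candidate_score = accuracy(candidate, static_test_set)
    if candidate_score >= current_score + min_gain:
        return candidate
    return current


# A fixed, annotated test set that the system itself never generates.
STATIC_TEST_SET = [
    ("win money now", "spam"),
    ("lunch at noon", "ham"),
    ("free money", "spam"),
    ("meeting notes", "ham"),
]


def current_model(text: str) -> str:
    return "spam" if "money" in text else "ham"


def self_trained_model(text: str) -> str:
    # Stand-in for a retrained model whose bias was reinforced by its own data.
    return "spam"


deployed = choose_model(current_model, self_trained_model, STATIC_TEST_SET)
print(deployed is current_model)  # True
```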
Level 3: Autonomous (or self-correcting)
Dynamic training data, dynamic test data
Systems that both alter human behavior (e.g. recommend an action and let the user opt in) and learn directly from that behavior, including how the system's choices change user behavior. Moving from Level 2 to 3 potentially represents a big increase in system reliability and total achievable performance.
Level 3 is great because it can consistently get better over time. However, it is more complex: it might require truly staggering amounts of data, or a very carefully designed setup, to do better than simpler systems; its ability to adapt to the environment also makes it very hard to debug. It is also possible to have truly catastrophic feedback loops. For example, suppose a human corrects an email spam filter. Because the human can only ever correct misclassifications that the system made, the system learns that all its predictions are wrong and inverts them.
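The spam filter failure is easy to reproduce with made-up numbers. In the hypothetical sketch below, the filter is right 90% of the time, but the feedback log contains only user corrections, so a system that learns from the log alone sees itself as always wrong. The traffic counts and function names are invented for illustration.

```python
from typing import List, Tuple

Prediction = Tuple[str, str]  # (predicted_label, true_label)


def accuracy(pairs: List[Prediction]) -> float:
    """Fraction of predictions that match the true label."""
    if not pairs:
        raise ValueError("need at least one prediction")
    return sum(predicted == true for predicted, true in pairs) / len(pairs)


def corrections_only(pairs: List[Prediction]) -> List[Prediction]:
    """Keep only what a user can correct: the system's mistakes."""
    return [(predicted, true) for predicted, true in pairs if predicted != true]


# Invented traffic: 90 correct predictions and 10 mistakes.
traffic = (
    [("spam", "spam")] * 45
    + [("ham", "ham")] * 45
    + [("spam", "ham")] * 5
    + [("ham", "spam")] * 5
)
feedback_log = corrections_only(traffic)

print(accuracy(traffic))       # 0.9
print(accuracy(feedback_log))  # 0.0
```

The gap between the two numbers is the selection bias in the feedback, which is why the design of the feedback loop matters so much at this level.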
Level 4: Intelligent (or globally optimizing)
Dynamic training data, dynamic test data, dynamic goal
Systems that both dynamically interact with an environment and optimize globally (e.g. towards some set of downstream objectives), such as facilitating an agent while optimizing for AHT and CST, or optimizing directly for profit. For example, an AutoSuggest model that optimizes not for the next click (the current approach) but for the best series of clicks across the conversation.
Level 4 has awesome promise, but it is not always obvious how to get there, and unless carefully designed, these systems can optimize towards degenerate solutions. Aiming them at the right problem, shaping the reward, and auditing their behavior are large and non-trivial tasks.
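The difference between optimizing for the next click and optimizing for the whole conversation can be shown on a toy example. The conversation graph, its action names, and its reward values below are entirely hypothetical, and the exhaustive search assumes a small graph with no cycles. It is a sketch of the idea, not of how a production system would be trained.

```python
from typing import Dict, List, Tuple

# state -> list of (action, immediate_reward, next_state)
Graph = Dict[str, List[Tuple[str, float, str]]]

CONVERSATION: Graph = {
    "start": [
        ("quick_reply", 5.0, "dead_end"),
        ("clarifying_question", 1.0, "diagnosed"),
    ],
    "dead_end": [("apologize", 0.0, "end")],
    "diagnosed": [("resolve_issue", 10.0, "end")],
    "end": [],
}


def greedy_path(graph: Graph, state: str) -> Tuple[List[str], float]:
    """Always take the action with the best immediate reward."""
    path, total = [], 0.0
    while graph[state]:
        action, reward, state = max(graph[state], key=lambda step: step[1])
        path.append(action)
        total += reward
    return path, total


def best_path(graph: Graph, state: str) -> Tuple[List[str], float]:
    """Exhaustively find the action sequence with the best cumulative reward."""
    best: Tuple[List[str], float] = ([], 0.0)
    first = True
    for action, reward, next_state in graph[state]:
        tail, tail_total = best_path(graph, next_state)
        candidate = ([action] + tail, reward + tail_total)
        if first or candidate[1] > best[1]:
            best, first = candidate, False
    return best


print(greedy_path(CONVERSATION, "start"))  # (['quick_reply', 'apologize'], 5.0)
print(best_path(CONVERSATION, "start"))    # (['clarifying_question', 'resolve_issue'], 11.0)
```

The reward values do all the work here, which mirrors the point above: a system aimed at the wrong objective will happily find the best path to the wrong place.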
Appendix: Matrix Layout
We can visualize the levels as a matrix as well.
| Level | Description | Inputs | Outputs | Objectives |
|---|---|---|---|---|
| Level 0: Deterministic | Algorithms that involve no learning (e.g. no adapting parameters to data). | No training data. | General outputs. | No objective target. Metrics (for performance). |
| Level 1: Learned | Systems where model training is in an offline setting and the model is deployed to production with “frozen” weights. There may be an updating cadence to the model (e.g. adding more annotated data), but the data used to train the model is not generated directly by the system. | Static data, often annotated. | Simple output (simple function approximation). | Single objective, mapping from input data to output. |
| Level 2: Self-learning | Systems that use training data generated from the system for the model to improve, ideally where the data is stationary (so we expect increasing model performance over time as more data is added). | Retraining using new model inputs generated from the system. | Simple output, proximate to the input data. | Single objective, mapping from input data to output. |
| Level 3: Autonomous | Systems that both alter human behavior and learn directly from that behavior. Problems in this category often involve bandit learning paradigms such as exploration vs. exploitation. | System is retraining using new model input and explicit feedback on the system’s previous outputs. | Policy to update model outputs over time. | Cumulative objective(s), capturing how the model introduces bias and how people interact with the system. |
| Level 4: Intelligent | Systems that dynamically interact with an environment and optimize themselves towards downstream objectives, e.g. facilitating an agent while optimizing for AHT and CST. Problems in this category sometimes involve reinforcement learning paradigms. | System looks at downstream impact of model decisions and optimizes for entire system performance. | Policy to optimize the entire system. | System objectives, downstream from local decisions. |
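For teams that want to tag components with a level in code or in planning documents, the taxonomy can also be written down as a small data structure. The sketch below is one hypothetical encoding: the names `Level`, `DataRegime`, and `next_level` are assumptions, and the values simply restate the training and testing data regimes listed under each level heading.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    DETERMINISTIC = 0
    LEARNED = 1
    SELF_LEARNING = 2
    AUTONOMOUS = 3
    INTELLIGENT = 4


@dataclass(frozen=True)
class DataRegime:
    training_data: str
    testing_data: str
    dynamic_goal: bool


REGIMES = {
    Level.DETERMINISTIC: DataRegime("none required", "none required", False),
    Level.LEARNED: DataRegime("static", "static", False),
    Level.SELF_LEARNING: DataRegime("dynamic + static", "static", False),
    Level.AUTONOMOUS: DataRegime("dynamic", "dynamic", False),
    Level.INTELLIGENT: DataRegime("dynamic", "dynamic", True),
}


def next_level(level: Level) -> Optional[Level]:
    """Return level N+1, or None if the system is already at the top level."""
    if level >= Level.INTELLIGENT:
        return None
    return Level(level + 1)


component_level = Level.LEARNED
target = next_level(component_level)
print(target.name, REGIMES[target])
```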
Michael Griffiths is the Director of Data Science at ASAPP. He works to identify opportunities to improve the customer and agent experience. Prior to ASAPP, Michael spent time in advertising, ecommerce, and management consulting.