The Myth of Model Interpretability

Deep networks are widely regarded as black boxes. But are they truly uninterpretable in any way that logistic regression is not?

Update: I have since refined these ideas in The Mythos of Model Interpretability, an academic paper presented at the 2016 ICML Workshop on Human Interpretability of Machine Learning.

There's a popular claim about the interpretability of machine learning models: Simple statistical models like logistic regression yield interpretable models. Neural networks, on the other hand, are black boxes. By this, it’s suggested that we can pass input in, and observe what comes out, but we lack the ability to reason about what happened in the middle.

To confirm the prevalence of this narrative, I ran a Google search for "neural network black box", yielding 2,410,000 results. By comparison, "logistic regression black box" turns up 600,000 results. Most of the latter articles mention logistic regression only to present it as the interpretable alternative to neural networks.

In a previous post, Deep Learning's Deep Flaws, I looked at a wave of criticism of deep networks in which they were questioned on account of their susceptibility to misclassifying adversarially chosen examples. While this is indeed a problem, I showed that simpler methods, like linear classifiers and decision trees, can be similarly susceptible.

In a similar spirit, I'd like to question the notion that real-world, large-scale linear classifiers are necessarily more interpretable than neural networks. At the outset, we should acknowledge that what precisely constitutes interpretability is a fraught topic. Abundant discrepancies in the literature suggest that the very definition of the word encompasses a number of related and unrelated concepts. In this light, an epistemological study of the area warrants philosophical consideration. In this post, however, I'll keep things simple and focus on the black-box notion.

Interpret What?

Let’s begin by considering a logistic regression classifier trained to diagnose diabetes in a patient, given thousands of input features. For a well-defined problem like this, any model under consideration, however black-box, predicts the diagnosis with some measurable accuracy. And still, even with known performance characteristics, some models may be considered interpretable while others may not. So clearly, interpretability, whatever we mean by it, and for whatever purpose we desire it, must be something other than accuracy.

One intuitive notion of an interpretation might be that it consists of some natural language explanation of a model's behavior. For example, for a positive prediction from the diabetes model, one interpretation might be "the model predicts you are likely to get diabetes because you have high sugar intake and large body mass index. However, the model gave you a slightly lower probability than it otherwise might have owing to your frequent exercise.”

Regarding this first notion of interpretability, we should question whether the interpretability of linear models may be overstated. To be sure, given the weights of a model, you could craft a process for extracting the components that made the largest contribution to a given prediction and summarize them with a verbal template. And for a small model with few features, this might convey a reasonable picture of what’s happening.
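Such a process might look something like the following minimal sketch. The weights, feature values, and feature names here are all made up for illustration; the point is only that the per-feature contribution to the score is simply weight times value:

```python
import numpy as np

# Hypothetical weights and one patient's (standardized) feature vector.
feature_names = ["sugar_intake", "body_mass_index", "exercise_hours", "age"]
weights = np.array([1.2, 0.9, -0.8, 0.1])
x = np.array([1.5, 1.1, 2.0, 0.3])

# Per-feature contribution to the log-odds of a positive prediction.
contributions = weights * x

# Rank features by the magnitude of their contribution and verbalize the top few.
order = np.argsort(-np.abs(contributions))
for i in order[:3]:
    direction = "raised" if contributions[i] > 0 else "lowered"
    print(f"{feature_names[i]} {direction} the score by {abs(contributions[i]):.2f}")
```

With four features this template reads like a sensible explanation; the argument below is about what happens when there are a million.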

But what if the model had a million features? What if the evidence for the prediction is spread out among all the features? Scenarios like this emerge frequently. Consider high-dimensional datasets for genomics research. If our explanation-generation process summed up the model's behavior by explaining a few prominent features, would this faithfully describe the model's behavior?

Another problem is that such an interpretation might explain the behavior of the model without giving deep insight into the causal associations in the underlying data. That’s because the weights a linear model assigns depend on which other features are available to it. This can be problematic if you expect to understand anything about the underlying reality simply by inspecting a model's weights.

For the diabetes model, you might think that if the model was trained properly, sugar intake must be positively correlated with diabetes and should thus get positive weight, and that exercise must be negatively correlated and should thus get negative weight. But given a much more predictive feature, say blood glucose levels, the model will shift weight to the better feature.

This can happen because features may be highly correlated, and thus one feature might only be useful in the absence of another. Features that intuitively ought to take positive weight may actually take negative weight in a high-dimensional model. These problems can be even harder to sort out when models are built on thousands or millions of features, a situation common in many problem domains.
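A small simulation makes this concrete. The sketch below (using scikit-learn on synthetic data, with made-up feature names) constructs a case where sugar intake is positively correlated with the diabetes label on its own, yet receives a negative weight once a more predictive, correlated feature, blood glucose, is in the model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Sugar intake tracks blood glucose, but glucose drives the outcome.
glucose = rng.normal(size=n)
sugar = glucose + 0.5 * rng.normal(size=n)

# True log-odds: 2*glucose - sugar. Sugar's conditional effect is negative,
# even though, marginally, sugar is positively correlated with the label.
logits = 2 * glucose - sugar
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression(C=1e6).fit(np.column_stack([sugar, glucose]), y)
sugar_weight = model.coef_[0][0]
marginal_corr = np.corrcoef(sugar, y)[0, 1]

print(f"marginal correlation of sugar with the label: {marginal_corr:+.2f}")
print(f"fitted weight on sugar:                       {sugar_weight:+.2f}")
```

Reading the negative weight on sugar as "sugar protects against diabetes" would be exactly the kind of dubious interpretation at issue.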

Deep learning methods don’t admit simple explanations by introspecting individual model parameters. Thus, they may be more difficult to understand when feature sets are small. But for large data sets with highly correlated features, it's not clear that they are less interpretable than linear models.

Another sense in which a model might be considered interpretable is that someone could reason about how the model was chosen from among all possible models in the family. In other words, you might not know how the exact model works inside and out, but you understand the learning algorithm that created it.

This might mean that someone could say conclusively whether the model was the best among all models in its class. It might also mean that you have some a priori understanding of the generalization properties of the model, that is, how sure you can be that it will generalize to examples you haven’t seen. In this sense, the optimization properties of linear models are indeed better understood, even for high-dimensional data.

Generally, in my experience, demands for interpretability come from outside the core machine learning community: from stakeholders when machine learning is used in practice. And typically, this kind of algorithmic analysis is not what they mean by interpretability. More commonly, end-users want to understand what's happening at the model level, not the algorithm level.

Verbal Interpretations of Predictions are Potentially Misleading

It’s worth considering that simple, verbal explanations of complex machine learning models can potentially mislead users. Take linear models again as an example. Looking at which features get high weight, absent broader context, might yield dubious interpretations. What if one weight is high, but it's for a feature that takes the same value 95% of the time? In this case, the weight may not be informative for explaining how decisions are made.
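One way to see this: a feature's contribution to any one prediction is weight times value, so a large weight on a nearly constant feature acts mostly like a bias term. A quick sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# A binary feature that takes the value 1 for roughly 95% of examples.
near_constant = (rng.random(n) < 0.95).astype(float)
w = 5.0  # a large weight sitting on this dull feature

# The feature's contribution (weight * value) to each example's score.
contribution = w * near_constant

# For the vast majority of examples the contribution is identical, so it
# rarely explains why one prediction differs from another.
print("share of examples with the same contribution:",
      (contribution == w).mean())
```

The weight is large, but it tells you almost nothing about why the model treated one patient differently from the next.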

For truly complex models, it may be unreasonable to account for their full dynamics with terse verbiage. And perhaps the difficulty of producing simple explanations shouldn’t be surprising. Sometimes, our very purpose in building complex models is so that they can express complex hypotheses that we might not have been able to specify concisely. We should question the proposition that it’s reasonable to expect a short summary to meaningfully explain a complex model. Even with linear models, we sometimes build models precisely so that they can learn from a greater number of features than any human could consciously account for. Sometimes it might not be possible to account for predictions in a way that is both simply expressed and sensible.

Let's return to the relative interpretability of linear and deep models. Adding further complexity, while deep learning approaches often act on raw or lightly processed features, linear models are often subject to extensive preprocessing. TF-IDF transformations, dimensionality reduction, and ad-hoc feature engineering change the meaning of each feature. But the purported superior interpretability of linear models owes to the one-to-one correspondence between model parameters and features. What happens to the interpretability of a linear model when the assigned weights depend on subtle differences in preprocessing strategy?

In these cases, the interpretability of linear models might be washed away by our inability to assign intuitive meaning to heavily-engineered features. When comparing linear models and neural networks, we might ask whether there's a trade-off between using simple models but uninterpretable features or using simple features but uninterpretable models.

We should also note that decision trees, often championed for their interpretability, can be similarly opaque. To get accuracy rivaling other approaches, typically hundreds or thousands of decision trees are combined together in an ensemble. If we want just a single decision tree, this may come at the expense of the model's accuracy. And even with one tree, if it grows too large, it might cease to be interpretable, much like high-dimensional linear models.

We might also consider some ways in which both linear models and neural networks are interpretable. They are both differentiable from top to bottom. Thus it is possible to trace the contribution of each input to the output (for a given example) and to show which inputs will move the output most if slightly increased or decreased. Note, however, that for a neural network, these interpretations are local. So we shouldn’t assume that they will necessarily be meaningful.
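To make the differentiability point concrete, here is a sketch (pure NumPy, with toy weights chosen arbitrarily) that computes the gradient of a tiny one-hidden-layer network's output with respect to its input, i.e., which inputs would move the output most if slightly nudged:

```python
import numpy as np

# A toy one-hidden-layer network: output = w2 . tanh(W1 x)
W1 = np.array([[0.5, -1.0, 0.3],
               [0.2, 0.8, -0.6]])
w2 = np.array([1.0, -0.7])

def forward(x):
    return w2 @ np.tanh(W1 @ x)

def input_gradient(x):
    # Backprop by hand: d(output)/dx = W1^T (w2 * tanh'(W1 x))
    h = W1 @ x
    return W1.T @ (w2 * (1 - np.tanh(h) ** 2))

x = np.array([0.1, 0.2, -0.3])
grad = input_gradient(x)

# The gradient is a local sensitivity: it says which small changes to this
# example would move its score most, not how the model behaves globally.
print("input gradient:", grad)
```

The same computation applies to a linear model, where the gradient is just the weight vector, constant everywhere; for the network it changes from example to example, which is precisely why such interpretations are only local.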

Interpretability concerns often present stumbling blocks in the medical domain, where practitioners are understandably reluctant to defer to a machine's judgment. At times, this reasoning is used to exclude powerful models owing to a perceived lack of interpretability. I’d caution that in some cases we may already rely on dubious interpretations, even from traditional approaches like logistic regression, decision trees, or support vector machines.

As a parting thought, I'd argue that sometimes we turn to machine learning, and not handcrafted decision rules, because for many problems, simple, easily understood decision processes are insufficient. In these cases, the hypotheses that we would like to discover, which might actually perform well for some tasks, may be more complicated than those which can be intuitively arrived at or explained. The desire for interpretability is often sincere and often justified. But we should always think critically. In each case we could ask: Why do we want interpretability? What notion of interpretability applies? What techniques might satisfy these goals? What are we willing to sacrifice?

Bio: Zachary Chase Lipton is a PhD student in the Computer Science Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs. He will be working for Amazon this summer as a machine learning scientist.