Interpretability over Accuracy

If researchers can’t understand a provided answer, it is not viable. They can’t write about techniques they don’t understand beyond “Here are the numbers. Look how pretty my model is.” Good research, that ain’t.

By Adam Edwards, Diary of a Data Scientist (Salford Systems).


The majority of my real-world statistical experience is collaborative work with researchers.  As a statistical collaborator, both at the Laboratory for Interdisciplinary Statistical Analysis (LISA) at Virginia Tech and at the Sokoine University of Agriculture Laboratory for Interdisciplinary Statistical Analysis (SUALISA) in Morogoro, I am sought out by researchers who need help with the analysis in their publications.  As a statistician, or a data scientist, the onus is on me to provide an answer that they can use.

“That they can use” are the important words in that imperative.  If researchers can’t understand a provided answer, it is not viable.  They can’t write about techniques they don’t understand beyond “Here are the numbers.  Look how pretty my model is.” Good research, that ain’t.

This concept favors relatively simple models that are easily explained and interpreted. My clients have even rejected Bayesian models (obviously widely accepted within statistics) because they didn’t know enough about the model to explain it in a publication.

Most traditional techniques provide models with very simple interpretations.  For a regular linear regression, a coefficient of -1.3 on X means that if you increase X by 1, we’d expect Y to decrease by 1.3 on average.  In real scenarios this translates to “If I increase a pollutant by one part per million, I’d expect average school size to decrease by 1.3 fish.” Obviously there is no such thing as three tenths of a fish, but the coefficients reflect averages.


One classical model with a somewhat convoluted interpretation really frustrates me as a collaborative statistician trying to provide intuitive solutions: logistic regression.

The problem for logistic regression (and probit as well) is that the coefficient is not directly related to the response.  “Logistic” refers to the logit, which is the log of the odds of the response (odds are equal to probability over one minus probability).  So the coefficient relates to a transformation of the probability of response, and because the function is not linear, interpretation depends on relative location on the number line.  An increase in the logit from -2.6 to -1.3 represents an increase in probability of 14.5 percentage points (from about 6.9% to 21.4%), whereas an increase from -1.3 to 0 represents an increase of about 28.6 percentage points (from 21.4% to 50%).  The same logit coefficient almost doubles the effect close to zero.
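The arithmetic in that paragraph is easy to check by hand. The following is a minimal Python sketch, not anything from the original analysis: the helper name `inv_logit` is my own, and the three logit values are simply the ones quoted above.

```python
import math


def inv_logit(z):
    """Convert a log-odds (logit) value into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# The three logit values used in the text: two equal steps of 1.3.
p_low = inv_logit(-2.6)
p_mid = inv_logit(-1.3)
p_high = inv_logit(0.0)

# Change in probability produced by each identical step on the logit scale.
first_step = p_mid - p_low
second_step = p_high - p_mid

print(f"probabilities: {p_low:.3f}, {p_mid:.3f}, {p_high:.3f}")
print(f"step from -2.6 to -1.3: {first_step:.3f}")
print(f"step from -1.3 to  0.0: {second_step:.3f}")
```

The two steps are the same size on the logit scale, yet the second moves the probability roughly twice as far as the first, which is exactly what makes the coefficient hard to explain to a client.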


Was that confusing?  Most of my clients think so, unless they already have a decent statistical background.  Generally, as a workaround, I reference the change with respect to the mean.  If 80% of fish survive on average, a 1 ppm increase in the pollutant with a coefficient of -1.3 equates to a drop of about 28 percentage points in survival rate (from 80% to roughly 52%).  Additional variables muddy the picture, because effects don’t compound additively on the response scale like they do on the logit scale.  This interpretation is passable at best.
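The "change with respect to the mean" workaround can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are hypothetical, the 80% baseline and the -1.3 coefficient are the numbers from the fish example, and the second 1 ppm step is my own addition to show that the drops do not add up on the probability scale.

```python
import math


def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Probability corresponding to a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))


def shifted_probability(baseline, coefficient, delta_x=1.0):
    """Probability after changing the predictor by delta_x, starting from baseline."""
    return inv_logit(logit(baseline) + coefficient * delta_x)


baseline = 0.80        # average survival rate from the example
coefficient = -1.3     # logit-scale effect of 1 ppm of pollutant

after_one = shifted_probability(baseline, coefficient, delta_x=1.0)
after_two = shifted_probability(baseline, coefficient, delta_x=2.0)

print(f"drop for the first ppm:  {baseline - after_one:.3f}")
print(f"drop for the second ppm: {after_one - after_two:.3f}")
```

The two drops differ even though the coefficient is the same, so a statement like "28 points per ppm" only holds near the baseline it was computed from.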

A model directly related to the scale of the response is a better solution.  I have suggested that clients use Classification and Regression Trees (CART) for the increased interpretability they provide.  Frequently, a simple decision tree (like CART) cannot model a process with enough smoothness to capture the trend, but if it is close, there are rules with easy interpretations, and predictions are the probabilities in the end nodes.
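To show what "predictions are the probabilities in the end nodes" means in practice, here is a small Python sketch. The leaf contents are invented for illustration and the function name `leaf_probabilities` is hypothetical; a real tree would fill each end node from the training data.

```python
from collections import Counter


def leaf_probabilities(labels):
    """Class shares among the training cases that landed in one end node."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in sorted(counts.items())}


# Hypothetical end node: 50 training cases, mostly class "A".
leaf_labels = ["A"] * 45 + ["B"] * 3 + ["C"] * 2

probs = leaf_probabilities(leaf_labels)
for cls, p in probs.items():
    print(f"P({cls}) = {p:.2f}")
```

Any new case that follows the same rules into this node gets these shares as its predicted probabilities, already on the scale of the response.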

To take this out of the realm of crazy fish rambling, and into the world of concrete (albeit contrived) examples, I turn to the UCI weightlifting dataset.  Imagine I, as a consultant, am visited by a weightlifter, who, like all weightlifters, has gathered extensive and precise data on the exact form of a lift. The data contain information on thousands of lifts performed on video with sensors on the weights recording the pitch, yaw, and roll (three axial twists) of the dumbbells, forearms, arms, and belt.  The response is a subjective classification of lift quality, on a scale from A (best) to E (worst), assessed on viewing the video.  Based only on information about twisting in three directions, my client wants to know how a lifter can improve form to obtain a better classification.


Figure 1 Part of the data

So what is the interpretation of a logistic model in this case?  Well for starters, without some second order terms (denoted by a 2 above), a logistic model can only tell you to increase or decrease twisting for improved performance.  A positive coefficient with forearm roll (regardless of magnitude) means you should twist inward as much as possible.  That negative coefficient with belt pitch means you should stick your butt out as far as you can.  The model also says you need to flare your elbows.  That’s great form.


Figure 2 Results of a quick and dirty second order logistic regression model

Even with some higher order terms (allowing for some ideal points to form in the surface), it is left up to the researcher to optimize the equation for best probability of an “A” class lift.  There might be multiple local maxima describing different, and very specific motions to achieve results. For the best probability of an A class lift (fabricated numbers incoming), I need arm yaw of 119.125 degrees, -16.213 forearm roll…
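For a single variable, the "ideal point" that a squared term creates can be worked out directly. The sketch below uses made-up coefficients, not values from the fitted model in Figure 2, and the function names are hypothetical. It covers only the one-variable case, where a logit of the form b0 + b1*x + b2*x^2 with negative b2 peaks at x = -b1 / (2*b2).

```python
import math


def inv_logit(z):
    """Probability corresponding to a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))


def quadratic_logit(x, b0, b1, b2):
    """Logit with a first- and second-order term in one predictor."""
    return b0 + b1 * x + b2 * x * x


def ideal_point(b1, b2):
    """Value of x that maximises the quadratic logit (requires b2 < 0)."""
    if b2 >= 0:
        raise ValueError("no interior maximum unless the squared-term coefficient is negative")
    return -b1 / (2.0 * b2)


# Hypothetical coefficients, chosen only to illustrate the calculation.
B0, B1, B2 = -2.0, 0.05, -0.0005

best_x = ideal_point(B1, B2)
best_p = inv_logit(quadratic_logit(best_x, B0, B1, B2))

print(f"ideal angle: {best_x:.1f} degrees")
print(f"probability of an A at that angle: {best_p:.3f}")
```

The result is a single exact angle per variable. With many variables and cross terms the researcher has to optimise the whole surface numerically, and the answer still comes out as one very specific point rather than a range.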


Figure 3 Part of the CART tree containing the best node

For the CART tree, the weightlifter can easily follow the nodes to the set of rules that yields the best outcome. Belt pitch greater than 16 degrees, forearm roll between -140 and 113 (holy cow I guessed those numbers before I built the model, and they are basically right).
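Rules like these translate into a check a lifter could apply directly. The Python sketch below encodes only the two conditions quoted above; the function name is hypothetical, the treatment of the boundary values is my assumption, and the real path to the best node may contain further splits that are not listed here.

```python
def in_best_node(belt_pitch, forearm_roll):
    """Return True if a lift satisfies the two rules quoted for the best node.

    Angles are in degrees. Whether the boundaries are inclusive is an
    assumption; the text only gives the thresholds.
    """
    return belt_pitch > 16 and -140 <= forearm_roll <= 113


# A few hypothetical lifts: (belt pitch, forearm roll)
lifts = [(20.0, 0.0), (10.0, 0.0), (20.0, 120.0)]
for belt_pitch, forearm_roll in lifts:
    verdict = "inside" if in_best_node(belt_pitch, forearm_roll) else "outside"
    print(f"belt pitch {belt_pitch}, forearm roll {forearm_roll}: {verdict} the best node")
```

Each rule is a range rather than a target value, so there is room for normal variation from lift to lift.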

My client is unlikely to do every lift with sensors activated, stopping after each to observe values and make necessary adjustments until such a time as all lifts are perfect.  That process would be as exhausting as reading about it.  A range of achievable values for good classification provides a more realistic solution, and a more useful answer, despite losing some accuracy.