Accuracy Fallacy: The Media’s Coverage of AI Is Bogus

As with the gross exaggerations Stanford researchers broadcast about their infamous "AI gaydar" project, a prevalent "accuracy fallacy" runs through the media's coverage of AI. Find out how the press constantly misleads the public into believing that machine learning can reliably predict psychosis, heart attacks, sexuality, and much more.

By Eric Siegel, Machine Learning Week on December 6, 2019 in Accuracy, AI, Hype, Media

Headlines about machine learning promise godlike predictive power. Here are four examples:

Newsweek: “AI Can Tell If You're Gay: Artificial Intelligence Predicts Sexuality from One Photo with Startling Accuracy”
The Spectator: “Linguistic Analysis Can Accurately Predict Psychosis”
The Daily Mail: “AI-Powered Scans Can Identify People at Risk of a Fatal Heart Attack Almost a Decade in Advance... with 90% Accuracy”
The Next Web: “This Scary AI Has Learned How to Pick Out Criminals by Their Faces”

With articles like these, the press will have you believe that machine learning can reliably predict whether you're gay, whether you'll develop psychosis, whether you’ll have a heart attack, and whether you're a criminal – as well as other ambitious predictions such as when you'll die and whether your unpublished book will be a bestseller.

It's all a lie. Machine learning can’t confidently tell such things about each individual. In most cases, these things are simply too difficult to predict with certainty.

Here's how the lie works. Researchers report high "accuracy," but then later reveal – buried within the details of a technical paper – that they were actually misusing the word "accuracy" to mean a different, related measure of performance that is not nearly as impressive.

But the press runs with it. Time and again, this scheme succeeds in hoodwinking the media, a beast that all too often thrives on hyperbole. This time-honored tactic repeatedly generates flagrant publicity stunts that mislead.

Now, don't get me wrong – machine learning does deserve high praise. The ability to predict better than random guessing, even if not with high confidence for most cases, serves to improve all kinds of business and healthcare processes. That's pay dirt. And, in certain limited areas, machine learning can be highly accurate, such as for recognizing objects like traffic lights within photographs or recognizing the presence of certain diseases from medical images.

But many human behaviors defy reliable prediction. Predicting them is like trying to predict the weather many weeks in advance. There's no achieving high certainty. There's no magic crystal ball.

Stanford's "Gaydar" Doesn't Perform at Face Value

Take the hype surrounding Stanford University's infamous "gaydar" study. In its opening summary (the "abstract"), the 2018 report claims that the researchers' predictive model achieves 91% accuracy distinguishing gay and straight males from facial images. They reported lower performance for women, but I'm focusing here on their most highlighted and most often reported result. I'll also skip past the researchers' problematic assertion that their results support a hormonal theory of sexuality rather than only reflecting cultural trends in self-presentation.

Stimulus-response: This report of high accuracy inspired journalists to broadcast grossly exaggerated claims of predictive performance. One Newsweek article kicked off with, "AI can now tell whether you are gay or straight simply by analyzing a picture of your face." This resulting deceptive media coverage is to be expected. The researchers’ opening claim of 91% accuracy tacitly and inevitably conveys – to lay readers, non-technical journalists, and even casual technical readers – that the system can tell who's gay and who isn't and usually be correct about it.

But that assertion is more than overblown – it's patently false. The model can't confidently "tell" for any given photograph. Rather, what Stanford's model can actually do 91% of the time is much less remarkable: It can identify which of a pair of males is gay when it's already been established that one is and one is not.

This "pairing test" tells a good story, but it's a deceptive one. At first, it may sound like a reasonable indication of a predictive model's performance in general, since the test creates a level playing field where each case has 50/50 odds. And, indeed, the reported result does confirm that the model performs better than random guessing.

However, this result translates to low performance outside the research lab, where there's no contrived scenario presenting such pairings. In the real world, employing the model would require a tough trade-off. For example, you could tune the model to correctly identify two-thirds of all gay individuals, but that comes at a price: When it predicted someone to be gay, it would be wrong more than half of the time – most of its positive predictions would be false positives. And if you configure its settings so that it correctly identifies even more than two-thirds, the model will produce false positives at an even higher rate.

The reason for this is that one of the two categories is infrequent – in this case, gay individuals, which amount to about 7% of the general population of males (going by the stats the Stanford study cites). When one category is in the minority, that intrinsically makes it more challenging to predict.

Besides, accuracy isn't a useful benchmark here in the first place. A bedazzling accuracy of 93% is trivial to achieve, and meaningless: Simply classify everyone as straight. By doing so, you're correct 93% of the time, even though you fail to correctly distinguish anyone in the minority, the 7% who are gay. To improve upon this and correctly identify at least some of the minority cases requires some trade-offs, namely, the addition of false positives and a lower overall accuracy.
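To make that arithmetic concrete, here is a minimal Python sketch. The 7% prevalence and the two-thirds recall come from the discussion above; the 10% false positive rate is a hypothetical value chosen purely for illustration, not a figure from the Stanford study, and the function names are my own.

```python
def precision(recall, false_positive_rate, prevalence):
    """Share of positive predictions that turn out to be correct."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def accuracy(recall, false_positive_rate, prevalence):
    """Share of all cases, positive and negative, classified correctly."""
    return recall * prevalence + (1.0 - false_positive_rate) * (1.0 - prevalence)


PREVALENCE = 0.07    # minority share, per the figure cited above
RECALL = 2.0 / 3.0   # the operating point discussed above

# Baseline: label everyone as the majority class (recall 0, no false positives).
baseline_accuracy = accuracy(0.0, 0.0, PREVALENCE)

# False positive rate at which exactly half of the positive predictions are wrong.
break_even_fpr = RECALL * PREVALENCE / (1.0 - PREVALENCE)

# ASSUMPTION: a hypothetical false positive rate, for illustration only.
ASSUMED_FPR = 0.10
tuned_precision = precision(RECALL, ASSUMED_FPR, PREVALENCE)
tuned_accuracy = accuracy(RECALL, ASSUMED_FPR, PREVALENCE)

print(f"baseline accuracy (everyone 'straight'): {baseline_accuracy:.3f}")
print(f"break-even false positive rate:          {break_even_fpr:.3f}")
print(f"precision at assumed FPR:                {tuned_precision:.3f}")
print(f"accuracy at assumed FPR:                 {tuned_accuracy:.3f}")
```

Under these assumptions, any false positive rate above roughly 5% means most positive predictions are wrong, and the tuned model scores a lower overall accuracy than the do-nothing baseline.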

Now, the researchers did indeed report on a viable measure of predictive performance, called AUC – albeit mislabeled in their report as "accuracy." AUC (a.k.a., AUROC, Area Under the Receiver Operating Characteristic curve) is a single value that indicates to the researcher the extent of performance trade-offs their predictive model is capable of. The higher the AUC, the better the trade-off options made available by the predictive model.

The researchers faced two publicity challenges: How can you make something as technical as AUC sexy and at the same time sell your predictive model’s level of performance to journalists? No problem. As it turns out, the AUC is mathematically equal to the result you get running the pairing test. And so a 91% AUC can be explained with a story about distinguishing between pairs that sounds to many journalists like "high accuracy" – especially when the researchers commit the cardinal sin of just baldly – and falsely – calling it "accuracy." Voila! Both the journalists and their readers believe the model can "tell" whether you're gay or straight.
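You can check that equivalence yourself. The sketch below runs on synthetic scores (two made-up Gaussian distributions, not data from any study) and computes the AUC two ways: by the pairing test and by measuring the area under the ROC curve directly. The function names are illustrative.

```python
import random


def pairing_test(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive case
    gets the higher score; a tie counts as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve by the trapezoid rule, sweeping the
    decision threshold from the highest score down to the lowest."""
    labeled = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    labeled.sort(key=lambda pair: pair[0], reverse=True)
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(labeled):
        threshold = labeled[i][0]
        # Admit every case tied at this score before adding a ROC point.
        while i < len(labeled) and labeled[i][0] == threshold:
            if labeled[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return area


# Synthetic model scores: positives tend to score higher than negatives.
rng = random.Random(0)
pos = [rng.gauss(1.0, 1.0) for _ in range(300)]
neg = [rng.gauss(0.0, 1.0) for _ in range(700)]

auc_pairs = pairing_test(pos, neg)
auc_curve = roc_auc(pos, neg)
print(f"pairing test: {auc_pairs:.4f}   area under ROC: {auc_curve:.4f}")
```

The two numbers agree up to floating-point rounding. Neither one says how often the model is right about an individual case.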

Breaking News: Psychotic Breaks Are Still Mostly Unpredictable

"Machine learning algorithms can help psychologists predict, with 90% accuracy, the onset of psychosis by analyzing a patient's conversations." Thus opens an article in The Register (U.K.) eagerly covering an overzealous report out of Emory and Harvard Universities. Enshrined with the credibility of a publication in Nature, the researchers have the press believing their predictive model can confidently foretell who will develop psychosis and who won’t.

In this case, the researchers perpetrate a variation on the "accuracy fallacy" scheme: They report the classification accuracy you would get if half the cases were positive – that is, in a world in which 50% of the patients will eventually be diagnosed with psychosis. There's a word for measuring accuracy in this way: cheating. Mathematically, this usually inflates the reported "accuracy" a bit less than the pairing test, but it's a similar maneuver and far overstates performance in much the same way.

The Emory/Harvard report oversells magnificently. It features a tantalizing "90% accuracy" in its opening summary while omitting two other key qualifications needed in order to set a meaningful context. First, what is the expected proportion that will develop psychosis? That is, how often is the predictive model expected to field positive cases in its intended deployment outside the lab? That fundamental is undisclosed. However, in trying to ascertain this, a persistent reader can chase citations and ultimately infer that the model is designed not for the general population, but rather only for patients who are help-seeking and therefore presumably at a somewhat higher risk. While the general population only exhibits a 3% rate of psychotic disorders, one of the samples included in this study (the training data) exhibited a 23% rate. If that's the standard, it is still a good distance from the 50% over which they report performance. Second, their main result of 90% accuracy was established over a remarkably small number of cases: A sample of only 10 patients.
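Both omissions can be roughly quantified. The sketch below rests on two assumptions that are mine, not the paper's: that "90% over 10 patients" means 9 of 10 cases classified correctly, and that the model is right 90% of the time on positive and negative cases alike. It computes a standard Wilson 95% confidence interval for the small sample, then the share of positive predictions that would be correct at each of the three prevalence levels mentioned above.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    return center - half, center + half


def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive predictions that are correct at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# ASSUMPTION: 90% accuracy over 10 patients read as 9 correct out of 10.
low, high = wilson_interval(9, 10)
print(f"95% interval for 9/10 correct: {low:.2f} to {high:.2f}")

# ASSUMPTION: 90% sensitivity and 90% specificity, for illustration only.
ppv_by_prevalence = {
    prevalence: positive_predictive_value(0.9, 0.9, prevalence)
    for prevalence in (0.50, 0.23, 0.03)
}
for prevalence, ppv in ppv_by_prevalence.items():
    print(f"prevalence {prevalence:.0%}: {ppv:.0%} of positive predictions correct")
```

With so few patients, the plausible range for the true figure runs from about 60% to about 98%. And under the stated assumptions, a positive prediction is right far less often at a 23% or 3% base rate than in the balanced 50% setting.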

Unfortunately, mental illness is still tough to predict and, no, machine learning is not on its way to solving psychiatry, contrary to the belief held by some psychiatrists that AI will take over their jobs. This predictive model faces much the same limitations and trade-offs as the one that predicts sexual orientation. It will not be able to predict psychotic onsets without incurring many false positives. And, as before, "accuracy" isn't even a pertinent benchmark for judging predictive performance.

Accuracy: A Word So Often Used Inaccurately

The list goes on and on, with many more examples of overblown claims about machine learning that perpetrate the "accuracy fallacy."

Criminality. The Global Times (China) ran the headline, "Professor Claims AI Can Spot Criminals by Looking at Photos 90% of the Time." Also reporting on this work, in which a model predicts criminality based on facial features, MIT Technology Review and The Telegraph (U.K.) each repeated the 90% accuracy claim. But the masses have been misled; throughout their original publication, the researchers use "accuracy" to actually mean AUC.

Death. One headline claimed, "Google AI Predicts Hospital Inpatient Death Risks with 95% Accuracy." Google researchers published this result in Nature, leading the press astray by leaving it implied within the summary that AUC is a way to measure accuracy.

Suicide. The press reported on a model "that predicted suicide risk, using electronic health records, with 84 to 92% accuracy within one week of a suicide event." The Vanderbilt University researchers pulled the same maneuver as Google, leaving it implied within the summary of their research publication that AUC is equivalent to accuracy.

Bestselling books. Beyond predicting the health and behavior of humans, machine learning predicts the sales of books. What if publishers could decide whether to green light each unpublished manuscript by knowing beforehand whether it would very likely go on to become a bestseller? Spoiler: They can't. However, in the book, "The Bestseller Code: Anatomy of the Blockbuster Novel," the authors claim they've "written an algorithm that can tell whether a manuscript will hit the New York Times bestseller list with 80% accuracy," as The Guardian (U.K.) put it. The Wall Street Journal and The Independent (U.K.) also reported this level of accuracy. However, the authors conveniently established this accuracy level over a manufactured test set of books that were half bestsellers and half not bestsellers. Since in reality only one in 200 of the books included in this study were destined to become bestsellers, it turns out that a manuscript predicted by the model as a "future bestseller" actually has less than a 2% probability of becoming one.
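The "less than 2%" figure follows from simple counting. Here is a sketch assuming the 80% accuracy on the balanced test set breaks down as 80% of bestsellers flagged and 80% of non-bestsellers correctly passed over. That split is my assumption, since only the overall figure is quoted above, and the pool of 100,000 manuscripts is an arbitrary round number.

```python
MANUSCRIPTS = 100_000      # arbitrary pool size, for counting purposes
BESTSELLER_RATE = 1 / 200  # one in 200, as stated above

# ASSUMPTION: the 80% balanced-set accuracy splits evenly across both classes.
SENSITIVITY = 0.80  # share of true bestsellers the model flags
SPECIFICITY = 0.80  # share of non-bestsellers the model correctly passes over

bestsellers = MANUSCRIPTS * BESTSELLER_RATE
others = MANUSCRIPTS - bestsellers

true_alarms = SENSITIVITY * bestsellers        # bestsellers flagged
false_alarms = (1.0 - SPECIFICITY) * others    # duds flagged anyway

# Accuracy on a half-and-half test set is the average of the two rates.
balanced_accuracy = (SENSITIVITY + SPECIFICITY) / 2.0

# Among everything flagged as a "future bestseller", how much really is one?
share_real = true_alarms / (true_alarms + false_alarms)

print(f"accuracy on a balanced test set: {balanced_accuracy:.0%}")
print(f"flagged manuscripts: {true_alarms + false_alarms:,.0f}")
print(f"share that become bestsellers: {share_real:.2%}")
```

Because non-bestsellers outnumber bestsellers 199 to 1, even a 20% error rate on them swamps the handful of genuine hits.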

And many more. The accuracy fallacy pervades, with researchers perpetrating it in reports on spotting legal issues in non-disclosure agreements, predicting which employees will quit (IBM claims 95% accuracy), classifying which news headlines are “clickbait”, detecting fraudulent dating profile scams, spotting cyberbullies, predicting the need for first responders after an earthquake, detecting diseases in banana crops, distinguishing high- and low-quality embryos for in vitro fertilization, predicting heart attacks, predicting heart issues by eye scan, detecting anxiety and depression in children, diagnosing brain tumors from medical images, detecting brain tumors with a new blood test, predicting the development of Alzheimer's, and more.

The Accuracy Fallacy

For a machine learning researcher seeking publicity, the accuracy fallacy scheme features some real advantages: excitement from the crowds and yet, perhaps, some plausible deniability of the intent to mislead. After all, if the research process is ultimately clear to an expert who reads the technical report in full, that expert is unlikely to complain that the word "accuracy" is used loosely on the first page but then technically clarified on later pages – especially since "accuracy" in non-technical contexts can more vaguely denote "degree of correctness."

But this crafty misuse of the word "accuracy" cannot stand. The deniability isn't really plausible. In the field of machine learning, accuracy unambiguously means, "how often the predictive model is correct – the percent of cases it gets right in its intended real world usage." When a researcher uses the word to mean anything else, they're at best adopting willful ignorance and at worst consciously laying a trap to ensnare the media. Frankly, the evidence points toward the latter verdict. Researchers dramatically misinform the public by using "accuracy" to mean AUC – or, similarly, by reporting accuracy over an artificially balanced test bed that's half positive examples and half negative without spelling out the severe limits of that performance measure right up front.

The accuracy fallacy plays an integral part in the harmful hyping of “AI” in general. By conveying unrealistic levels of performance, researchers exploit – and simultaneously feed into – the population's fear of awesome, yet fictional, powers held by machine learning (commonly calling it artificial intelligence instead). Making matters worse, machine learning is further oversold because artificial intelligence is "over-souled" by proselytizers – they credit it with its own volition and humanlike intelligence (thanks to Eric King of “The Modeling Agency” for that pun).

Some things are too hard to reliably predict. "Gaydar" as a popular notion refers to an unattainable form of human clairvoyance (especially when applied to still images). We shouldn’t expect machine learning to attain supernatural abilities either. For important, noteworthy classification problems, predictive models just can't “tell” with reliability. This challenge goes with the territory, since important things happen more rarely and are more difficult to predict, including bestselling books, criminality, psychosis, and death.

The responsibility falls first on the researcher to communicate to journalists unambiguously and without misleading them, and second on the journalists to make sure they actually understand the predictive proficiency about which they're reporting. But absent that, unfortunately, readers at large must hone a certain vigilance: Be wary about claims of "high accuracy" in machine learning. If it sounds too good to be true, it probably is.

A shorter version of this article was originally published by Scientific American.