How Not To Lie With Statistics
Darrell Huff's classic How to Lie with Statistics is perhaps more relevant than ever. In this short article, I revisit this theme from some different angles.
"What is truth?" and "What is a lie?" are questions that have drawn the attention of philosophers, theologians, legal scholars and intellectuals of many kinds for centuries. I am not a scholar or intellectual, merely a hardhat statistician working in marketing research and what is vaguely called data science. Regardless of what we do for a living, however, all of us are consumers of statistics at work and in our daily lives. “Statistics” can refer to figures or mathematical models, and either can be used to deceive us, are often misinterpreted or can be flat out wrong.
Deception in various forms can be found in nature, and pet owners may have noticed that it is not exclusively a human trait. Besides outright lies, distortions and deceptions, there are also what have recently come to be called cognitive biases that have long been of concern to statisticians and the scientific community. Data and elaborate statistical models will not always win debates and it is not unusual for them simply to be dismissed. We can unintentionally deceive others or sucker ourselves, and many of the most important untruths are not deliberate deceptions.
Further to that point, I find many people, including statisticians, are being deceptive without realizing it. They may have done a little reading on a subject they are blogging about or presenting on, for example, and are sincerely conveying what they know, or think they know, about this subject. What they say may be very wrong, however. Sadly, there are others who, for whatever reasons, are unconcerned with the truth. They're slingin' it and they know it.
Like attorneys, statisticians are sometimes asked to lie. Seriously. Usually, the person making the request is coy and their motivation cloaked with comments about client needs or something else which seems plausible but, in effect, they are asking us to lie. Often, they have screwed up and are looking for ways to talk their way out of the jam they've talked themselves into. Whatever the motive, when the buck is passed to you, my advice to statisticians is to pass it right back.
There also are simple misunderstandings and miscommunications with important ramifications for decision makers. Statistics is difficult to explain in everyday language and this is often the cause, but miscommunications and misunderstandings also happen when non-statisticians use jargon to impress without understanding what it means. Statisticians can be fooled, too.
"It doesn't have to be perfect!" or "What you say may be true theoretically..." are indications that the person raising such objections does not grasp that what they regard as dweeby minutiae actually have serious implications for decision makers. Statisticians are still fighting the stereotype that we are geeks with no understanding of business. (Unfortunately, there is some truth to this stereotype.) Make sure to put your objections in writing if this is a potentially serious matter.
I could give pages of examples, as could anyone working in a specialized field or consulting capacity. Instead, let me propose a few simple guidelines on how not to lie with statistics by conveying inaccurate information inadvertently. Again, however innocent, even small misunderstandings and miscommunications can have profound consequences.
One cannot overemphasize how important it is for a statistician to have a clear understanding of the essential details of a project. It is not just the data that matter, but also who will be using the results of the research or analytics, for what purposes they will be used and the expectations of the person or persons who will be footing the bill. In repetitive projects such as tracking or analytics that have already been operationalized and just need a periodic "health check", this is less important, though it is critical when the project is being designed.
Present or report only the key findings and implications, and do this as simply as possible. If complex visualizations or videos will be shown to the end users, leave that up to the pros. This is not normally what statisticians are hired for. Again, this gets back to expectations and being specific about our deliverables. That said, serious misinterpretations may occur because of misunderstandings on the part of the person preparing the report or presentation, or because the statistician wasn't communicating clearly. I usually request a peek at the report or presentation before it's finalized if I'm not preparing it myself. Working internationally, as I do, I find even a quick summary in English over Skype or by email helps when the deliverable is in a language I do not speak.
Never try to show off your technical prowess, and avoid jargon. Otherwise you’ll probably only confuse or offend your clients and business associates who are not statisticians.
Make sure you know what you're talking about! Trying to learn about a statistical method through online searches and blogs can be very risky, even for those trained in statistics. Some people calling themselves data scientists or statisticians seem primarily interested in R code, which they can copy/paste and modify slightly for the task at hand. There is a ton of this code freely downloadable on the internet. These folks may not actually know what they are doing, though, and this all-too-common practice reflects an amateur programmer’s mentality, not that of a statistician or true data scientist.
The "everyone does it, so it must be OK" mindset seems especially widespread these days. Mark Twain had some thoughts regarding this and I will only suggest it is a bad habit for a statistician to get into. If "everyone does it" reflects a rare consensus among authentic statistical experts, that is a different matter, but not what I mean here.
In the classroom, statisticians are typically advised to seek the simplest possible solution. Occam's Razor is an extremely useful guideline, so I am not making a criticism here. The best analysts seem to have a gift for seeing what, in retrospect, seems obvious. This also holds for business people. Amazon is now part of the daily lives of billions of people around the world, but twenty years ago seemed like a wacky idea to many of us and doomed to fail.
Statistical models can also be too simple and mislead us that way. An experienced and competent statistician knows how to rule out irrelevant models and pick the one that is both robust in a technical sense and most useful to the decision makers. This is not easy, however, and cannot be done “by the numbers.” Occasional claims to the contrary, AI cannot yet do this and will never be able to until Artificial General Intelligence is a reality.
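To make the "too simple" point concrete, here is a minimal sketch. The data are made up and noise-free, and the helper fit_line is a hypothetical name of my own, so treat it as an illustration rather than a recipe. It fits a straight line to an outcome that depends on x only through x squared, then fits the same kind of line to the squared predictor.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor.

    Hypothetical helper for illustration; returns (slope, intercept, r_squared).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up data (an assumption): y is completely determined by x, but not linearly.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

# Model 1: a straight line in x, which is too simple for this relationship.
too_simple = fit_line(x, y)

# Model 2: a straight line in x squared, which matches how the data were built.
richer = fit_line([xi ** 2 for xi in x], y)

print("line in x:   slope=%.3f  R^2=%.3f" % (too_simple[0], too_simple[2]))
print("line in x^2: slope=%.3f  R^2=%.3f" % (richer[0], richer[2]))
```

The straight line in x reports a slope and R-squared of zero, which a hurried reader could take to mean "x does not matter", even though x determines y exactly. Real data are noisier and the right form is rarely this obvious, which is why choosing the model still takes judgment.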
Relying on automated or semi-automated procedures, however, is often the only feasible approach when the modeler is tasked with generating an enormous number of models that predict well enough for a narrow purpose - recommender systems large online retailers have deployed come to mind. This sort of mass-modeling characterizes quite a lot of data science. Mass-produced predictions are not all equally good, however, and not guaranteed to be profitable. By contrast, in marketing research, multivariate analysis is normally charged with uncovering “the why” and opaque predictive computer algorithms are less useful.
Returning to cognitive biases, as noted, statisticians and scientists generally have long known how easy it is for their worldview and egos to interfere with their intellects and learning. No one is invulnerable to these basic human frailties, including those who earn their living speaking and writing about them. Try to understand where you're coming from and be as objective as humanly possible.
To briefly sum up:
- Simple misunderstandings can be just as consequential as outright lies and are much more common. We also say things that are inaccurate without realizing it.
- If someone tries to twist your arm into saying something that is not true, or doing something that is clearly unethical, such as altering data, refuse as tactfully as you can. Put your objections in writing, if necessary.
- Be very careful about making assumptions. Do your homework. It's better to ask too many questions than too few. Sometimes I warn new clients and business partners upfront that I ask lots of questions, apologizing in advance so to speak.
- Communicate clearly and avoid statistical jargon. Never show off.
- Be sure you know what you're talking about! Similarly, don't assume that non-statisticians who use technical terms understand what they mean. Very often, they do not.
- Keep your analysis and deliverable as simple as possible...but not too simple. If you are not preparing the final deliverable - statisticians normally do not - make sure your own work is being correctly summarized and interpreted.
- Be wary of automated or semi-automated modeling. Sometimes it is the only option, though only in certain situations. It’s also important to remember that even automated models are not all the same. “…all models are wrong, but some are useful.”
- Just because "everyone does it" does not mean it's OK. Even professional statisticians can develop bad habits.
- Be on the lookout for cognitive biases, including your own! In the real world, logic and evidence lose more battles than they win, and sometimes we are our own worst enemies.
This has only been a snapshot of a large and sensitive topic, but I hope you’ve found it interesting and helpful!
The background photo is of Richard von Mises, one of the most influential statisticians of the 20th century and author of Probability, Statistics and Truth.
Bio: Kevin Gray is president of Cannon Gray, a marketing science and analytics consultancy.
Original. Reposted with permission.