Demystifying Bad Science
Rigorous science is challenging and any study can be questioned. Deception is part of human nature and scientists are human, as are journalists and policymakers. We are too and must be careful not to trust a study just because we find it exciting, or because it comforts us or conforms to our beliefs.
Many widely-held scientific theories were later proven wrong, as this brief article shows. How can this happen?
First, science is still evolving, and our understanding of many basic phenomena remains far from complete. Another reason is that science - at least on our planet - is conducted by humans and we humans have many foibles. Biases of various sorts, funding conflicts, egos, and sheer incompetence are some of the very human things that can undermine any research.
Scientists sometimes get it right but those reporting it get it wrong. Few journalists have worked as scientists and most have had no more training in science than the majority of their readers or viewers. To be fair, though, many scientists themselves have had limited coursework in research methods and statistics, as I point out in Statistical Mistakes Even Scientists Make.
Peer review can sometimes resemble chum review and, moreover, some studies make the front page without having been peer reviewed at all. Few editors and reviewers of scientific publications are statisticians - there aren't that many statisticians and they have their day jobs. In "softer" fields standards are arguably even less rigorous.
Null Hypothesis Significance Testing (NHST) has been heavily criticized by statisticians over the years. Many of you will remember this from an introductory statistics course. While on the surface it may seem straightforward, NHST is widely misunderstood and misused.
The American Statistician has devoted a full open-access issue to this and associated topics. Put simply, one important concern is that findings with p values greater than .05 are less likely to be accepted for publication, or even submitted for publication, than those with statistically significant findings. This is known as publication bias or the file drawer problem.
However, "negative findings" are just as important as statistically significant results and many potentially important research results apparently never see the light of day. Since accumulation of knowledge is the essence of science, this is a serious problem which only recently has been getting the attention many statisticians have long felt it warranted. Statistical significance is not the same as decision significance, either.
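The file drawer problem described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true effect (0.2 standard deviations), the group size (20), the number of studies, and the function name simulate_file_drawer are all hypothetical, and each study is reduced to a two-sided z-test with a known standard deviation of 1.

```python
import math
import random
from statistics import mean

def simulate_file_drawer(true_effect=0.2, n_per_group=20, n_studies=5000, seed=1):
    """Compare the mean effect across all studies with the mean across only
    the studies that reach p < .05 (two-sided z-test, SD assumed to be 1)."""
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)  # standard error of a difference in means
    # Draw each study's observed effect from its sampling distribution.
    observed = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # Hypothetical publication rule: only |z| > 1.96 gets "published".
    published = [d for d in observed if abs(d / se) > 1.96]
    return mean(observed), mean(published), len(published) / n_studies

all_mean, published_mean, share_published = simulate_file_drawer()
print(f"mean effect, all studies:       {all_mean:.2f}")
print(f"mean effect, 'published' only:  {published_mean:.2f}")
print(f"share of studies 'published':   {share_published:.1%}")
```

Under these assumptions, the studies that clear the significance bar overstate the true effect several times over, because a small study can only reach significance when its estimate happens to be large.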
Another reason is that small sample studies are common in many fields. While a gigantic sample does not automatically imply that findings can be trusted, estimates of effect sizes are much more variable from study to study when samples are small. Conversely, trivial effect sizes with little clinical or business significance may be statistically significant when sample sizes are large and, in some instances, receive extensive publicity.
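Both halves of this point follow from the standard error of a difference in means, which shrinks with the square root of the sample size. The sketch below assumes two equal-sized groups, a known standard deviation of 1, and an observed difference of 0.02 standard deviations; those numbers and the function names are illustrative, not taken from any study.

```python
import math
from statistics import NormalDist

def standard_error(sd, n_per_group):
    """Standard error of a difference in means between two equal-sized groups."""
    return sd * math.sqrt(2 / n_per_group)

def two_sided_p(observed_diff, sd, n_per_group):
    """Two-sided p-value from a z-test (SD treated as known, for simplicity)."""
    z = observed_diff / standard_error(sd, n_per_group)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same trivial difference (0.02 SD) at two hypothetical sample sizes.
for n in (20, 100_000):
    print(f"n per group = {n:>7}: "
          f"SE = {standard_error(1.0, n):.4f}, "
          f"p = {two_sided_p(0.02, 1.0, n):.6f}")
```

With 20 per group the estimate is far too noisy to say anything about a difference this small; with 100,000 per group the same difference is highly significant, though no more important in practice.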
Non-experimental (observational) research is prevalent in many disciplines and in the era of big data seems to be experiencing a boom. While randomized experiments are not always feasible or ethical, this does not mean that non-experimental research is sufficient.
I summarize some of these issues in Propensity Scores: What they are and what they do and Meta-analysis and Marketing Research. Put simply, effect size estimates are generally more variable - less reliable - in non-experimental research. Back to publication bias again...
Thousands of studies are conducted each year around the world, which means there would be a lot of bad science even if standards were uniformly high. Science is hard. Here are a few things to watch out for.
Cross-sectional versus longitudinal data. Causes should precede their effects and, in observational studies, it is usually not possible to ascertain this ordering when data pertain to a single slice in time. However, when data are collected at more than one point in time, we can often confirm whether or not a hypothetical cause did in fact precede its hypothesized effect. Longitudinal data, in general, permit a broader range of analyses which can help us better understand how variables interrelate.
Ecological studies are often problematic since the unit of analysis is the group and, therefore, inferences cannot be made about individual study participants. Researchers often have no data at the individual level regarding exposure and disease.
Non-probability sampling. Inferential statistics assumes probability sampling. When the data are from convenience samples or other non-probability samples, it is difficult to know which population we can generalize the results to.
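WEIRD (Western, Educated, Industrialized, Rich, and Democratic) participants. Findings based largely on such samples may not generalize to other populations. In some fields, such as epidemiology and pharmacology, participants may not even be human. Generalizing from rats to humans requires many assumptions, for example.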
Linear no-threshold model (LNT). This is a highly simplistic type of dose-response model that has been severely criticized. In business, linear (straight line) relationships are often assumed between "dose" (e.g., aspects of customer experience) and "response" (e.g., overall satisfaction with a firm). This is often reasonable, but not always, and may make little sense in fields such as toxicology, where it is known that very small doses typically have no effect and, beyond a certain level, toxicity no longer increases with dose.
An inappropriate statistical model. This can be hard to detect, but it's fair to say questionable use of statistics is not uncommon in any discipline. An appropriate statistical model can also be used in inappropriate ways.
Inadequate covariate control. This is especially prevalent in observational research, in which potentially important background variables are not always adjusted for. In some studies, continuous variables such as age are grouped into broad categories, resulting in a loss of information. Therefore, it may be questionable to claim that the variable has been "controlled for."
Omitted variables. Important variables may not have been available or, for whatever reason, might have been left out of the analysis. Many studies are criticized on these grounds.
Failure to consider other explanations. Multiple causes might lead to the same effect, and the failure to consider rival explanations undermines a study's credibility. As with omitted variables, this may be accidental or intentional.
No corrections for multiple comparisons. Statistical tests on the same data are not independent of each other, and each additional test raises the chance of at least one false positive. If pairwise comparisons are made among four types of patients at the standard .05 alpha level, six tests will be necessary and the overall confidence level for the set of comparisons is roughly 74% (0.95 to the sixth power, treating the tests as independent for simplicity), not 95%.
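The arithmetic behind the four-group example can be sketched as follows. Treating the tests as independent is a simplification, and the Bonferroni adjustment shown is just one common correction, not one the text above prescribes; the function name is hypothetical.

```python
from math import comb

def familywise_summary(n_groups, alpha=0.05):
    """Number of pairwise tests, overall confidence level, and Bonferroni alpha.

    Assumes the tests are independent, which is only an approximation when
    they are run on the same data.
    """
    n_tests = comb(n_groups, 2)               # pairwise comparisons among groups
    overall_confidence = (1 - alpha) ** n_tests
    bonferroni_alpha = alpha / n_tests        # per-test alpha after correction
    return n_tests, overall_confidence, bonferroni_alpha

n_tests, confidence, per_test_alpha = familywise_summary(4)
print(f"{n_tests} tests, overall confidence {confidence:.3f}, "
      f"Bonferroni per-test alpha {per_test_alpha:.4f}")
```

For four groups this gives six tests and an overall confidence level of about 0.735. Under the Bonferroni approach, each test would be run at roughly the .0083 level to hold the overall error rate at or below .05.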
Use of surrogates as the dependent variable. Often, it's not possible to measure the outcome directly and researchers must rely on surrogates. An example in medical research would be the use of test results to indicate the presence of a particular disease. While the use of surrogates is not necessarily a flaw, in some studies it may be problematic.
No adjustments for measurement error. In most research, variables are measured with error. In some cases, such as personality assessments or aptitude measurement, the error can be substantial. In general, measurement error attenuates correlations, so the relationship between x and y may be stronger than it appears based on correlations or other measures of association. One kind of measurement error that plagues surveys is response style, for example, when a respondent tends to use the high end of the scale irrespective of what is being rated.
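The attenuation point can be made concrete with Spearman's classical formula, under which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The sketch below assumes purely random measurement error, and the reliabilities of 0.7 and 0.8 are hypothetical values chosen for illustration.

```python
import math

def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Correlation expected between two error-laden measures (Spearman's formula)."""
    return true_r * math.sqrt(reliability_x * reliability_y)

def disattenuated_correlation(observed_r, reliability_x, reliability_y):
    """Estimate of the true correlation after correcting for unreliability."""
    return observed_r / math.sqrt(reliability_x * reliability_y)

# Hypothetical example: a true correlation of 0.6 measured with
# reliabilities of 0.7 (x) and 0.8 (y).
observed = attenuated_correlation(0.6, 0.7, 0.8)
print(f"observed correlation:  {observed:.3f}")
print(f"corrected correlation: {disattenuated_correlation(observed, 0.7, 0.8):.3f}")
```

Here a true correlation of 0.6 would show up as roughly 0.45. The correction is only as good as the reliability estimates fed into it, and it does nothing about systematic errors such as response style.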
"Mine until you find" seems to be the motto of some researchers, and this is a manifestation of a particularly dangerous form of malpractice known as HARKing (Hypothesizing After the Results are Known). Stuff Happens elaborates a bit more on this complex subject.
Regression to the mean is a statistical phenomenon that can make natural variation in repeated data look like real change. It happens when unusually large or small measurements tend to be followed by measurements that are closer to the mean. This phenomenon can make it appear that an educational program or therapy, for example, was effective when, in actuality, it was not.
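A small simulation shows how this can happen with no intervention at all. Every number here is a hypothetical assumption: each person has a stable true score (mean 50, SD 10), each test adds independent noise (SD 10), and the "program" enrolls the lowest-scoring 10% on the first test but does nothing.

```python
import random
from statistics import mean

def simulate_regression_to_mean(n=10_000, seed=0):
    """Mean first and second test scores of the lowest 10% on the first test."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, 10) for _ in range(n)]
    # Two noisy measurements of the same unchanged true score.
    test1 = [t + rng.gauss(0, 10) for t in true_scores]
    test2 = [t + rng.gauss(0, 10) for t in true_scores]
    # Select the people who scored in the bottom 10% on the first test.
    cutoff = sorted(test1)[n // 10]
    selected = [i for i in range(n) if test1[i] < cutoff]
    return mean(test1[i] for i in selected), mean(test2[i] for i in selected)

before, after = simulate_regression_to_mean()
print(f"selected group, first test:  {before:.1f}")
print(f"selected group, second test: {after:.1f}")
```

The selected group scores noticeably higher the second time even though nothing was done to them, which is why a comparison group is needed before crediting a program with the improvement.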
"Millions affected." Headlines such as this may conceal tiny effect sizes that are statistically indistinguishable from zero. We need to consider the base sizes on which these sorts of frightening figures have been calculated.
Use of new and untested methodology. New is not always better and tried-and-true methodologies are normally more trustworthy than novel ones that have not yet been scrutinized by independent researchers and statisticians.
Conflicts with other research. A controversial finding may be a paradigm buster but may also indicate poor methodology or improper use of statistics.
Funding conflicts may undermine the credibility of a study, but accusations of funding conflicts may themselves have been funded in questionable ways.
The origin of the popular quote "lies, damned lies, and statistics" is uncertain, though it has been attributed to Mark Twain, Disraeli, and several others. However, regardless of its origin, it did not refer to the modern field of statistics, which was just beginning to emerge at the time. Most likely, it pertained to official figures, which is what statistics originally meant. Here are some ways to lie with statistics.
Establishing “truth” through repetition is a very common tactic and one Joseph Goebbels, a noted authority on deception, explicitly recommended. Few people scrutinize claims, and fewer yet will remember past predictions by the same person or organization that turned out to be badly wrong.
Straw man arguments and "rebuttals" are commonplace, as are ad hominem attacks, and both are especially useful for those who themselves have something to hide.
Generalizing from the exception and making rare events seem typical are also popular tactics. Confusing the possible with the plausible and the plausible with fact is a variation on this theme.
There is also cherry picking of data, models and previous research. One form of cherry picking is selecting only the section of a time series that supports one’s case. "Adjusting" data falls short of outright fabrication but is a related technique. Clever, if questionable, interpretations of data or statistical models are two more weapons of the unethical.
Computer simulations are sometimes misleadingly called experiments and simulated data subtly passed off as empirical data. Cooking up a computer model to "prove" one's theory is now easier than ever.
Societies are hierarchical by nature and humans are inclined to think dichotomously, thus authorities are often invoked and debates over scientific or policy issues frequently take on a good guys versus bad guys flavor. Misrepresenting what authorities really believe is also not unheard of.
We also struggle to move between frequencies, percentages, and proportions, and this is something we need to be very mindful of. For example, we may read that millions of people will be affected if policymakers do this or don’t do that. A close reading of the evidence cited, however, may reveal a very weak effect size whose confidence or credible interval overlaps with zero. Multiplying a tiny fraction by hundreds of millions or billions of people will yield a terrifying figure. Also, bear in mind that a “50% increase” could mean from .001 to .0015.
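The gap between relative change, absolute change, and headline counts is easy to check by hand or in a few lines of code. The sketch below treats .001 and .0015 as proportions and assumes a hypothetical population of 300 million; neither assumption comes from a real study.

```python
def describe_change(baseline, new, population):
    """Return the relative increase, the absolute increase, and the implied
    number of additional people affected in a population of the given size."""
    absolute_increase = new - baseline
    relative_increase = absolute_increase / baseline
    extra_people = absolute_increase * population
    return relative_increase, absolute_increase, extra_people

relative, absolute, extra = describe_change(0.001, 0.0015, 300_000_000)
print(f"relative increase: {relative:.0%}")
print(f"absolute increase: {absolute:.4f}")
print(f"additional people: {extra:,.0f}")
```

The same change can be reported as a 50% increase, as a rise of 0.0005 in the proportion, or as about 150,000 additional people. None of these figures says anything about how uncertain the underlying estimate is.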
I haven’t mentioned data visualizations, which can easily deceive us. Many people are led to believe that "random" means "evenly spread" when, in fact, evenly-distributed figures are very unlikely to be random. There is also what I call Whack-A-Mole, citing one dubious claim after another in rapid succession without responding to criticisms of any of them.
Statistical thinking, critical in science, does not come naturally to humans. No one is born a statistician, and educational curricula frequently shortchange statistics.
In summary, rigorous science is challenging and any study can be questioned. Deception is part of human nature and scientists are human, as are journalists and policymakers. We are too and must be careful not to trust a study just because we find it exciting, or because it comforts us or conforms to our beliefs.