Bayesian Basics, Explained
This interview between Professor Andrew Gelman of Columbia University and marketing scientist Kevin Gray covers the basics of Bayesian statistics and how it differs from the ordinary statistics most of us learned in college.
Editor's note: The following is an interview with Columbia University Professor Andrew Gelman conducted by marketing scientist Kevin Gray, in which Gelman spells out the ABCs of Bayesian statistics.
Kevin Gray: Most marketing researchers have heard of Bayesian statistics but know little about it. Can you briefly explain in layperson's terms what it is and how it differs from the 'ordinary' statistics most of us learned in college?
Andrew Gelman: Bayesian statistics uses the mathematical rules of probability to combine data with “prior information” to give inferences which (if the model being used is correct) are more precise than would be obtained by either source of information alone.
Classical statistical methods avoid prior distributions. In classical statistics, you might include in your model a predictor (for example), or you might exclude it, or you might pool it as part of some larger set of predictors in order to get a more stable estimate. These are pretty much your only choices. In Bayesian inference you can (OK, you must) assign a prior distribution representing the set of values the coefficient can take. You can reproduce the classical methods using Bayesian inference: In a regression prediction context, setting the prior of a coefficient to uniform or “noninformative” is mathematically equivalent to including the corresponding predictor in a least squares or maximum likelihood estimate; setting the prior to a spike at zero is the same as excluding the predictor, and you can reproduce a pooling of predictors through a joint deterministic prior on their coefficients. But in Bayesian inference you can do much more: by setting what is called an “informative prior,” you can partially constrain a coefficient, setting a compromise between noisy least-squares estimation and completely setting it to zero. It turns out this is a powerful tool in many problems, especially because in problems with structure, we can fit so-called hierarchical models which allow us to estimate aspects of the prior distribution from data.
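Editor's illustration (not part of the interview): the three cases described above can be sketched for a single regression coefficient. This is a minimal sketch under stated assumptions: a one-predictor model with no intercept, a known noise standard deviation, and a normal prior on the coefficient. The function name `slope_posterior` and the data values are hypothetical.

```python
import math

def slope_posterior(x, y, noise_sd, prior_mean, prior_sd):
    """Posterior mean and sd for b in y = b * x + Normal(0, noise_sd) noise.

    Assumes a Normal(prior_mean, prior_sd) prior on b and a known noise_sd.
    prior_sd = math.inf is a flat prior; prior_sd = 0 is a spike at prior_mean.
    """
    if prior_sd == 0:
        return prior_mean, 0.0  # spike prior: the data cannot move the estimate
    # Precision (inverse variance) contributed by the data and by the prior.
    data_precision = sum(xi * xi for xi in x) / noise_sd**2
    prior_precision = 0.0 if math.isinf(prior_sd) else 1.0 / prior_sd**2
    weighted_sum = (
        sum(xi * yi for xi, yi in zip(x, y)) / noise_sd**2
        + prior_precision * prior_mean
    )
    post_precision = data_precision + prior_precision
    return weighted_sum / post_precision, 1.0 / math.sqrt(post_precision)

# Made-up illustrative data, not taken from the interview.
x = [1.0, 2.0, 3.0, 4.0]
y = [0.9, 2.3, 2.8, 4.4]

flat = slope_posterior(x, y, noise_sd=1.0, prior_mean=0.0, prior_sd=math.inf)
spike = slope_posterior(x, y, noise_sd=1.0, prior_mean=0.0, prior_sd=0.0)
informative = slope_posterior(x, y, noise_sd=1.0, prior_mean=0.0, prior_sd=0.5)

print("flat prior (matches least squares):", flat)
print("spike at zero (predictor excluded):", spike)
print("informative prior (partial constraint):", informative)
```

Under these assumptions the flat-prior estimate equals the least-squares slope, the spike prior returns exactly zero, and the informative prior lands between the two with a smaller posterior standard deviation than the flat-prior case.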
KG: Could you give us a quick overview of its history and how it has developed over the years?
AG: The theory of Bayesian inference originates with its namesake, Thomas Bayes, an 18th-century English cleric, but it really took off in the late 18th century with the work of the French mathematician and physicist Pierre-Simon Laplace. Bayesian methods were used for a long time after that to solve specific problems in science, but it was in the mid-20th century that they were proposed as a general statistical tool. Some key figures include John Maynard Keynes and Frank Ramsey, who in the 1920s developed an axiomatic theory of probability; Harold Jeffreys and Edwin Jaynes, who from the 1930s through the 1970s developed Bayesian methods for a variety of problems in the physical sciences; Jimmie Savage and Dennis Lindley, mathematicians who in research from the 1950s through the 1970s connected and contrasted Bayesian methods with classical statistics; and, not least, Alan Turing, who used Bayesian probability methods to crack the Enigma code in the Second World War, and his colleague I. J. Good, who explored and wrote prolifically about these ideas over the succeeding decades.
Within statistics, Bayesian and related methods have become gradually more popular over the past several decades, often developed in different applied fields, such as animal breeding in the 1950s, educational measurement in the 1960s and 1970s, spatial statistics in the 1980s, and marketing and political science in the 1990s. Eventually a sort of critical mass developed in which Bayesian models and methods that had been developed in different applied fields became recognized as more broadly useful.
Another factor that has fostered the spread of Bayesian methods is progress in computing speed and improved computing algorithms. Except in simple problems, Bayesian inference requires difficult mathematical calculations—high-dimensional integrals—which are often most practically computed using stochastic simulation, that is, computation using random numbers. This is the so-called Monte Carlo method, which was developed systematically by the mathematician Stanislaw Ulam and others when trying out designs for the hydrogen bomb in the 1940s and then rapidly picked up in the worlds of physics and chemistry. The potential for these methods to solve otherwise intractable statistics problems became apparent in the 1980s, and since then each decade has seen big jumps in the sophistication of algorithms, the capacity of computers to run these algorithms in real time, and the complexity of the statistical models that practitioners are now fitting to data.
Now, don’t get me wrong—computational and algorithmic advances have become hugely important in non-Bayesian statistical and machine learning methods as well. Bayesian inference has moved, along with statistics more generally, away from simple formulas toward simulation-based algorithms.
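Editor's illustration (not part of the interview): the idea of computing an integral with random numbers can be shown on a toy problem where the exact answer is known. This is a sketch of plain Monte Carlo, not of the more elaborate algorithms used for real models. It assumes hypothetical data of 7 successes in 10 trials and a uniform prior on the success rate, so the posterior is a Beta(8, 4) distribution that Python's standard `random` module can sample directly.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: 7 successes and 3 failures, uniform prior on the rate.
# The posterior is then Beta(1 + 7, 1 + 3) = Beta(8, 4).
successes, failures = 7, 3
n_draws = 100_000
draws = [rng.betavariate(1 + successes, 1 + failures) for _ in range(n_draws)]

# Posterior summaries are integrals; here they are approximated by averages
# over the random draws.
mc_mean = sum(draws) / n_draws
mc_prob_above_half = sum(d > 0.5 for d in draws) / n_draws

# In this toy case the posterior mean has a closed form, used as a check.
exact_mean = (1 + successes) / (2 + successes + failures)

print(f"Monte Carlo posterior mean: {mc_mean:.4f} (exact {exact_mean:.4f})")
print(f"Monte Carlo P(rate > 0.5): {mc_prob_above_half:.4f}")
```

The same averaging idea carries over to models with many parameters, where no closed form exists and the draws have to come from an iterative sampler rather than a built-in function.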
KG: What are its key strengths in comparison with Frequentist methods? Are there things that only Bayesian statistics can provide? What are its main drawbacks?
AG: I wouldn’t say there’s anything that only Bayesian statistics can provide. When Bayesian methods work best, it’s by providing a clear set of paths connecting data, mathematical/statistical models, and the substantive theory of the variation and comparison of interest. From this perspective, the greatest benefits of the Bayesian approach come not from default implementations, valuable as they can be in practice, but from the active process of model building, checking, and improvement. In classical statistics, improvements in methods often seem distressingly indirect: you try a new test that’s supposed to capture some subtle aspect of your data, or you restrict your parameters or smooth your weights, in some attempt to balance bias and variance. Under a Bayesian approach, all the tuning parameters are supposed to be interpretable in real-world terms, which implies, or should imply, that improvements in a Bayesian model come from, or supply, improvements in understanding of the underlying problem under study. The drawback of this Bayesian approach is that it can require a bit of a commitment to construction of a model that might be complicated, and you can end up putting effort into modeling aspects of data that maybe aren’t so relevant for your particular inquiry.
KG: Are there misunderstandings about Bayesian methods that you often encounter?
AG: Yes, but that’s a whole subject in itself—I’ve written papers on the topic! The only thing I’ll say here is that Bayesian methods are often characterized as “subjective” because the user must choose a “prior distribution,” that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.
KG: Do you think Bayesian methods will one day mostly replace Frequentist statistics?
AG: There’s room for lots of methods. What’s important in any case is what problems they can solve. We use the methods we already know and then learn something new when we need to go further. Bayesian methods offer a clarity that comes from the explicit specification of a so-called “generative model”: a probability model of the data-collection process and a probability model of the underlying parameters. But construction of these models can take work, and it makes sense to me that for problems where you have a simpler model that does the job, you just go with that.
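Editor's illustration (not part of the interview): a "generative model" in the sense described above can be written as a short simulation with two parts, one probability model for the parameter and one for the data given that parameter. This is a hypothetical sketch. The purchase-rate setting, the Beta(2, 8) prior, and the name `simulate_study` are assumptions chosen only for illustration.

```python
import random

def simulate_study(rng, n_customers=200):
    """Draw one fake dataset from a two-part generative model."""
    # Parameter model (prior): a purchase rate with prior mean 2 / (2 + 8) = 0.2.
    purchase_rate = rng.betavariate(2, 8)
    # Data model: each sampled customer independently buys (1) or does not (0).
    purchases = [int(rng.random() < purchase_rate) for _ in range(n_customers)]
    return purchase_rate, purchases

rng = random.Random(1)  # fixed seed so the run is repeatable
rate, purchases = simulate_study(rng)
share = sum(purchases) / len(purchases)
print(f"simulated rate {rate:.3f}, observed share {share:.3f}")
```

Running the simulation many times shows what kinds of data the model considers plausible before any real data arrive, which is one way to check whether the model is worth the construction effort.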
Looking at the comparison from the other direction, when it comes to big problems with streaming data, Bayesian methods are useful but the Bayesian computation can in practice only be approximate. And once you enter the zone of approximation, you can’t cleanly specify where the modeling approximation ends and the computing approximation begins. At that point, you need to evaluate any method, Bayesian or otherwise, by looking at what it does to the data, and the best available method for any particular problem might well be set up in a non-Bayesian way.