How Bayesian Inference Works

Bayesian inference isn’t magic or mystical; the concepts behind it are completely accessible. In brief, Bayesian inference lets you draw stronger conclusions from your data by folding in what you already know about the answer.

Bayes’ theorem

So now we get to the part that we really care about. We want to answer the question “If we know someone has long hair, what is the probability that they are a woman (or a man)?” This is a conditional probability, P(man | long hair), but it is the reverse of the one we already found, P(long hair | man). Since conditional probabilities are not reversible, we can’t say anything about the new conditional probability yet.

Luckily Thomas Bayes noticed something cool that can help us.

Remembering how we calculated joint probabilities, we can write the equations for P(man with long hair) and P(long hair and man). Because joint probabilities are reversible, these two things are equal.
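Written out, the two joint probabilities might look like the following. This is a reconstruction in standard notation, assuming the usual product rule for joint probabilities; it is not a copy of the original figure.

```latex
\begin{align*}
P(\text{man with long hair}) &= P(\text{long hair}) \cdot P(\text{man} \mid \text{long hair}) \\
P(\text{long hair and man}) &= P(\text{man}) \cdot P(\text{long hair} \mid \text{man})
\end{align*}
```

Both lines describe the same group of people, men who have long hair, which is why the two right-hand sides can be set equal.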

With a little bit of algebra, we can solve for the thing we care about, P(man | long hair).
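As a sketch of that algebra, assuming the product rule for each joint probability: set the two expressions for the joint probability equal, then divide both sides by P(long hair).

```latex
\begin{align*}
P(\text{long hair}) \cdot P(\text{man} \mid \text{long hair})
  &= P(\text{man}) \cdot P(\text{long hair} \mid \text{man}) \\[4pt]
P(\text{man} \mid \text{long hair})
  &= \frac{P(\text{man}) \cdot P(\text{long hair} \mid \text{man})}{P(\text{long hair})}
\end{align*}
```

The division only makes sense when P(long hair) is greater than zero, that is, when at least some people in the line have long hair.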

Expressed in terms of A and B, instead of “man” and “long hair” we get Bayes’ Theorem.
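In its standard textbook form, with A standing in for “man” and B for “long hair,” the theorem reads:

```latex
\[
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}
\]
```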

Finally we are ready to go back and solve our movie ticket dilemma. We have Bayes’ Theorem applied to our problem.

First we need to expand our marginal probability, P(long hair).
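One way to write that expansion, assuming everyone in the line is counted as either a man or a woman so the two cases cover all possibilities:

```latex
\[
P(\text{long hair}) =
  P(\text{man}) \cdot P(\text{long hair} \mid \text{man})
  + P(\text{woman}) \cdot P(\text{long hair} \mid \text{woman})
\]
```

Each term is the joint probability of one gender together with long hair; adding them gives the overall chance of seeing long hair in the line.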

Then we can plug in our numbers and calculate the probability that someone is a man, given that they have long hair. For moviegoers in the men’s restroom line, P(man | long hair) is .8. This confirms our intuition that the ticket dropper is probably a man. Bayes’ Theorem has captured our intuition about the situation. Most importantly, it has incorporated our pre-existing knowledge that there are far more men than women in the men’s restroom line. Using this prior knowledge, it updated our beliefs about the situation.
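The arithmetic can be sketched in a few lines of Python. The inputs are not restated in this section, so the values below are assumptions chosen for illustration (a line that is 98% men, with 4% of men and 50% of women having long hair); they happen to reproduce the .8 result. The variable names are hypothetical too.

```python
# Assumed, illustrative inputs (not restated in this section of the article).
p_man = 0.98               # prior: share of men in the men's restroom line
p_woman = 0.02             # prior: share of women in the same line
p_long_given_man = 0.04    # assumed chance that a man has long hair
p_long_given_woman = 0.50  # assumed chance that a woman has long hair

# Marginal probability of long hair: sum over both genders.
p_long = p_man * p_long_given_man + p_woman * p_long_given_woman

# Bayes' Theorem: P(man | long hair).
p_man_given_long = p_man * p_long_given_man / p_long

print(round(p_man_given_long, 2))  # prints 0.8
```

With these assumed inputs the prior does most of the work: long hair is much more common among women, but there are so few women in this line that a long-haired person is still most likely a man.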

Probability distributions

Examples like the theater dilemma are good for explaining where Bayesian inference comes from and showing the mechanics in action. However, in data science applications it is most often used to interpret data. By pulling in prior knowledge about what we are measuring, we can draw stronger conclusions with small data sets. I’ll show how this works in detail, but first please bear with me for one more sidetrack. We need to get clear about what we mean by “probability distributions.”

You can think of probability as a pot of coffee that has exactly enough left to fill one cup. If there’s only one cup to fill there’s no problem, but if you have more than one you have to decide how to distribute the coffee between the cups. You can split it however you like, as long as you pour out all the coffee into one cup or the other. At the movie theater, one mug might represent women and the other, men.

Or we could use four mugs to represent the distribution of all combinations of gender and hair length. In both cases, the total amount of coffee adds up to one cup.
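To make the mugs concrete, here is a minimal sketch that stores each distribution as a dictionary and checks that the coffee adds up to one cup. The amounts are assumed for illustration (the same hypothetical 98/2 split and long-hair rates of 4% and 50%), and `is_one_cup` is a made-up helper name.

```python
import math

def is_one_cup(mugs):
    """Return True if the amounts in all mugs add up to exactly one cup."""
    return math.isclose(sum(mugs.values()), 1.0)

# Two mugs: belief split by gender (assumed amounts).
by_gender = {"man": 0.98, "woman": 0.02}

# Assumed chance of long hair for each gender.
p_long = {"man": 0.04, "woman": 0.50}

# Four mugs: every combination of gender and hair length.
by_gender_and_hair = {}
for gender, share in by_gender.items():
    by_gender_and_hair[(gender, "long")] = share * p_long[gender]
    by_gender_and_hair[(gender, "short")] = share * (1 - p_long[gender])

print(is_one_cup(by_gender), is_one_cup(by_gender_and_hair))
```

Splitting two mugs into four moves coffee around but never adds or removes any, so both checks pass.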

Usually, we set these mugs side by side and look at the amount of coffee in each as a histogram. It can be helpful to think of coffee as our belief, and its distribution shows how strongly we believe something to be the case.

If I flip a coin and hide the result from you, then your belief will be evenly split between heads and tails.

If I roll a die and hide the result from you, then your belief about the number on top will be evenly split between each of the six sides.
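Both of these are evenly split beliefs, which can be sketched with one small helper. The function name `uniform_belief` is hypothetical, and exact fractions are used so the shares add up to exactly one.

```python
from fractions import Fraction

def uniform_belief(outcomes):
    """Split one cup of belief evenly across all possible outcomes."""
    outcomes = list(outcomes)
    share = Fraction(1, len(outcomes))
    return {outcome: share for outcome in outcomes}

coin = uniform_belief(["heads", "tails"])  # each side gets 1/2
die = uniform_belief(range(1, 7))          # each face gets 1/6

print(coin["heads"], die[3], sum(die.values()))  # prints 1/2 1/6 1
```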

If I buy a Powerball ticket, your belief that it is a winner will probably be very close to zero. The coin flip, the die roll, and the Powerball outcome are each an example of measuring and collecting data.

Not surprisingly, you can also hold beliefs about other collected data. Consider the height of adults in the US. If I tell you I have met and measured someone, then your beliefs about their height might look like the picture above. This shows a belief that this person is probably between 150 and 200 cm, and most likely between 180 and 190 cm.

Distributions can be broken up into finer and finer bins. You can think of it as spreading less coffee across more cups to get a finer-grained set of beliefs.
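Here is a rough sketch of that idea. It pours a bell-shaped belief about height into 5 wide cups and then into 50 narrow ones. The bell shape, its center of 175 cm, and its spread of 10 cm are arbitrary placeholders and are not taken from the article's figure; `binned_belief` is a hypothetical helper.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def binned_belief(low, high, n_bins, mean, sd):
    """Split [low, high] into n_bins equal cups and pour belief into each."""
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins + 1)]
    raw = [normal_cdf(b, mean, sd) - normal_cdf(a, mean, sd)
           for a, b in zip(edges, edges[1:])]
    total = sum(raw)
    # Renormalize so the cups together hold exactly one cup of coffee.
    return [amount / total for amount in raw]

coarse = binned_belief(150, 200, 5, mean=175, sd=10)   # 10 cm per cup
fine = binned_belief(150, 200, 50, mean=175, sd=10)    # 1 cm per cup

print(len(coarse), len(fine))
print(max(fine) < max(coarse))  # each narrow cup holds less coffee
```

The total stays at one cup in both cases; only the amount per cup shrinks as the cups get narrower.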

Eventually the number of imaginary cups you need gets so large that the analogy breaks down. At that point the distribution is continuous. The math to work with it changes a bit, but the underlying idea is still useful. It still shows how your belief is allocated.

Thanks for your patience. Now with probability distributions described, we can use Bayes’ Theorem to interpret data. To illustrate this, we’ll weigh my dog.