# How Bayes' Theorem is Applied in Machine Learning

Learn how Bayes' Theorem is used in Machine Learning for classification and regression!

**By Jaime Zornoza, Universidad Politecnica de Madrid**

In the previous post we saw **what Bayes' Theorem is**, and went through an easy, intuitive example of how it works. You can find that post **here**. If you don't know what Bayes' Theorem is and have not had the pleasure of reading it yet, I recommend you do, as it will make understanding this article a lot easier.

In this post, we will see the **uses of this theorem in Machine Learning.**

*Ready? Let's go then!*

### Bayes' Theorem in Machine Learning

As mentioned in the previous post, Bayes' theorem tells us how to **gradually update our knowledge** of *something* as we get more evidence about that *something*.

Generally, in **Supervised Machine Learning**, when we want to train a model the **main building blocks** are a set of data points that contain **features** (the attributes that define such data points), **the labels** of such data points (the numeric or categorical tag which we later want to predict for new data points), and a **hypothesis function** or model that links the features with their corresponding labels. We also have a **loss function**, which measures the difference between the predictions of the model and the real labels, and which we want to reduce to achieve the best possible results.

These supervised Machine Learning problems can be divided into **two main categories: regression**, where we want to calculate a number or **numeric value** associated with some data (for example, the price of a house), and **classification**, where we want to assign the data point to a **certain category** (for example, saying whether an image shows a dog or a cat).

**Bayes' theorem can be used in both regression and classification.**

Let's see how!

**Bayes' Theorem in Regression**

Imagine we have a very **simple set of data**, which represents the **temperature of each day** of the year in a certain area of a town (the **feature** of the data points), and the **number of water bottles** sold by a local shop in that area every single day (the **label** of the data points).

By making a **very simple model**, we could **see if these two are related**, and if they are, then use this model to **make predictions** in order to stock up on water bottles depending on the temperature and never run out of stock, or avoid having too much inventory.

We could try a very simple **linear regression model** to see how these variables are related. This linear model can be written as *y = θ₀ + θ₁x*, where y is the target label (the number of water bottles in our example), **each of the θs is a parameter of the model** (the intercept with the y-axis and the slope), and x is our feature (the temperature in our example).

The goal of this training would be to **reduce the mentioned loss function**, so that the predictions the model makes for the known data points are close to the actual values of their labels.

After having trained the model with the available data we would get a value for both of the **θs**. This training can be performed using an **iterative process** like gradient descent or a **probabilistic method** like Maximum Likelihood. Either way, we would end up with just **ONE single value** for each of the parameters.
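As a minimal sketch of this classic, non-Bayesian approach, we can fit the linear model by ordinary least squares and get a single point estimate for each θ. The temperatures and sales figures below are hypothetical, made up only to illustrate the fit:

```python
import numpy as np

# Hypothetical daily data: temperature (°C) and water bottles sold.
temps = np.array([18.0, 22.0, 25.0, 30.0, 33.0, 35.0])
bottles = np.array([50.0, 61.0, 70.0, 85.0, 94.0, 99.0])

# Fit y = theta0 + theta1 * x by ordinary least squares:
# this yields ONE single value for each parameter.
X = np.column_stack([np.ones_like(temps), temps])
theta, *_ = np.linalg.lstsq(X, bottles, rcond=None)
theta0, theta1 = theta

# Predict the bottles needed for a new temperature forecast.
forecast = 28.0
prediction = theta0 + theta1 * forecast
```

Note that after training, `theta0` and `theta1` are fixed numbers with no notion of uncertainty attached to them; this is exactly what the Bayesian treatment below changes.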

In this manner, when we get **new data without a label** (new temperature forecasts), since we know the value of the **θs**, we can use this simple equation to obtain the wanted **ys** (the number of water bottles needed for each day).

When we use Bayes' theorem for regression, instead of **thinking of the parameters** (the θs) of the model as having a single, unique value, we represent them **as** parameters **having a certain distribution**: the prior distribution of the parameters. Applied to a machine learning model, the generic Bayes formula reads *P(model|data) = P(data|model) · P(model) / P(data)*.

The idea behind this is that **we have some previous knowledge of the parameters of the model** before we have any actual data: *P(model)* is this prior probability. Then, **when we get some new data, we update the distribution of the parameters** of the model, making it the posterior probability *P(model|data)*.

What this means is that **our parameter set** (the θs of our model) is not constant, but instead **has its own distribution**. Based on previous knowledge (from experts, for example, or from other works) **we make a first hypothesis** about the distribution of the parameters of our model. Then, as we train our models with **more data**, **this distribution gets updated** and grows more exact (in practice, the variance gets smaller).

This figure shows the **initial distribution of the parameters of the model *p(θ)***, and how, as we add more data, this distribution **gets updated**, growing more exact towards *p(θ|x)*, where x denotes this new data. The θ here is equivalent to the *model* in the formula shown above, and the *x* here is equivalent to the *data* in that formula.

Bayes' formula, as always, tells us **how to go from the prior to the posterior probabilities**. We do this in an iterative process as we get more and more data, having the **posterior probabilities become the prior probabilities for the next iteration**. Once we have trained the model with enough data, to choose the final set of parameters we would search for the **Maximum A Posteriori (MAP) estimate, obtaining a concrete set of values for the parameters of the model.**
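The iterative prior-to-posterior update can be sketched for the simplest possible case: a single Gaussian-distributed parameter θ (say, the slope of our model) observed through Gaussian noise of known variance. All the numbers below are hypothetical, and the conjugate-Gaussian update is a stand-in for the general machinery:

```python
# Minimal sketch: iterative Bayesian update of one parameter theta
# under a Gaussian prior, assuming Gaussian observation noise with
# known variance. All numbers here are hypothetical.
prior_mean, prior_var = 3.0, 4.0       # initial belief about theta
noise_var = 1.0                        # assumed known observation noise

observations = [2.4, 2.9, 2.6, 3.1]    # noisy measurements related to theta

mean, var = prior_mean, prior_var
for y in observations:
    # Conjugate Gaussian update: the posterior becomes the prior
    # for the next observation, and the variance shrinks each step.
    var_new = 1.0 / (1.0 / var + 1.0 / noise_var)
    mean = var_new * (mean / var + y / noise_var)
    var = var_new

# For a Gaussian posterior, the MAP estimate is simply its mean.
theta_map = mean
```

After the loop, `var` is much smaller than `prior_var`: exactly the "distribution grows more exact" behaviour described above, with `theta_map` as the concrete value we would finally use.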

This kind of analysis **gets its strength from the initial prior distribution**: if we do not have any previous information and can't make any assumptions about it, other probabilistic approaches like Maximum Likelihood are better suited.

However, **if we have some prior information about the distribution of the parameters, the Bayesian approach proves to be very powerful**, especially in the case of having **unreliable training data**. In this case, as we are not building the model and calculating its parameters from scratch using this data, but rather using some kind of previous knowledge to infer an initial distribution for these parameters, **this prior distribution makes the parameters more robust and less affected by inaccurate data.**

I don't want to get very technical in this part, but the maths behind all this reasoning is beautiful; if you want to know about it don't hesitate to email me at jaimezorno@gmail.com or **contact me** on LinkedIn.

**Bayes' Theorem in Classification**

We have seen how Bayes' theorem can be used for regression, by estimating the parameters of a linear model. The same reasoning could be applied to other kinds of regression algorithms.

Now we will see how to use Bayes' theorem for classification. This is known as the **Bayes optimal classifier**. The reasoning here is very similar to the previous one.

Imagine we have a classification problem with ***i* different classes**. What we are after here is **the class probability for each class *wi***. Like in the previous regression case, we also differentiate between prior and posterior probabilities: now we have the **prior class probabilities *p(wi)*** and the **posterior class probabilities *p(wi|x)***, after using the data or observations *x*.

Bayes' formula for each class reads *P(wi|x) = P(x|wi) · P(wi) / P(x)*. Here *P(x)* is the **density function common to all the data points**, *P(x|wi)* is the **density function of the data points belonging to class *wi***, and *P(wi)* is the prior distribution of class *wi*. *P(x|wi)* is calculated from the training data, assuming a certain distribution and computing a **mean vector for each class** and the **covariance of the features** of the data points belonging to that class. The prior class distributions *P(wi)* are estimated based on **domain knowledge**, expert advice or previous works, like in the regression example.

Let's see an example of how this works: imagine we have measured the height of 34 individuals: **25 males (blue)** and **9 females (red)**, and we get a **new** height **observation** of 172 cm which we want to classify as male or female. The following figure represents the predictions obtained using a **Maximum Likelihood classifier and a Bayes optimal classifier.**

In this case we have used the **number of samples** in the training data **as** the **prior knowledge** for our class distributions, but if, for example, we were doing this same differentiation between height and gender for a specific country, and we knew the women there are especially tall, and also knew the mean height of the men, we could have used this **information to build our prior class distributions.**
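A minimal sketch of this comparison, using made-up Gaussian class-conditional densities (the means and standard deviations below are hypothetical, not taken from the figure) and the sample counts 25/34 and 9/34 as class priors:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical class-conditional densities P(x|wi), estimated
# from the training heights of each class.
male_mean, male_std = 180.0, 7.0
female_mean, female_std = 167.0, 5.0

# Prior class probabilities P(wi) from the sample counts: 25 males, 9 females.
p_male, p_female = 25 / 34, 9 / 34

x = 172.0  # new height observation (cm)

# Maximum Likelihood classifier: compare only the likelihoods P(x|wi).
ml_male = gaussian_pdf(x, male_mean, male_std)
ml_female = gaussian_pdf(x, female_mean, female_std)

# Bayes optimal classifier: weight each likelihood by its prior,
# i.e. compare P(x|wi) * P(wi), which is proportional to P(wi|x).
bayes_male = ml_male * p_male
bayes_female = ml_female * p_female
```

With these particular numbers the likelihood alone favours "female", while weighting by the priors flips the decision to "male", illustrating how the two classifiers can disagree on the same observation.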

As we can see from the example, using this **prior knowledge leads to different results** than not using it. Assuming this previous knowledge is of high quality (otherwise we wouldn't use it), these predictions should be **more accurate** than similar trials that don't incorporate this information.

After this, as always, as we get **more data** these **distributions would get updated** to reflect the knowledge obtained from that data.

As in the previous case, I don't want to get too technical, or extend the article too much, so I won't go into the mathematical details, but **feel free to contact me if you are curious about them**.

### Conclusion

We have seen **how Bayes' theorem is used in Machine Learning**, both in **regression** and **classification**, to incorporate previous knowledge into our models and improve them.

In the following post we will see how **simplifications of Bayes' theorem** are one of the most-used techniques for **Natural Language Processing** and how they are applied to many real-world use cases like spam filters or sentiment analysis tools. To check it out, **follow me on Medium**, and stay tuned!

That is all, I hope you liked the post. Feel free to connect with me on LinkedIn or follow me on Twitter at **@jaimezorno**. Also, you can take a look at my other posts on Data Science and Machine Learning **here**. Have a good read!

### Additional Resources

In case you want to go more in depth into Bayes and Machine Learning, check out these other resources:

- How Bayesian Inference works
- Bayesian statistics Youtube Series
- Machine Learning Bayesian Learning slides
- Bayesian Inference

And as always, contact me with any questions. Have a fantastic day and keep learning.

**Bio: Jaime Zornoza** is an Industrial Engineer with a bachelor specialized in Electronics and a Masters degree specialized in Computer Science.

Original. Reposted with permission.

**Related:**

- Probability Learning I: Bayes' Theorem
- How Bayesian Inference Works
- Deep Learning for NLP: ANNs, RNNs and LSTMs explained!