Probability Mass and Density Functions

This content is part of a series about the chapter 3 on probability from the Deep Learning Book by Goodfellow, I., Bengio, Y., and Courville, A. (2016). It aims to provide intuitions/drawings/python code on mathematical theories and is constructed as my understanding of these concepts.

By Hadrien Jean, Machine Learning Scientist at Ava Accessibility


This content is part of a series about the chapter 3 on probability from the Deep Learning Book by Goodfellow, I., Bengio, Y., and Courville, A. (2016). It aims to provide intuitions/drawings/python code on mathematical theories and is constructed as my understanding of these concepts.

Github: the corresponding Python notebook can be found here.



I’m happy to present here the following of my series on the Deep Learning Book by Goodfellow et al. This is the first post/notebook made from chapter 3 on Probability. For those who already know my posts/notebooks about Chapter 2 on linear algebra, the aim and the structure are the same. The goal is to make the book more accessible for people without a deep mathematical background.

I think that it is possible to gain a better intuition on mathematical concepts by using code (here Python) and visualisations. The structure follows the book sub-chapters and it can be used as additional content, giving examples and details.

This first part is about chapters 3.1 to 3.3. Chapter 3.1 is an introduction to probability and contains no technical difficulty. You can thus directly read it here. Chapter 3.2 is really only a definition so the main part is 3.3 on probability mass function and probability density function. After reading it, random variables and their probability distributions (for discrete and continuous variables) will have no secret for you.

In order to understand it we’ll also study some very useful mathematical tools:

???? Discrete vs. continuous variable

???? Derivative

???? Integrals

???? Area under the curve

These notions are important to catch for general data science and machine learning.


3.2 Random Variables

The goal of probability is to deal with uncertainty. It gives ways to describe random events. A random variable is a variable that can take multiple values depending on the outcome of a random event. The possible outcomes are the possible values taken by the variable.

If the outcomes are finite (for example the 6 possibilities in a die throwing event) the random variable is said to be discrete.

If the possible outcomes are not finite (for example, drawing a number between 0 and 1 can give an infinite number of values), the random variable is said to be continuous.

As in the book, we will use the following notation: a lower case letter in plain typeface for a random variable: x


Example 1.

Let’s say that the variable x is a random variable expressing the result of a dice roll ????. The variable can take the value 1, 2, 3, 4, 5 or 6. It is a discrete random variable.


3.3 Probability Distributions

So a random variable can take multiple values. One very important thing is to know if some values will be more often encountered than others. The description of the probability of each possible value that a random variable can take is called its probability distribution.

This idea of discrete versus continuous variable is important and we will study the concept of probability distribution in both cases. Even if they are related, some differences exist.

Anyway, the probability distribution of a random variable x describes the probability of each outcome (a probability of 1 meaning that the variable will always take this value and a probability of 0 that it will never be encountered). This function is called probability distribution. More specifically, it is called the probability mass function for a discrete variable and probability density function for a continuous variable.


3.3.1 Discrete Variable and Probability Mass Function

The probability mass function is the function which describes the probability associated with the random variable x. This function is named P(x) or P(x=x) to avoid confusion. P(x=x) corresponds to the probability that the random variable x take the value x (note the different typefaces).


Example 2.



Dice experiment illustrating discrete random variable and probability mass function

Let’s roll a die an infinite number of times and look at the proportion of 1, the proportion of 2 and so on. We call x the random variable that corresponds to the outcome of the dice roll. Thus the random variable x can only take the following discrete values: 1, 2, 3, 4, 5 or 6. It is thus a discrete random variable.

The aim of the probability mass function is to describe the probability of each possible value. In our example, it describes the probability to get a 1, the probability to get a 2 and so on. In the case of a dice rolling experiment, we have the same probability to get each value (if we assume that the die is perfect). This means that we can write:


Now, how can we calculate the probabilities P(x=1), P(x=2) etc.? Since we have 6 possible outcomes and that they are equiprobable we have:


By the way, this distribution shows the same probability for each value: it is called the uniform distribution.

The probability mass function could look something like:


Probability mass function of the dice experiment

The y-axis gives the probability and x-axis the outcome.

Let’s reproduce this first example in code to be sure that everything is clear:

num_throws = 10000
outcomes = np.zeros(num_throws)
for i in range(num_throws):
    # let's roll the die
    outcome = np.random.choice(['1', '2', '3', '4', '5', '6'])
    outcomes[i] = outcome

val, cnt = np.unique(outcomes, return_counts=True)
prop = cnt / len(outcomes)

# Now that we have rolled our die 10000 times, let's plot the results, prop)


I created an array filled with 0 with the Numpy function zeros(). At each throw, I chose a value among the 6 possibilities. Then, I used the Numpy function unique() with the parameter return_counts set to True to get the number of each possible outcome. I plotted the proportion for each possible value.

We can see that the distribution looks like a uniform distribution and that each outcome has a probability of around 1/6 (≈0.17).


Joint probability distribution

Now let’s see what happens if we roll two dice. For each die, the outcomes are associated with a certain probability. We need two random variables to describe the game, let’s say that x corresponds to the first die and y to the second one. We also have two probability mass functions associated with the random variables: P(x) and P(y). Here the possible values of the random variables (1, 2, 3, 4, 5 or 6) and the probability mass functions are actually the same for both dice, but it doesn’t need to be the case.

The joint probability distribution is useful in the cases where we are interested in the probability that x takes a specific value while y takes another specific value. For instance, what would be the probability to get a 1 with the first dice and 2 with the second dice? The probabilities corresponding to every pair of values are written P(x=x,y=y) or P(x,y). This is what we call the joint probability.


Example 3.

For example, let’s calculate the probability to have a 1 with the first dice and a 2 in the second:



Properties of a probability mass function

A function is a probability mass function if:


The symbol ∀ means “for any”. This means that for every possible value x in the range of x (in the example of a die rolling experiment, all possible values were 1, 2, 3, 4, 5 and 6), the probability that the outcome corresponds to this value is between 0 and 1. A probability of 0 means that the event is impossible and a probability of 1 means that you can be sure that the outcome will correspond to this value.

In the example of the dice, the probability of each possible value is 16 which is between 0 and 1. This property is fulfilled.


This means that the sum of the probabilities associated with each possible value is equal to 1.

In the example of the dice experiment, we can see that there are 6 possible outcomes, each with a probability of 1/6 giving a total of 1/6×6=1. This property is fulfilled.


3.3.2 Continuous Variable and Probability Density Function

Some variables are not discrete. They can take an infinite number of values in a certain range. But we still need to describe the probability associated with outcomes. The equivalent of the probability mass function zfor a continuous variable is called the probability density function.

In the case of the probability mass function, we saw that the y-axis gives a probability. For instance, in the plot we created with Python, the probability to get a 1 was equal to 1/6≈0.16 (check on the plot above). It is 1/6 because it is one possibility over 6 total possibilities.

However, we can’t do this for continuous variables because the total number of possibilities is infinite. For instance, if we draw a number between 0 and 1, we have an infinite number of possible outcomes (for instance 0.320502304…). In the example above, we had 6 possible outcomes, leading to probabilities around 1/6. Now, we have each probability equal to 1/+∞≈0. Such a function would not be very useful.

For that reason, the y-axis of the probability density function doesn’t represent probability values. To get the probability, we need to calculate the area under the curve (we will see below some details about the area under the curve). The advantage is that it leads to the probabilities according to a certain range (on the x-axis): the area under the curve increases if the range increases. Let’s see some examples to clarify all of this.


Example 4.

Let’s say that we have a random variable x that can take values between 0 and 1. Here is its probability density function:


Probability density function

We can see that 0 seems to be not possible (probability around 0) and neither 1. The pic around 0.3 means that will get a lot of outcomes around this value.

Finding probabilities from probability density function between a certain range of values can be done by calculating the area under the curve for this range. For example, the probability of drawing a value between 0.5 and 0.6 corresponds to the following area:


Probability density function and area under the curve between 0.5 and 0.6.

We can easily see that if we increase the range, the probability (the area under the curve) will increase as well. For instance, for the range of 0.5-0.7:


Probability density function and area under the curve between 0.5 and 0.7.

We will see in a moment how to calculate the area under the curve and get the probability associated with a specific range.