Probability Mass and Density Functions
This content is part of a series about Chapter 3 on probability from the Deep Learning Book by Goodfellow, I., Bengio, Y., and Courville, A. (2016). It aims to provide intuitions, drawings and Python code for the mathematical theory, and it reflects my own understanding of these concepts.
Properties of the probability density function
These differences between the probability mass functions and the probability density function lead to different properties for the probability density function:
In this case, p(x) is not necessarily less than 1 because it doesn’t correspond to the probability (the probability itself will still need to be between 0 and 1).
For instance, let’s say that we have a continuous random variable that can take values between 0 and 0.5. This variable is described by a uniform distribution, so its probability density function is constant over that interval, with a height of 1/(0.5−0)=2:
We can see that the y-values are greater than 1. The probability is given by the area under the curve and thus it depends on the x-axis as well.
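To make this concrete, here is a minimal sketch of the density of a uniform distribution on [a, b), assuming the standard formula p(x)=1/(b−a) inside the interval and 0 outside. The helper name `uniform_pdf` is hypothetical and only used for illustration.

```python
import numpy as np

def uniform_pdf(x, a=0.0, b=0.5):
    """Density of a uniform distribution on [a, b): 1 / (b - a) inside, 0 outside."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x < b), 1.0 / (b - a), 0.0)

# The density (y-value) is 2 everywhere inside the interval, so it exceeds 1.
density = uniform_pdf(0.25)

# A probability is an area: height times the width of the interval of interest.
# Here: the probability that x falls between 0.1 and 0.3.
prob = float(uniform_pdf(0.2)) * (0.3 - 0.1)

print(density, prob)
```

The density is larger than 1, but the probability of any interval stays between 0 and 1 because the width of the interval is taken into account.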
If you would like to see this for yourself, we will reproduce this example in Python. To do that, we will create a random variable x that takes values between 0 and 0.5 at random. We draw from the uniform distribution with the NumPy function random.uniform(). The parameters of this function are the lowest value (included), the highest value (not included) and the number of samples. So np.random.uniform(0, 0.5, 10000) will create 10000 values randomly chosen to be ≥0 and <0.5.
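The sampling step can be sketched as follows. This version uses `np.histogram()` with `density=True` as a numerical stand-in for the plot; the seed and the choice of 20 bins are assumptions made for illustration.

```python
import numpy as np

np.random.seed(123)  # assumption: fixed seed so the run is reproducible

# 10000 samples drawn uniformly from [0, 0.5)
x = np.random.uniform(0, 0.5, 10000)

# density=True rescales the counts so that the bars have a total area of 1
heights, edges = np.histogram(x, bins=20, range=(0, 0.5), density=True)

print(x.min(), x.max())   # all samples lie in [0, 0.5)
print(heights.round(2))   # bar heights, all close to 2
```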
Looks good!
We can see that the shape looks like what I drew above, with y-axis values around 2 for all x between 0 and 0.5.
However, one thing can be intriguing in this plot. We talked about a continuous variable, yet here we have represented the distribution with bars. The explanation is the same as before: we need to discretise the function to count the number of outcomes in each interval. Actually, the number of intervals (bins) is a parameter of the function distplot(). Let’s try to use a lot of bins:
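The effect of the number of bins can be checked numerically. In this sketch, `np.histogram()` stands in for the plotting function, and the bin counts (20 and 1000) and the seed are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(123)  # assumption: fixed seed for reproducibility
x = np.random.uniform(0, 0.5, 10000)

# Same samples, two different discretisations
few, _ = np.histogram(x, bins=20, range=(0, 0.5), density=True)
many, _ = np.histogram(x, bins=1000, range=(0, 0.5), density=True)

# The mean bar height stays at 2, but the spread of the heights differs
print("20 bins:  ", few.mean(), few.std(), few.min(), few.max())
print("1000 bins:", many.mean(), many.std(), many.min(), many.max())
```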
We can see that we are still around 2 but that the variability is greater than before (the bars can go from 1 to 4, which was not the case in the last plot). Any idea why?
This is because, with more bins, fewer values fall into each bin, which leads to a less accurate estimate. If this hypothesis is true, we could reduce this variability by increasing the number of samples. Let’s try that:
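This hypothesis can be tested with a short sketch: keep the number of bins fixed and increase the number of samples. The values used here (1000 bins, 10000 versus 1000000 samples) and the helper name `bar_spread` are assumptions for illustration, and `np.histogram()` replaces the plot.

```python
import numpy as np

np.random.seed(123)  # assumption: fixed seed for reproducibility

def bar_spread(n_samples, bins=1000):
    """Standard deviation of the bar heights of a density histogram."""
    x = np.random.uniform(0, 0.5, n_samples)
    heights, _ = np.histogram(x, bins=bins, range=(0, 0.5), density=True)
    return heights.std()

spread_small = bar_spread(10_000)      # about 10 values per bin on average
spread_large = bar_spread(1_000_000)   # about 1000 values per bin on average

print(spread_small, spread_large)
```

With the same bins, each bin receives about 100 times more values in the second case, so the bar heights should vary much less around 2.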
That’s great!
We can now go to the second property!
For the probability mass function, we have seen that the sum of the probabilities has to be equal to 1. This is not the case for the probability density functions since the probability corresponds to the area under the curve and not directly to y values. However, this means that the area under the curve has to be equal to 1.
We saw in the last example that the area was actually equal to 1. It can be easily obtained and visualised because of the rectangular shape of the uniform distribution: we simply multiply the height by the width, 2×0.5=1.
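The same check works on sampled data: in a density histogram, summing bar height times bar width gives the total area. A minimal sketch, assuming `np.histogram()` with `density=True` as the discretisation (the seed and the 20 bins are arbitrary choices):

```python
import numpy as np

# Exact case: rectangle of height 2 and width 0.5
rectangle_area = 2 * 0.5

# Sampled case: total area of the bars of a density histogram
np.random.seed(123)  # assumption: fixed seed for reproducibility
x = np.random.uniform(0, 0.5, 10000)
heights, edges = np.histogram(x, bins=20, density=True)
widths = np.diff(edges)
histogram_area = np.sum(heights * widths)

print(rectangle_area, histogram_area)
```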
However, in many cases, the shape is not a rectangle and we still need to calculate the area under the curve. Let’s see how to do this!
Area under the curve
The area under the curve of a function over a specific range of values can be calculated with the integral of the function. We will see that calculating the integral of a function is the opposite of calculating the derivative. This means that if you differentiate a function f(x) and calculate the integral of the resulting function f′(x), you will get back f(x) (up to a constant, as we will see).
The derivative of a function at a point gives its rate of change. What is the link between the function describing the rate of change of another function (the derivative) and the area under the curve?
Let’s start with a point on derivatives! And then, with the next graphical example, it will be crystal clear.
We want to model the speed of a vehicle. Let’s say that the function f(x)=x² defines its speed (y-axis) as a function of time (x-axis).
First, we will plot the function f(x)=x² to see its shape:
The shape is a parabola! It shows that the speed increases slowly at the beginning, and that it gains more and more over each interval of equal duration.
I have created a variable x (with the function arange() from NumPy) that contains all the points of the x-axis. So it is just all values from -10 to 10 (10 itself excluded) with a step of 0.1. Let’s see the first 10 values.
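A sketch of this step, assuming the call was `np.arange(-10, 10, 0.1)` as the description suggests:

```python
import numpy as np

# All x-axis points from -10 (included) to 10 (excluded) with a step of 0.1
x = np.arange(-10, 10, 0.1)

# The function f(x) = x^2 evaluated at each of these points
f = x ** 2

print(len(x))
print(x[:10])   # first 10 values of the x-axis
print(f[:10])   # corresponding values of f(x)
```

Because of floating-point rounding, some of the values may differ from exact multiples of 0.1 in their last digits.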
Here is the doc of the arange() function from NumPy.
In our example, the function defines the speed of the vehicle as a function of time, so it doesn’t make sense to have negative values. Let’s take only the positive part of the x-axis to avoid negative time (we’ll say that 0 is the start of the experiment).
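One way to do this, sketched under the assumption that we simply rebuild the x-axis starting at 0 (filtering the previous array would be an alternative). The variable names `speed` and `gains` are illustrative.

```python
import numpy as np

# Time axis restricted to non-negative values: 0 (included) to 10 (excluded)
x = np.arange(0, 10, 0.1)
speed = x ** 2

# Speed gained over successive intervals of one time unit: 0->1, 1->2, 2->3
gains = [(t + 1) ** 2 - t ** 2 for t in range(3)]

print(x[:5], speed[:5])
print(gains)
```

The gains grow from one interval to the next, which is the "increases more and more" behaviour of the parabola.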
Ok, that’s better!
The derivative of this function is f′(x)=2x. For more information about derivative rules, check here.
Here is a plot of f′(x):
This representation of the derivative shows the acceleration. f(x) describes the speed of the vehicle as a function of time, and the derivative f′(x) shows the rate of change of the speed as a function of time, that is, the acceleration.
We can see that the acceleration of the vehicle increases linearly with time. The derivative tells us that the rate of change of the vehicle speed is 2x. For instance, when x=0, the rate of change is equal to 2×0=0, so the speed is not changing. When x=3, the rate of change is 2×3=6. This means that at this point, the speed is increasing at a rate of 6 per unit of time. To summarise, the derivative of a function gives its rate of change. In our example, the rate of change was another function (f′(x)=2x), but it can be a constant (the rate of change is always the same, e.g. f′(x)=2) or a cubic function, for instance (e.g. f′(x)=x³).
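The rate of change can also be estimated numerically. The sketch below uses `np.gradient()`, which approximates the derivative with finite differences, and compares the result with 2x. The grid (0 to 10 with a step of 0.1) follows the text; the variable names are illustrative.

```python
import numpy as np

x = np.arange(0, 10, 0.1)
f = x ** 2

# Finite-difference estimate of f'(x) at every point of the grid
slope = np.gradient(f, x)

# Compare with the analytical derivative 2x. The two end points use a
# less accurate one-sided difference, so they are left out.
max_error = np.max(np.abs(slope[1:-1] - 2 * x[1:-1]))

i = 30  # index of x = 3 (up to floating-point rounding)
print(x[i], slope[i], max_error)
```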
Being able to calculate derivatives is very powerful, but is it possible to do the reverse: going from the rate of change to the change itself? Whoah, this is cool! The answer is given by the integral of a function.
The integral of f′(x) gives us f(x) back. The notation is the following:
∫ f′(x) dx = f(x)
This means that we integrate f′(x) to get back f(x). The notation dx here means that we integrate over x, that is to say, that we sum thin slices of width dx, each weighted by its height y (see here).
If we take the last example again, we have:
∫ 2x dx = x² + c
We can see that there is a difference: the addition of a constant c. This is because an infinite number of functions could have given the derivative 2x (for instance x²+1 or x²+294…). We lose a bit of information and we can’t recover it.
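The loss of the constant can be illustrated numerically: shifting x² by any constant leaves the slope unchanged. This is a sketch using `np.gradient()` as an approximate derivative; the constants 0, 1 and 294 are taken from the text, and the grid is an assumption.

```python
import numpy as np

x = np.arange(0, 10, 0.1)

slopes = {}
for c in (0, 1, 294):
    # Approximate derivative of x^2 + c on the grid
    slopes[c] = np.gradient(x ** 2 + c, x)

# All three candidates share (numerically) the same derivative
same = all(np.allclose(slopes[0], slopes[c]) for c in (1, 294))
print(same)
```

Since the derivative does not depend on c, nothing in f′(x) tells us which of these functions we started from.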
And now, the graphical explanation (I love this one): we have seen that 2x is the function describing the rate of change (the slope) of the function x². Now if we go from f′(x) to f(x), we can see that the area under the curve of f′(x) corresponds to f(x):
This plot shows the function f′(x)=2x, and we can see that the area under the curve grows quadratically as the range widens. This area is represented for different ranges ([0, 0], [0, 1], [0, 2], [0, 3]). Each area is a right triangle with base x and height 2x, so we can calculate it as half of the base times the height (x×2x/2=x²) and find the following values: 0, 1, 4, 9… This corresponds to the original function f(x)=x²!
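These areas can be verified with a short sketch. It computes the triangle area directly (half of base times height) and also sums thin trapezoidal slices under f′(x)=2x. The helper name `area_under_2x` and the number of slices are assumptions for illustration.

```python
import numpy as np

def area_under_2x(upper, n_slices=100):
    """Area under f'(x) = 2x between 0 and `upper`, summed slice by slice."""
    x = np.linspace(0, upper, n_slices + 1)
    y = 2 * x
    # Each slice is a trapezoid: average height times width
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))

for upper in (0, 1, 2, 3):
    triangle = upper * (2 * upper) / 2   # base * height / 2
    print(upper, triangle, area_under_2x(upper))
```

Both computations should agree with x² at each upper bound, since the trapezoid rule is exact for a straight line.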
To summarise, we have seen what a random variable is and how the distribution of probabilities can be expressed for discrete variables (probability mass function) and continuous variables (probability density function). We also studied the concept of joint probability distribution and bedrock math tools like derivatives and integrals.
You now have all the tools to dive deeper into probability. The next part will be about chapters 3.4 to 3.8. We will see what are called marginal and conditional probability, the chain rule and the concept of independence.
I hope that this helped you to gain a better intuition on all of this! Feel free to contact me about any question/note/correction!
Bio: Hadrien Jean is a machine learning scientist at Ava Accessibility and has personal interests in Full Stack Web Development and Data Science.
Original. Reposted with permission.