Neural Networks with Numpy for Absolute Beginners — Part 2: Linear Regression
In this tutorial, you will learn to implement Linear Regression for prediction using Numpy in detail and also visualize how the algorithm learns epoch by epoch. In addition to this, you will explore two layer Neural Networks.
As mentioned earlier, now that you have both the corresponding values for X_train and the predicted values for y_pred, you'll calculate the Cost/Error/Loss Function.
The Loss(Mean Squared Error) is:
Summing over all M examples, we obtain the Loss fn. as below:
Our goal is to minimize the Loss so that the regression line predicts more accurately.
Let us now codify this.
You will also save each value of Loss that will be computed to graphically visualize how it changes during training.
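The original code for this step is not reproduced here, so the following is only a minimal sketch of how the loss could be computed and recorded. The function name `compute_loss` and the list `losses` are hypothetical, and the plain 1/M averaging is an assumption (some treatments scale by 1/2M instead, which only changes the gradients by a constant factor).

```python
import numpy as np

def compute_loss(y, y_pred):
    """Mean squared error averaged over all M examples (assumed 1/M scaling)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y) ** 2))

# Hypothetical history list: one loss value is appended per epoch so the
# curve can be plotted after training.
losses = []

y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.5, 2.0])
losses.append(compute_loss(y, y_pred))
print(losses)
```

Keeping the history in a plain list makes it easy to hand to a plotting function later.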
Let’s modify the plot_graph function defined above to plot the Loss too.
You’ll visualize the line created from the parameters m and b.
Now that you have computed the loss, let’s minimize it.
Gradient Descent for Linear Regression
Since Loss is the dependent variable and m and b are the independent variables, we’ll have to update m and b so as to find the minimum Loss.
So, the immediate question would be…
How can I update the parameters m and b?
Let us, for instance, consider just a single parameter p as shown below, and let t (target) be the value that has to be predicted. We see that as the cost converges to its minimum, the parameter p reaches a specific value called the optimal value. Let’s say the optimal value of p is a.
You can make a few observations from this graph.
It is clear from the graph, that as p moves towards a, the Cost decreases and as it moves away from it, the cost increases.
Now, how can we make p move towards a, regardless of whether it is on the left or to the right of a as shown in figure?
Let us consider the slope of the curve at p. From calculus, we know that the slope of a curve at a point is given by dy/dx (here it is dL/dp, where L → Loss). From the figure, when p is to the left of a, the slope is negative, and when it is to the right, the slope is positive. We also see that if p is to the left of a, some value must be added to p. Likewise, some value must be subtracted when p is to the right of a.
This means that when the slope is negative, p = p + (some value), and when the slope is positive, p = p ‒ (some value), so that p moves towards a.
∴ We subtract the slope from p. Because a negative slope is negated by the subtraction, p always moves towards a. The resulting equation would be,
p = p ‒ dL/dp
It must also be observed that when the cost is high, the slope is steep. Hence, when the full slope is subtracted from p, the value of p might overshoot a. To prevent this, the slope has to be scaled down. Therefore, we multiply the slope by a dampening factor called the Learning Rate (α). You’ll see later that varying α changes the rate at which the error decreases.
What we finally obtain would be,
p = p ‒ α · dL/dp
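To see the update rule at work, here is a minimal sketch on a toy loss L(p) = (p ‒ a)², which is an assumption chosen only because its slope is easy to write down. The function name `descend` and the starting values are hypothetical.

```python
def descend(p, a, alpha=0.1, steps=50):
    """Minimise the toy loss L(p) = (p - a)**2 by repeated slope subtraction."""
    for _ in range(steps):
        slope = 2.0 * (p - a)    # dL/dp: negative left of a, positive right of a
        p = p - alpha * slope    # damped update with learning rate alpha
    return p

a = 3.0
p_left = descend(p=-5.0, a=a)    # starts to the left of a
p_right = descend(p=10.0, a=a)   # starts to the right of a
print(p_left, p_right)           # both end up close to a

# With too large a learning rate every step overshoots a by more than the
# previous distance, so p moves away from a instead of towards it.
p_diverged = descend(p=10.0, a=a, alpha=1.1, steps=20)
print(p_diverged)
```

The same subtraction handles both sides of a, because the sign of the slope decides the direction of the step.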
As shown in the figure, the trajectory taken by p against the cost is that of a bowl-shaped (parabolic) curve, with the minimum at the bottom.
This method is called Gradient Descent.
In our case, we use two parameters, m and b. Therefore, the curve becomes a 3-dimensional bowl-shaped surface, as shown in the figure below.
As mentioned, you’ll compute the partial derivatives of the loss function with respect to the parameters m and b. [Note: You are usually expected to know the basic concepts of partial derivatives. However, if you do not, you can refer to this wonderful Khan Academy video.]
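As a rough illustration, the sketch below computes these partial derivatives for the line y_pred = m·x + b, assuming the loss is the mean squared error averaged over M examples. Under that assumption, dL/dm = (2/M)·Σ(y_pred ‒ y)·x and dL/db = (2/M)·Σ(y_pred ‒ y). The function name `gradient` is hypothetical.

```python
import numpy as np

def gradient(m, b, X, y):
    """Partial derivatives of the (assumed) mean squared error w.r.t. m and b."""
    error = (m * X + b) - y           # y_pred - y for every example
    dm = 2.0 * np.mean(error * X)     # dL/dm
    db = 2.0 * np.mean(error)         # dL/db
    return float(dm), float(db)

X = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])         # generated by y = 2x + 1

dm, db = gradient(0.0, 0.0, X, y)     # far from the optimum: non-zero slopes
dm_opt, db_opt = gradient(2.0, 1.0, X, y)  # at the optimum: both slopes vanish
print(dm, db, dm_opt, db_opt)
```

Both slopes are negative at m = b = 0, which tells gradient descent to increase m and b.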
Updating the parameters
Now we subtract the derivatives of the loss with respect to m and b from the respective parameters, after scaling them by the dampening factor α (alpha).
By updating the values of m and b in this way, they move incrementally towards the minimum. This update has to be repeated for many iterations, which are called epochs.
Let us define a function grad_desc, which performs both steps: computing the gradients and updating the parameters.
We have now defined everything that we need, so let’s compile all the functions into one and see how our algorithm works. So, before you can actually run the code, you’ll have to set the hyperparameters.
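The original code cells are not reproduced here, so what follows is only a sketch of how the pieces might fit together. It assumes a mean squared error loss, synthetic data in place of the tutorial's dataset, and hypothetical hyperparameter values; `train` is a hypothetical name, and this `grad_desc` is one possible reading of the function described above.

```python
import numpy as np

def grad_desc(m, b, X, y, alpha):
    """One gradient descent step for y_pred = m * X + b under MSE loss."""
    error = (m * X + b) - y
    dm = 2.0 * np.mean(error * X)     # dL/dm
    db = 2.0 * np.mean(error)         # dL/db
    return m - alpha * dm, b - alpha * db

def train(X, y, alpha, epochs):
    """Run gradient descent, recording the loss before every update."""
    m, b = 0.0, 0.0
    losses = []
    for epoch in range(epochs + 1):
        loss = float(np.mean((m * X + b - y) ** 2))
        losses.append(loss)
        if epoch % 10 == 0:
            print(f"Epoch: {epoch} Loss = {loss}")
        m, b = grad_desc(m, b, X, y, alpha)
    return m, b, losses

# Synthetic data (an assumption): a noisy line with slope 3 and intercept 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * X + 0.5 + rng.normal(0.0, 0.1, size=200)

# Hyperparameters (hypothetical values chosen for this toy data).
m, b, losses = train(X, y, alpha=0.1, epochs=200)
print(m, b)
```

Because the data and hyperparameters differ from the tutorial's, the printed losses will not match the log shown next, but they should fall steadily in the same way.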
Epoch: 0 Loss = 2934.082243250548
Epoch: 10 Loss = 1246.3617292447889
Epoch: 20 Loss = 546.310951004311
Epoch: 30 Loss = 255.88020867147344
Epoch: 40 Loss = 135.36914932067438
Epoch: 50 Loss = 85.35744394597806
Epoch: 60 Loss = 64.60029693013243
Since you have trained the parameters for 60 epochs and the regression line looks to be fitting the data, you can move forward to the last phase, i.e., prediction on our test data and checking the accuracy.
For checking the accuracy, you can take the mean of percentage error for all the test data points.
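The text does not spell out the exact formula, so the sketch below rests on an assumption: accuracy is taken as 100 minus the mean absolute percentage error over the test points. The function name `accuracy` is hypothetical, and the sketch assumes that no target value is zero.

```python
import numpy as np

def accuracy(y, y_pred):
    """100 minus the mean absolute percentage error (assumed definition)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    pct_error = np.abs((y_pred - y) / y) * 100.0   # requires y != 0
    return float(100.0 - np.mean(pct_error))

# Each prediction is off by 10% of its target, so the accuracy is 90%.
acc = accuracy([100.0, 200.0], [90.0, 220.0])
print(f"Accuracy = {acc:.4f}%")
```

A percentage-based measure is easy to read, but it becomes unstable when target values are close to zero.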
Prediction:
Loss = 56.53060443946197
Accuracy = 80.1676%
Hence
m = 82.34083095217943
b = 0.46491578390750576
The accuracy is 80% which is “ok” considering the variance in the data as is seen in the above graphs.
I was hoping to introduce something really interesting in the article and as a bonus I have also added an intro to Neural Networks. But this surely comes with a catch!
Two Layer Neural Network with Linear Activation Function
The Neural Network is shown below.
From the image, we observe that there are two inputs each to the two neurons in the first layer and an output neuron in the second layer.
We will be using matrices for representing our above equations. We can represent them in vector (single column matrix) form as:
While doing matrix computations, we’ll need to take care of the dimensions and multiply. Hence, we rearrange a bit to arrive at the required output.
The expansion of the equation is not required and hence let’s stick to
Similarly, the value of
Now the output from the 2ⁿᵈ layer will be:
From the above set of equations, we see that a neural network with a linear activation function reduces to a linear equation.
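This collapse can be checked numerically. The sketch below uses hypothetical weight names (`W1`, `b1`, `W2`, `b2`) and random values, with two inputs, two neurons in the first layer, and one output neuron, and it compares the two-layer output with a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: layer 1 has two neurons with two inputs each,
# layer 2 has a single output neuron.
W1 = rng.normal(size=(2, 2))
b1 = rng.normal(size=(2, 1))
W2 = rng.normal(size=(1, 2))
b2 = rng.normal(size=(1, 1))

def two_layer(x):
    """Forward pass with a linear (identity) activation in both layers."""
    h = W1 @ x + b1        # first layer output, passed on unchanged
    return W2 @ h + b2     # second layer output

# The same computation folded into one weight matrix and one bias.
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.normal(size=(2, 5))    # five example inputs, one per column
print(np.allclose(two_layer(x), W @ x + b))
```

However many linear layers are stacked, the product of their weight matrices is still a single matrix, so the network cannot represent anything beyond a linear equation.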
The whole purpose of neural networks is to create a very complex function that can fit any sort of data and, as can be clearly seen, a neural network with only linear activation functions defeats that purpose. Hence, a linear function should not be used as the activation function in the hidden layers of a neural network, although it can be used in the last layer for regression problems.
Then I guess you’ll have to hold your horses until the next tutorial to implement one!
Here’s the link for the full implementation in Jupyter Notebook:
Go ahead, clone it and start running the cells on your Colab to see the miracles of Gradient Descent!
In this tutorial, you learnt
- Linear activation functions perform regression, i.e., they learn to predict and forecast values. This method is commonly known as Linear Regression.
- An MLP (Multi-Layer Perceptron) with linear activation functions reduces to an ordinary Linear Regression task. Hence, linear activations must not be used in the hidden layers of a network. However, they can be used in the last layer for regression/prediction tasks.
In the next tutorial, you’ll learn about the Sigmoid Activation Function and perform Logistic Regression, which is an important step towards implementing neural networks.
Are you working on any cool Deep Learning project?
You can connect with me on LinkedIn (Suraj Donthi) or message me on Twitter (@suraj_donthi) with any queries.
Bio: Suraj Donthi is a Computer Vision Consultant | Author | Machine Learning and Deep Learning Trainer.
Original. Reposted with permission.