Which methods should be used for solving linear regression?
As a foundational algorithm in any machine learning toolbox, linear regression can be solved with a variety of approaches. Here, we discuss, with code examples, four methods and demonstrate how they should be used.
By Ahmad BinShafiq, Machine Learning Student.
Linear Regression is a supervised machine learning algorithm. It models a linear relationship between a dependent variable (y) and one or more independent variables (x), such that the fitted line has the lowest possible cost (error).
Different approaches to solve linear regression models
There are many different methods we can use to solve a linear regression model. We will discuss the most common ones here:
- Gradient Descent
- Least Square Method / Normal Equation Method
- Adam Method
- Singular Value Decomposition (SVD)
Okay, so let’s begin…
Gradient Descent

One of the most common and easiest methods for beginners to solve linear regression problems is gradient descent.
How Gradient Descent works
Now, let's suppose we have our data plotted out in the form of a scatter graph. When we fit a hypothesis to it, our model makes a prediction. This prediction can be very good, or it can be far away from our ideal prediction (meaning its cost, measured by the cost function, is high). So, in order to minimize that cost (error), we apply gradient descent.
Gradient descent will slowly converge our hypothesis towards the global minimum, where the cost is lowest. In doing so, we have to manually set the value of the learning rate alpha, which controls how much the hypothesis changes on each update. If the value of alpha is large, gradient descent takes big steps. If alpha is small, our hypothesis converges slowly, through small baby steps.
Hypothesis converging towards a global minimum. Image from Medium.
The equation for gradient descent is the update rule

θj := θj − α · ∂J(θ)/∂θj

applied simultaneously for every parameter θj, where α is the learning rate and J(θ) is the cost function.
Implementing Gradient Descent in Python
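The article's original code listing is not reproduced here, so below is a minimal sketch of batch gradient descent for simple linear regression. The dataset and variable names are illustrative, not the article's:

```python
import numpy as np

# Illustrative dataset: y = 2x + 1 plus a little noise (not the article's original data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 50)
y = 2.0 * X + 1.0 + rng.normal(0, 0.5, 50)

# Hypothesis: h(x) = theta0 + theta1 * x, parameters start at zero
theta0, theta1 = 0.0, 0.0
alpha = 0.01  # learning rate, set manually

for _ in range(2000):
    predictions = theta0 + theta1 * X
    error = predictions - y
    # Simultaneous update: theta_j := theta_j - alpha * dJ/dtheta_j
    theta0 -= alpha * error.mean()
    theta1 -= alpha * (error * X).mean()

print(theta0, theta1)  # should approach the true intercept (1) and slope (2)
```

Each pass computes the prediction error over the whole dataset and nudges both parameters downhill along the gradient of the mean squared error.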
Model after Gradient Descent.
Here, we first created our dataset and then looped over all our training examples, updating the parameters to minimize the cost of our hypothesis.
Important advantages of Gradient Descent are
- Lower computational cost than SVD or ADAM
- Running time is O(kn²)
- Works well with a large number of features
Important cons of Gradient Descent are
- Need to choose some learning rate α
- Needs many iterations to converge
- Can be stuck in Local Minima
- Might not converge if the learning rate α is not chosen properly
Least Square Method
The least-square method, also known as the normal equation, is another common approach to solving linear regression models. It does, however, require some basic knowledge of linear algebra.
How the least square method works
In the normal equation method, we solve directly for the values of our coefficients. In short, we reach the optimal point in a single step; or, we can say, in only one step we fit our hypothesis to our data with the lowest cost possible.
Before and after applying LSM to our dataset. Image from Medium.
The equation for LSM is the normal equation

b = (XᵀX)⁻¹Xᵀy

where X is the design matrix and y is the vector of target values.
Implementing LSM in Python
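The article's original listing is not reproduced here; a minimal sketch of the normal equation on an illustrative dataset (the data is ours, not the article's) might look like:

```python
import numpy as np

# Illustrative dataset: y = 3x + 2 plus noise (not the article's original data)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 50)

# Design matrix with a column of ones for the intercept term
X = np.c_[np.ones_like(x), x]

# Normal equation: b = (X^T X)^(-1) X^T y
b = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print(b)  # [intercept, slope]
```

No learning rate and no iterations: a single linear-algebra expression produces the coefficients.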
Here, we first created our dataset and then minimized the cost of our hypothesis using the line

b = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

which is the normal equation expressed in code.
Important advantages of LSM are:
- No Learning Rate
- No Iterations
- Feature Scaling Not Necessary
- Works really well when the number of features is small
Important cons are:
- Computationally expensive when the dataset is large
- Slow when the number of features is large
- Running time is O(n³)
- Sometimes XᵀX is non-invertible, i.e., a singular matrix with no inverse; you can use np.linalg.pinv instead of np.linalg.inv to overcome this problem.
Adam Method

ADAM, which stands for Adaptive Moment Estimation, is an optimization algorithm that is widely used in deep learning.
It is an iterative algorithm that works well on noisy data.
It can be seen as a combination of the RMSProp and momentum-based gradient descent algorithms.
In addition to storing an exponentially decaying average of past squared gradients like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
We compute the decaying averages of past and past squared gradients, respectively, as follows:

mt = β1 · mt−1 + (1 − β1) · gt
vt = β2 · vt−1 + (1 − β2) · gt²

where gt is the gradient of the cost at time step t.
As mt and vt are initialized as vectors of 0's, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e., β1 and β2 are close to 1).
They counteract these biases by computing bias-corrected first and second-moment estimates:

m̂t = mt / (1 − β1ᵗ)
v̂t = vt / (1 − β2ᵗ)
They then update the parameters with:

θt+1 = θt − η · m̂t / (√v̂t + ε)

where η is the learning rate and ε is a small constant that prevents division by zero.
Pseudocode for Adam is
Source: Arxiv Adam.
Let's see its code in pure Python.
Now Let’s find the actual graph of Linear Regression and values for slope and intercept for our dataset.
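The article computes the reference slope and intercept with SciPy's linregress; the original listing is not shown here, so the sketch below uses `scipy.stats.linregress` on an illustrative dataset of our own:

```python
import numpy as np
from scipy import stats

# Illustrative dataset: y = 4x + 5 plus noise (not the article's original data)
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 4.0 * x + 5.0 + rng.normal(0, 0.5, 50)

# Ordinary least-squares fit; the result exposes slope and intercept
result = stats.linregress(x, y)
print(result.slope, result.intercept)
```

These two values serve as the ground truth that the Adam-trained parameters are compared against later.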
Now let us see the Linear Regression line using the Seaborn regplot function.
Let us code Adam Optimizer now in pure Python.
Here, we have implemented all the equations mentioned in the pseudocode above using an object-oriented approach and some helper functions.
Let us now set the hyperparameters for our model.
And finally, the training process.
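The optimizer class, hyperparameter settings, and training loop described above can be sketched together as follows. This is a minimal illustration, not the article's original code: the dataset, the `AdamOptimizer` class name, and the hyperparameter values are all ours.

```python
import numpy as np

# Illustrative dataset: y = 2x + 1 plus noise (not the article's original data)
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)
X = np.c_[np.ones_like(x), x]  # design matrix: [1, x]

class AdamOptimizer:
    """Minimal Adam sketch implementing the update rules above."""
    def __init__(self, lr=0.02, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = self.v = 0.0  # first and second moment estimates
        self.t = 0             # time step

    def step(self, theta, grad):
        self.t += 1
        # Decaying averages of past and past squared gradients
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        # Bias-corrected moment estimates
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Parameter update
        return theta - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Hyperparameters and training loop
theta = np.zeros(2)  # [intercept, slope]
opt = AdamOptimizer(lr=0.02)
for _ in range(10000):
    grad = 2 * X.T.dot(X.dot(theta) - y) / len(y)  # gradient of MSE
    theta = opt.step(theta, grad)

print(theta)  # final [intercept, slope]
```

The final theta values can then be compared against the slope and intercept obtained from linregress.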
Now, if we compare the final theta values to the slope and intercept calculated earlier using scipy.stats.mstats.linregress, they are almost identical, and can be made to match exactly by adjusting the hyperparameters.
Finally, let us plot it.
And we can see that our plot is similar to the plot obtained using sns.regplot.
Important advantages of ADAM are:

- Straightforward to implement.
- Computationally efficient.
- Little memory requirements.
- Invariant to diagonal rescale of the gradients.
- Well suited for problems that are large in terms of data and/or parameters.
- Appropriate for non-stationary objectives.
- Appropriate for problems with very noisy/or sparse gradients.
- Hyper-parameters have intuitive interpretation and typically require little tuning.
Important cons are:

- Adam and RMSProp are highly sensitive to certain values of the learning rate (and, sometimes, other hyper-parameters like the batch size), and can catastrophically fail to converge if, e.g., the learning rate is too high. (Source: Stack Exchange)
Singular Value Decomposition
Singular value decomposition, shortened as SVD, is one of the best-known and most widely used dimensionality reduction methods in linear regression.
SVD is used (amongst other uses) as a preprocessing step to reduce the number of dimensions for our learning algorithm. SVD decomposes a matrix into a product of three other matrices (U, S, V).
Once our matrix has been decomposed, the coefficients for our hypothesis can be found by calculating the pseudoinverse of the input matrix X and multiplying that by the output vector y. After that, we fit our hypothesis to our data, and that gives us the lowest cost.
Implementing SVD in Python
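The article's original listing is not reproduced here; a minimal sketch using the SVD-based pseudoinverse on an illustrative dataset (the data is ours, not the article's) might look like:

```python
import numpy as np

# Illustrative dataset: y = 2.5x + 1.5 plus noise (not the article's original data)
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 1.5 + rng.normal(0, 0.5, 50)
X = np.c_[np.ones_like(x), x]  # design matrix: [1, x]

# np.linalg.pinv computes the Moore-Penrose pseudoinverse via SVD,
# so this one line solves the least-squares problem even when
# X^T X is singular.
b = np.linalg.pinv(X).dot(y)
print(b)  # [intercept, slope]
```

Because the pseudoinverse is built from the SVD factors (U, S, V), this route is numerically more stable than inverting XᵀX directly.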
Though it has not converged perfectly, the fit is still pretty good.
Here first, we have created our dataset and then minimized the cost of our hypothesis using b = np.linalg.pinv(X).dot(y), which is the equation for SVD.
Important advantages of SVD are:

- Works better with higher-dimensional data
- Good for Gaussian-type distributed data
- Really stable and efficient for small datasets
- More stable than the alternatives for solving the linear equations in linear regression, and the preferred approach
Important cons are:

- Running time is O(n³)
- Multiple risk factors
- Really sensitive to outliers
- May get unstable with a very large dataset
So far, we have learned and implemented gradient descent, LSM, ADAM, and SVD. We now have a very good understanding of all of these algorithms, and we also know their pros and cons.
One thing we noticed was that the ADAM optimization algorithm was the most accurate, and, according to the original ADAM research paper, ADAM outperforms almost all other optimization algorithms.
- Linear to Logistic Regression, Explained Step by Step
- A Beginner’s Guide to Linear Regression in Python with Scikit-Learn
- Linear Regression In Real Life