A Concise Overview of Standard Model-fitting Methods

A very concise overview of 4 standard model-fitting methods, focusing on their differences: closed-form equations, gradient descent, stochastic gradient descent, and mini-batch learning.



In order to explain the differences between alternative approaches to estimating the parameters of a model, let's take a look at a concrete example: Ordinary Least Squares (OLS) Linear Regression. The illustration below shall serve as a quick reminder to recall the different components of a simple linear regression model:

Simple regresion

In Ordinary Least Squares (OLS) Linear Regression, our goal is to find the line (or hyperplane) that minimizes the vertical offsets. Or, in other words, we define the best-fitting line as the line that minimizes the sum of squared errors (SSE) or mean squared error (MSE) between our target variable (y) and our predicted output over all samples i in our dataset of size n.


SSE

Now, we can implement a linear regression model for performing ordinary least squares regression using one of the following approaches:

  • Solving the model parameters analytically (closed-form equations)
  • Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton's Method, Simplex Method, etc.)


1) Normal Equations (closed-form solution)


The closed-form solution may (should) be preferred for "smaller" datasets -- if computing (a "costly") matrix inverse is not a concern. For very large datasets, or datasets where the inverse of XTX may not exist (the matrix is non-invertible or singular, e.g., in case of perfect multicollinearity), the GD or SGD approaches are to be preferred. The linear function (linear regression model) is defined as:

Linear model

where y is the response variable, x is an m-dimensional sample vector, and w is the weight vector (vector of coefficients). Note that w0 represents the y-axis intercept of the model and therefore x0=1. Using the closed-form solution (normal equation), we compute the weights of the model as follows:


Closed form


2) Gradient Descent (GD)


Using the Gradient Decent (GD) optimization algorithm, the weights are updated incrementally after each epoch (= pass over the training dataset).

The cost function J(⋅), the sum of squared errors (SSE), can be written as:

J

The magnitude and direction of the weight update is computed by taking a step in the opposite direction of the cost gradient


dw

where η is the learning rate. The weights are then updated after each epoch via the following update rule:


w_upd

where Δw is a vector that contains the weight updates of each weight coefficient w, which are computed as follows:


w_upd explained

Essentially, we can picture GD optimization as a hiker (the weight coefficient) who wants to climb down a mountain (cost function) into a valley (cost minimum), and each step is determined by the steepness of the slope (gradient) and the leg length of the hiker (learning rate). Considering a cost function with only a single weight coefficient, we can illustrate this concept as follows:

GD optimization