# A Concise Overview of Standard Model-fitting Methods

A very concise overview of 4 standard model-fitting methods, focusing on their differences: closed-form equations, gradient descent, stochastic gradient descent, and mini-batch learning.

In order to explain the differences between alternative approaches to estimating the parameters of a model, let's take a look at a concrete example: Ordinary Least Squares (OLS) Linear Regression. The illustration below shall serve as a quick reminder to recall the different components of a simple linear regression model:

In Ordinary Least Squares (OLS) Linear Regression, our goal is to find the line (or hyperplane) that minimizes the vertical offsets. Or, in other words, we define the best-fitting line as the line that minimizes the sum of squared errors (SSE) or mean squared error (MSE) between our target variable (y) and our predicted output over all samples *i* in our dataset of size *n*.

Now, we can implement a linear regression model for performing ordinary least squares regression using one of the following approaches:

- Solving the model parameters analytically (closed-form equations)
- Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton's Method, Simplex Method, etc.)

### 1) Normal Equations (closed-form solution)

The closed-form solution may (should) be preferred for "smaller" datasets -- if computing (a "costly") matrix inverse is not a concern. For very large datasets, or datasets where the inverse of XTX may not exist (the matrix is non-invertible or singular, e.g., in case of perfect multicollinearity), the GD or SGD approaches are to be preferred. The linear function (linear regression model) is defined as:

where *y* is the response variable, *x* is an *m*-dimensional sample vector, and *w* is the weight vector (vector of coefficients). Note that *w0* represents the y-axis intercept of the model and therefore *x0=1*. Using the closed-form solution (normal equation), we compute the weights of the model as follows:

### 2) Gradient Descent (GD)

Using the Gradient Decent (GD) optimization algorithm, the weights are updated incrementally after each epoch (= pass over the training dataset).

The cost function *J(⋅)*, the sum of squared errors (SSE), can be written as:

The magnitude and direction of the weight update is computed by taking a step in the opposite direction of the cost gradient

where *η* is the learning rate. The weights are then updated after each epoch via the following update rule:

where Δw is a vector that contains the weight updates of each weight coefficient *w*, which are computed as follows:

Essentially, we can picture GD optimization as a hiker (the weight coefficient) who wants to climb down a mountain (cost function) into a valley (cost minimum), and each step is determined by the steepness of the slope (gradient) and the leg length of the hiker (learning rate). Considering a cost function with only a single weight coefficient, we can illustrate this concept as follows: