# A Concise Overview of Standard Model-fitting Methods

A very concise overview of 4 standard model-fitting methods, focusing on their differences: closed-form equations, gradient descent, stochastic gradient descent, and mini-batch learning.

In order to explain the differences between alternative approaches to estimating the parameters of a model, let's take a look at a concrete example: Ordinary Least Squares (OLS) Linear Regression. The illustration below shall serve as a quick reminder to recall the different components of a simple linear regression model:

In Ordinary Least Squares (OLS) Linear Regression, our goal is to find the line (or hyperplane) that minimizes the vertical offsets. Or, in other words, we define the best-fitting line as the line that minimizes the sum of squared errors (SSE) or mean squared error (MSE) between our target variable (y) and our predicted output over all samplesÂ *i*Â in our dataset of sizeÂ *n*.

Now, we can implement a linear regression model for performing ordinary least squares regression using one of the following approaches:

- Solving the model parameters analytically (closed-form equations)
- Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton's Method, Simplex Method, etc.)

### 1) Normal Equations (closed-form solution)

The closed-form solution may (should) be preferred for "smaller" datasets -- if computing (a "costly") matrix inverse is not a concern. For very large datasets, or datasets where the inverse ofÂ XTXÂ may not exist (the matrix is non-invertible or singular, e.g., in case of perfect multicollinearity), the GD or SGD approaches are to be preferred. The linear function (linear regression model) is defined as:

whereÂ *y*Â is the response variable,Â *x*Â is anÂ *m*-dimensional sample vector, andÂ *w*Â is the weight vector (vector of coefficients). Note thatÂ *w0*Â represents the y-axis intercept of the model and thereforeÂ *x0=1*. Using the closed-form solution (normal equation), we compute the weights of the model as follows:

### 2) Gradient Descent (GD)

Using the Gradient Decent (GD) optimization algorithm, the weights are updated incrementally after each epoch (= pass over the training dataset).

The cost functionÂ *J(â‹…)*, the sum of squared errors (SSE), can be written as:

The magnitude and direction of the weight update is computed by taking a step in the opposite direction of the cost gradient

whereÂ *Î·*Â is the learning rate. The weights are then updated after each epoch via the following update rule:

whereÂ Î”wÂ is a vector that contains the weight updates of each weight coefficientÂ *w*, which are computed as follows:

Essentially, we can picture GD optimization as a hiker (the weight coefficient) who wants to climb down a mountain (cost function) into a valley (cost minimum), and each step is determined by the steepness of the slope (gradient) and the leg length of the hiker (learning rate). Considering a cost function with only a single weight coefficient, we can illustrate this concept as follows: