A Concise Overview of Standard Model-fitting Methods
A very concise overview of four standard model-fitting methods, focusing on their differences: closed-form equations, gradient descent, stochastic gradient descent, and mini-batch learning.
To explain the differences between alternative approaches to estimating the parameters of a model, let's take a look at a concrete example: Ordinary Least Squares (OLS) Linear Regression. The illustration below serves as a quick reminder of the different components of a simple linear regression model:
In Ordinary Least Squares (OLS) Linear Regression, our goal is to find the line (or hyperplane) that minimizes the vertical offsets. In other words, we define the best-fitting line as the line that minimizes the sum of squared errors (SSE) or, equivalently, the mean squared error (MSE) between our target variable (y) and our predicted output over all samples i in our dataset of size n.
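To make the two error measures concrete, here is a small NumPy sketch. The arrays `y` and `y_hat` are made-up values chosen for illustration, not data from this article.

```python
import numpy as np

# Hypothetical targets and predictions for n = 4 samples (illustrative values only)
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 1.5, 3.0, 5.0])

residuals = y - y_hat           # vertical offsets between targets and predictions
sse = np.sum(residuals ** 2)    # sum of squared errors
mse = sse / y.size              # mean squared error = SSE / n

print(sse, mse)
```

Because the MSE is just the SSE divided by the constant n, the same line minimizes both.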
Now, we can implement a linear regression model for performing ordinary least squares regression using one of the following approaches:
 Solving for the model parameters analytically (closed-form equations)
 Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton's Method, Simplex Method, etc.)
1) Normal Equations (closed-form solution)
The closed-form solution may (and often should) be preferred for "smaller" datasets, where computing a costly matrix inverse is not a concern. For very large datasets, or datasets where the inverse of X^T X may not exist (the matrix is non-invertible or singular, e.g., in the case of perfect multicollinearity), the GD or SGD approaches are to be preferred. The linear function (linear regression model) is defined as:
$$y = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{j=0}^{m} w_j x_j = \mathbf{w}^T \mathbf{x}$$
where y is the response variable, x is an m-dimensional sample vector, and w is the weight vector (vector of coefficients). Note that w_0 represents the y-axis intercept of the model and therefore x_0 = 1. Using the closed-form solution (normal equation), we compute the weights of the model as follows:
$$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
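The normal equation w = (X^T X)^{-1} X^T y can be sketched in a few lines of NumPy. The synthetic data (a true intercept of 2 and slope of 3 plus noise) and all variable names are assumptions made for this example. The sketch solves the linear system rather than forming the inverse explicitly, a common numerical choice that yields the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 2 + 3 * x1 + noise
n = 50
x1 = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that x0 = 1 and w[0] is the intercept
X = np.column_stack([np.ones(n), x1])

# Normal equation: solve (X^T X) w = X^T y instead of inverting X^T X
w = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver for comparison; it also copes with a singular X^T X
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w)
```

If X^T X is singular, `np.linalg.solve` raises `LinAlgError`, whereas `np.linalg.lstsq` still returns a minimum-norm least-squares solution.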
2) Gradient Descent (GD)
Using the Gradient Descent (GD) optimization algorithm, the weights are updated incrementally after each epoch (= pass over the training dataset).
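The cost function J(⋅), the sum of squared errors (SSE), can be written as:
$$J(\mathbf{w}) = \frac{1}{2} \sum_{i} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$
where y^{(i)} is the target and \hat{y}^{(i)} = \mathbf{w}^T \mathbf{x}^{(i)} is the predicted output for sample i (the factor 1/2 is a common convention that simplifies the gradient). The magnitude and direction of the weight update are computed by taking a step in the opposite direction of the cost gradient:
$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j}$$
where η is the learning rate. The weights are then updated after each epoch via the following update rule:
$$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}$$
where Δw is a vector that contains the weight updates of each weight coefficient w, which are computed as follows:
$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = -\eta \sum_{i} \left( y^{(i)} - \hat{y}^{(i)} \right) \left( -x_j^{(i)} \right) = \eta \sum_{i} \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$$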
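A minimal NumPy sketch of this batch update, assuming the cost J(w) = 1/2 * SSE, so that each weight update is η times the sum of the errors multiplied by the corresponding feature. The function name `fit_gd`, the synthetic data, and the values of `eta` and `n_epochs` are all assumptions made for illustration.

```python
import numpy as np

def fit_gd(X, y, eta=0.01, n_epochs=500):
    """Batch gradient descent for linear regression with cost J(w) = 1/2 * SSE."""
    w = np.zeros(X.shape[1])
    costs = []
    for _ in range(n_epochs):
        errors = y - X @ w                       # y_i - y_hat_i for every sample
        costs.append(0.5 * np.sum(errors ** 2))  # cost before this epoch's update
        delta_w = eta * (X.T @ errors)           # delta_w_j = eta * sum_i errors_i * x_ij
        w += delta_w                             # one weight update per epoch
    return w, costs

# Synthetic data (assumed for illustration): y = 2 + 3 * x1 + noise
rng = np.random.default_rng(1)
n = 100
x1 = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x1])            # x0 = 1 carries the intercept
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

w, costs = fit_gd(X, y, eta=0.01, n_epochs=500)
print(w, costs[0], costs[-1])
```

Note that every update uses the full training set. With this unscaled sum, too large a learning rate makes the cost grow instead of shrink, so `eta` usually needs tuning to the dataset size and feature scale.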
Essentially, we can picture GD optimization as a hiker (the weight coefficient) who wants to climb down a mountain (cost function) into a valley (cost minimum), and each step is determined by the steepness of the slope (gradient) and the leg length of the hiker (learning rate). Considering a cost function with only a single weight coefficient, we can illustrate this concept as follows:
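The hiker's descent can also be traced numerically. The sketch below assumes a toy single-weight cost J(w) = (w - 3)^2, chosen only for illustration. It records the position after every step, so you can see the steps shrink as the slope flattens near the minimum.

```python
def descend(w, eta, n_steps):
    """Gradient descent on the toy cost J(w) = (w - 3)**2 (hypothetical example)."""
    path = [w]
    for _ in range(n_steps):
        grad = 2.0 * (w - 3.0)   # dJ/dw, the slope at the current position
        w = w - eta * grad       # step against the slope, scaled by the learning rate
        path.append(w)
    return path

path = descend(w=0.0, eta=0.1, n_steps=20)

# Step lengths shrink as the hiker approaches the valley floor at w = 3
steps = [abs(b - a) for a, b in zip(path, path[1:])]
print(path[-1], steps[0], steps[-1])
```

In this toy setting, a larger `eta` lengthens every step. Once `eta` exceeds 1.0, each step overshoots the minimum by more than the previous distance to it, and the iterates diverge.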

