A Concise Overview of Standard Model-Fitting Methods
A very concise overview of four standard model-fitting methods, focusing on their differences: closed-form equations, gradient descent, stochastic gradient descent, and mini-batch learning.
3) Stochastic Gradient Descent (SGD)
In GD optimization, we compute the cost gradient based on the complete training set; hence, we sometimes also call it batch GD. For very large datasets, GD can be quite costly, since we take only a single step for one pass over the training set. Thus, the larger the training set, the slower our algorithm updates the weights, and the longer it may take until it converges to the global cost minimum (note that the SSE cost function is convex).
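As a minimal sketch of batch GD — assuming the linear model with the SSE cost discussed earlier in the article; the function name and default hyperparameters are hypothetical:

```python
import numpy as np

def batch_gd(X, y, eta=0.01, epochs=100):
    """Batch GD sketch: ONE weight update per full pass over the training set."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        output = X @ w           # predictions for ALL training samples
        errors = y - output      # (t - o) for every sample
        w += eta * X.T @ errors  # single accumulated update per epoch
    return w
```

Note that each epoch touches every sample but produces only one weight update — this is exactly why batch GD updates slowly on large training sets.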
In Stochastic Gradient Descent (SGD; sometimes also referred to as iterative or online GD), we don't accumulate the weight updates over all training samples as we've seen above for GD:

    Δw_j = η Σ_i (t^(i) − o^(i)) x_j^(i)

Instead, we update the weights after each individual training sample:

    Δw_j = η (t^(i) − o^(i)) x_j^(i)

Here, η is the learning rate, t^(i) the target value, o^(i) the model output, and x_j^(i) the j-th feature of the i-th training sample.
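The per-sample update can be sketched as follows — again assuming a linear model with the SSE cost; the function name and defaults are hypothetical:

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=100, seed=0):
    """SGD sketch: the weights are updated after EACH training sample."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):  # visit samples in shuffled order
            error = y[i] - X[i] @ w        # (t - o) for ONE sample
            w += eta * error * X[i]        # immediate, "stochastic" update
    return w
```

Each single-sample gradient is a noisy estimate of the full-batch gradient, which is where the zigzag path toward the minimum comes from.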
Here, the term "stochastic" comes from the fact that the gradient based on a single training sample is a "stochastic approximation" of the "true" cost gradient. Due to its stochastic nature, the path towards the global cost minimum is not "direct" as in GD, but may go "zigzag" if we visualize the cost surface in a 2D space. However, it has been shown that SGD almost surely converges to the global cost minimum if the cost function is convex (or pseudo-convex) [1]. Furthermore, there are different tricks to improve GD-based learning, for example:
- An adaptive learning rate η: choosing a decrease constant d that shrinks the learning rate over time:

    η_t = η_0 / (1 + t × d)
- Momentum learning: adding a factor α of the previous weight update to the current one for faster updates:

    Δw_t = η ∇J(w_t) + α Δw_{t−1}
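Both tricks can be sketched as small modifications of the plain SGD update loop — a minimal illustration assuming a linear model with the SSE cost; the function name, η_0, d, and α defaults are hypothetical:

```python
import numpy as np

def sgd_tricks(X, y, eta0=0.1, d=0.01, alpha=0.9, epochs=500, seed=0):
    """SGD with a decaying learning rate and momentum (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    velocity = np.zeros_like(w)  # running "momentum" of previous updates
    t = 0                        # global update counter
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            eta = eta0 / (1.0 + d * t)               # learning-rate decay
            grad = (y[i] - X[i] @ w) * X[i]          # per-sample gradient step
            velocity = eta * grad + alpha * velocity  # add previous update
            w += velocity
            t += 1
    return w
```

The shrinking learning rate damps the zigzag near the minimum, while the momentum term speeds up progress along directions where consecutive updates agree.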
A note about shuffling
There are several different flavors of SGD, which can all be seen throughout the literature. Let's take a look at the three most common variants:

- Scenario A: shuffle the training set once at the beginning, then iterate over it in that fixed order in every epoch
- Scenario B: reshuffle the training set at the start of each epoch
- Scenario C: at each iteration, draw a training sample at random with replacement
In scenario A [3], we shuffle the training set only once at the beginning, whereas in scenario B we shuffle the training set after each epoch to prevent repeating update cycles. In both scenario A and scenario B, each training sample is only used once per epoch to update the model weights.
In scenario C, we draw the training samples randomly with replacement from the training set [2]. If the number of iterations t is equal to the number of training samples, we learn the model based on a bootstrap sample of the training set.
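The three sampling schemes can be illustrated with NumPy's random-number generator (toy example; variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of training samples in this toy example

# Scenario A: shuffle once before training, reuse the same order every epoch
order_a = rng.permutation(n)
epochs_a = [order_a for _ in range(3)]  # identical order in all 3 epochs

# Scenario B: reshuffle at the start of every epoch
epochs_b = [rng.permutation(n) for _ in range(3)]

# Scenario C: draw t = n indices with replacement (a bootstrap sample)
bootstrap = rng.integers(0, n, size=n)  # duplicates and omissions possible
```

In scenarios A and B every index appears exactly once per epoch; in scenario C some samples may appear several times while others are skipped.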
4) Mini-Batch Gradient Descent (MBGD)
Mini-Batch Gradient Descent (MBGD) is a compromise between batch GD and SGD. In MBGD, we update the model based on smaller groups of training samples: instead of computing the gradient from a single sample (SGD) or from all n training samples (GD), we compute the gradient from 1 < k < n training samples (a common mini-batch size is k = 50).
MBGD converges in fewer iterations than GD because we update the weights more frequently; at the same time, unlike SGD, MBGD lets us utilize vectorized operations, which typically results in a computational performance gain over SGD.
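A minimal sketch of MBGD — same linear-model/SSE assumptions as before; the function name and defaults are hypothetical:

```python
import numpy as np

def minibatch_gd(X, y, eta=0.01, k=2, epochs=200, seed=0):
    """MBGD sketch: one vectorized weight update per mini-batch of k samples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))            # reshuffle each epoch
        for start in range(0, len(y), k):
            batch = idx[start:start + k]         # next mini-batch of k indices
            errors = y[batch] - X[batch] @ w     # vectorized over the batch
            w += eta * X[batch].T @ errors       # one update per mini-batch
    return w
```

The inner gradient computation is a single matrix-vector product over the mini-batch, which is the vectorization advantage over updating from one sample at a time.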
References
[1] Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations." Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6.
[2] Bottou, Léon (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent." Proceedings of COMPSTAT'2010. Physica-Verlag HD. 177-186.
[3] Bottou, Léon (2012). "Stochastic Gradient Descent Tricks." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg. 421-436.
Bio: Sebastian Raschka is a 'Data Scientist' and Machine Learning enthusiast with a big passion for Python & open source. Author of 'Python Machine Learning'. Michigan State University.
Original. Reposted with permission.