Custom Optimizer in TensorFlow
How to customize optimizers to speed up and improve the process of finding a (local) minimum of the loss function using TensorFlow.
By Benoit Descamps, BigData Republic.
Neural Networks play a very important role when modeling unstructured data, such as in language or image processing. The idea of such networks is to simulate the structure of the brain using nodes and edges with numerical weights processed by activation functions. The output of such a network is usually a prediction, such as a classification. This is achieved by optimizing a loss function on a given target.
In a previous post, we already discussed the importance of customizing this loss function, for the case of gradient boosting trees. In this post, we shall discuss how to customize the optimizers to speed up and improve the process of finding a (local) minimum of the loss function.
While the architecture of the Neural Network plays an important role when extracting information from data, most networks are optimized through update rules based on the gradient of the loss function.
The update rules are determined by the Optimizer. Performance and update speed may vary heavily from optimizer to optimizer. The gradient tells us the update direction, but not how big a step we should take. Short steps keep us on track, but it might take a very long time to reach a (local) minimum. Large steps speed up the process, but they might push us off course.
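To make the trade-off concrete, here is a minimal sketch of plain gradient descent on the toy function f(w) = w**2. The function, the starting point and the three learning rates are my own illustrative choices and are not taken from the experiments in this post.

```python
def descend(learning_rate, start=1.0, steps=50):
    """Run plain gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - learning_rate * gradient
    return w


# Short steps: moves toward the minimum at 0, but is still far away after 50 steps.
w_short = descend(learning_rate=0.01)
# Moderate steps: ends up essentially at the minimum.
w_medium = descend(learning_rate=0.4)
# Long steps: every update overshoots, so the iterate moves away from the minimum.
w_long = descend(learning_rate=1.1)

print(w_short, w_medium, w_long)
```

The same gradient is used in all three runs; only the step size differs, which is exactly the part an optimizer gets to decide.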
Research has been done into finding new optimizers, either by generating fixed numerical updates or algebraic rules.
Using a controller Recurrent Neural Network, a team found two interesting new types of optimizers, PowerSign and AddSign, which both perform well and require fewer resources than currently popular optimizers such as Adam.
Implementing Optimizers in TensorFlow
TensorFlow is a popular Python framework for implementing neural networks. While the documentation is very rich, it is often a challenge to find your way through it.
In this blog post, I shall explain how one could implement PowerSign and AddSign.
An optimizer consists of two important steps:
- compute_gradients() which updates the gradients in the computational graph
- apply_gradients() which updates the variables
Before running the TensorFlow Session, one should instantiate an Optimizer and call minimize() on it, for example tf.train.GradientDescentOptimizer(learning_rate).minimize(cost).
tf.train.GradientDescentOptimizer is a class which, as the name says, implements the gradient descent algorithm.
The method minimize() is called with the cost as its argument and internally runs the two methods compute_gradients() and then apply_gradients().
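The division of labour between the two steps can be sketched without TensorFlow. The class below is a hypothetical, framework-free stand-in that I wrote for illustration: ToyGradientDescent, grad_fn and the dictionary of variables are my own names, and the code is not TensorFlow's implementation. It only mirrors the idea that minimize() first builds (gradient, variable) pairs and then applies them.

```python
class ToyGradientDescent:
    """Framework-free illustration; NOT TensorFlow's Optimizer class."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def compute_gradients(self, grad_fn, variables):
        # Step 1: evaluate the gradients and pair each one with its variable name.
        grads = grad_fn(variables)
        return [(grads[name], name) for name in variables]

    def apply_gradients(self, grads_and_vars, variables):
        # Step 2: use the pairs to update the variables in place.
        for grad, name in grads_and_vars:
            variables[name] -= self.learning_rate * grad

    def minimize(self, grad_fn, variables):
        # minimize() is just the two steps chained together.
        grads_and_vars = self.compute_gradients(grad_fn, variables)
        self.apply_gradients(grads_and_vars, variables)


# Toy cost w**2 with gradient 2*w (an assumption made for this example).
variables = {"w": 3.0}
optimizer = ToyGradientDescent(learning_rate=0.1)
optimizer.minimize(lambda v: {"w": 2.0 * v["w"]}, variables)
print(variables["w"])
```

A custom optimizer only needs to change what happens in the second step, which is why the rest of this post concentrates on apply_gradients().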
For this post, and for the implementation of AddSign and PowerSign, we must have a closer look at this last step, apply_gradients().
This method relies on the (new) Optimizer (class), which we will create, to implement the following methods: _create_slots(), _prepare(), _apply_dense(), and _apply_sparse().
_create_slots() and _prepare() create and initialise additional variables, such as momentum.
_apply_dense() and _apply_sparse() implement the actual Ops, which update the variables. Ops are generally written in C++. Without having to change the C++ header yourself, you can still return a Python wrapper of some Ops through these methods.
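How these hooks fit together can be sketched in plain Python. The class below is a hypothetical analogue that I wrote for illustration, using classical momentum as the extra state; it borrows the method names _create_slots(), _prepare() and _apply_dense() but works on floats in a dictionary and is not the TensorFlow base class, whose real hooks operate on tensors and return Ops.

```python
class ToyMomentumOptimizer:
    """Plain-Python analogue of the hook structure; NOT TensorFlow code."""

    def __init__(self, learning_rate=0.1, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self._slots = {}  # one extra value ("slot") per trained variable

    def _create_slots(self, var_names):
        # Allocate the additional state once; keep it on later calls.
        for name in var_names:
            self._slots.setdefault(name, 0.0)

    def _prepare(self):
        # TensorFlow turns hyperparameters into tensors here; the toy just casts.
        self._lr_t = float(self.learning_rate)
        self._momentum_t = float(self.momentum)

    def _apply_dense(self, grad, name, variables):
        # The actual update rule: refresh the slot, then move the variable.
        velocity = self._momentum_t * self._slots[name] + grad
        self._slots[name] = velocity
        variables[name] -= self._lr_t * velocity

    def apply_gradients(self, grads_and_vars, variables):
        self._create_slots([name for _, name in grads_and_vars])
        self._prepare()
        for grad, name in grads_and_vars:
            self._apply_dense(grad, name, variables)


variables = {"w": 1.0}
optimizer = ToyMomentumOptimizer(learning_rate=0.1, momentum=0.9)
optimizer.apply_gradients([(2.0, "w")], variables)  # velocity becomes 2.0
optimizer.apply_gradients([(2.0, "w")], variables)  # velocity becomes 3.8
print(variables["w"])
```

In this sketch the only method that encodes the optimizer's behaviour is _apply_dense(); swapping its body is enough to obtain a different update rule.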
Let us now put everything together and show the implementation of PowerSign and AddSign.
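First, you need TensorFlow's internal modules for adding Ops, such as control_flow_ops, math_ops and state_ops from tensorflow.python.ops, together with the base class in tensorflow.python.training.optimizer.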
Let us now implement AddSign and PowerSign. Both optimizers are actually very similar and make use of the sign of the momentum m-hat and gradient g-hat for the update.
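For PowerSign, the update of the variables w_(n+1) at the (n+1)-th epoch rescales the gradient by a constant base raised to the power f_n times the product of the signs of g-hat and m-hat. The step is therefore enlarged when the gradient and the momentum agree in sign, and shrunk when they disagree.
In the implementation, the decay rate f_n is set to 1. I will not discuss it here, and I refer to the paper for more details.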
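As a framework-free sketch, and following my reading of the cited paper, the PowerSign rule for a single scalar weight is w_(n+1) = w_n - lr * alpha**(f_n * sign(g-hat) * sign(m-hat)) * g-hat. The names powersign_step, lr, alpha and beta are my own, the default alpha = e and the exponential moving average used for the momentum are assumptions taken from the paper's description, and this is not the TensorFlow implementation.

```python
import math


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def powersign_step(w, g, m, lr=0.001, alpha=math.e, beta=0.9, f_n=1.0):
    """One PowerSign update for a scalar weight w with gradient g.

    m is the running momentum; it is assumed here to be an exponential
    moving average of the gradients.
    """
    m_new = beta * m + (1.0 - beta) * g
    # alpha**(+f_n) when gradient and momentum agree, alpha**(-f_n) otherwise.
    scale = alpha ** (f_n * sign(g) * sign(m_new))
    w_new = w - lr * scale * g
    return w_new, m_new


# Gradient and momentum agree in sign: the step is scaled up by alpha.
w_agree, _ = powersign_step(w=1.0, g=2.0, m=0.0, lr=0.1)
# Gradient and momentum disagree in sign: the step is scaled down by 1/alpha.
w_disagree, _ = powersign_step(w=1.0, g=2.0, m=-10.0, lr=0.1)
print(w_agree, w_disagree)
```

With f_n fixed to 1, as in this post, the only effect of the sign product is to switch between multiplying and dividing the step by alpha.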
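AddSign is very similar to PowerSign: instead of using the product of the signs of g-hat and m-hat as an exponent, it adds f_n times that product to a constant before multiplying by the gradient.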
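Again as a framework-free sketch based on my reading of the cited paper, the AddSign rule for a single scalar weight is w_(n+1) = w_n - lr * (alpha + f_n * sign(g-hat) * sign(m-hat)) * g-hat. The names addsign_step, lr, alpha and beta are my own, and the default alpha = 1 and the exponential moving average for the momentum are assumptions; this is not the TensorFlow implementation.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def addsign_step(w, g, m, lr=0.001, alpha=1.0, beta=0.9, f_n=1.0):
    """One AddSign update for a scalar weight w with gradient g.

    m is the running momentum; it is assumed here to be an exponential
    moving average of the gradients.
    """
    m_new = beta * m + (1.0 - beta) * g
    # alpha + f_n when gradient and momentum agree, alpha - f_n otherwise.
    scale = alpha + f_n * sign(g) * sign(m_new)
    w_new = w - lr * scale * g
    return w_new, m_new


# Agreement in sign: with alpha = 1 and f_n = 1 the step is doubled.
w_agree, _ = addsign_step(w=1.0, g=2.0, m=0.0, lr=0.1)
# Disagreement in sign: with alpha = 1 and f_n = 1 the step vanishes.
w_disagree, _ = addsign_step(w=1.0, g=2.0, m=-10.0, lr=0.1)
print(w_agree, w_disagree)
```

With these defaults the weight does not move at all when the current gradient contradicts the momentum, which is a stronger damping than PowerSign applies in the same situation.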
Performance testing the Optimizers
The Rosenbrock function is a famous performance test for optimization algorithms. The function is non-convex and is defined as f(x, y) = (a - x)^2 + b (y - x^2)^2, commonly with a = 1 and b = 100.
The resulting shape is plotted in figure (1) below. As we can see, it has a minimum at x = 1 and y = 1.
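For reference, here is a small sketch of the Rosenbrock function and its gradient in plain Python, assuming the common parameter choice a = 1 and b = 100. The function names are my own.

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock function f(x, y) = (a - x)^2 + b * (y - x^2)^2."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def rosenbrock_grad(x, y, a=1.0, b=100.0):
    """Analytical gradient (df/dx, df/dy) of the Rosenbrock function."""
    df_dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    df_dy = 2.0 * b * (y - x ** 2)
    return df_dx, df_dy


# Both the function value and the gradient vanish at the minimum (1, 1).
print(rosenbrock(1.0, 1.0), rosenbrock_grad(1.0, 1.0))
```

The large factor b makes the function very steep across the curved valley y = x^2 and very flat along it, which is what makes the minimum hard to reach for gradient-based optimizers.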
The test script computes, at each epoch, the Euclidean distance between the true minimum and the approximation found by a given optimizer.
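The measurement itself can be sketched in plain Python as follows. This is a simplified stand-in rather than the script used for the results reported here: track_distance, gd_step, the starting point (-1, 1) and the learning rate are my own choices, and plain gradient descent takes the place of the TensorFlow optimizers.

```python
import math


def rosenbrock_grad(x, y, a=1.0, b=100.0):
    """Gradient of f(x, y) = (a - x)^2 + b * (y - x^2)^2."""
    return -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2), 2.0 * b * (y - x ** 2)


def track_distance(step_fn, start, epochs, minimum=(1.0, 1.0)):
    """Apply step_fn repeatedly and record the distance to the minimum."""
    x, y = start
    distances = [math.hypot(x - minimum[0], y - minimum[1])]
    for _ in range(epochs):
        x, y = step_fn(x, y)
        distances.append(math.hypot(x - minimum[0], y - minimum[1]))
    return distances


def gd_step(x, y, lr=1e-3):
    """One plain gradient descent step (stand-in for a real optimizer)."""
    gx, gy = rosenbrock_grad(x, y)
    return x - lr * gx, y - lr * gy


distances = track_distance(gd_step, start=(-1.0, 1.0), epochs=2000)
print(distances[0], distances[-1])
```

Any other update rule can be compared by passing a different step_fn, as long as it maps the current point (x, y) to the next one.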
A performance comparison of each optimizer is plotted below for a run of 4000 epochs.
While the performance varies heavily with the choice of hyperparameters, the extremely fast convergence of PowerSign is worth noting.
Below, the coordinates of the approximations have been plotted for several epochs.
|Epoch|RMSProp (x,y,z)|AddSign (x,y,z)|PowerSign (x,y,z)|
|---|---|---|---|
|0|(-2.39, -1.57, 4.26)|(-2.39, -1.57, 4.26)|(-2.39, -1.57, 4.26)|
|501|(0.66, 0.43, 0.13)|(0.41, 0.17, 0.34)|(0.97, 0.95, 0.0)|
|1001|(0.83, 0.67, 0.05)|(0.55, 0.30, 0.21)|(0.98, 0.96, 0.00)|
|2001|(0.93, 0.85, 0.03)|(0.69, 0.48, 0.09)|(0.98, 0.96, 0.00)|
|3001|(0.96, 0.92, 0.02)|(0.78, 0.60, 0.05)|(0.98, 0.97, 0.00)|
TensorFlow allows us to create our own optimizers. Recent research has delivered two promising new optimizers, PowerSign and AddSign.
The fast early convergence of PowerSign makes it an interesting optimizer to combine with others such as Adam.
- Additional information on PowerSign and AddSign is available in the arXiv paper “Neural Optimizer Search with Reinforcement Learning”, Bello et al., https://arxiv.org/abs/1709.07417.
- Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.
- I have found a lot of useful information through this Stack Overflow post, which I have attempted to bundle into this post.
Original. Reposted with permission.