Silver Blog, Aug 2017

Train your Deep Learning model faster and sharper: Snapshot Ensembling — M models for the cost of 1

We explain a novel Snapshot Ensembling method for increasing accuracy of Deep Learning models while also reducing training time.



Deep neural networks have an enormous number of learnable parameters used to make inferences. This often poses a problem in two ways: the model may not make very accurate predictions, and it can take a long time to train. This post talks about increasing accuracy while also reducing training time using two very novel approaches.

The papers can be found here (Snapshot ensembles) and here (FreezeOut).

This article assumes some familiarity with neural networks, including concepts such as SGD, local minima, and optimisation.

Editor: this post describes Snapshot Ensembles, and here is the second part, which explains FreezeOut.

1. Snapshot Ensembling — M models for the cost of 1

Regular Ensemble Models

Ensemble models are a group of models that work collectively to produce a prediction. The idea is simple: train several models with different hyperparameters, and average their predictions at test time. This technique gives a great boost in accuracy because it does not rely on a single model. Most winning entries in high-profile machine learning competitions have used ensembles.
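For concreteness, here is a minimal PyTorch sketch of the test-time averaging step. It is not code from either paper; the `models` list and the `inputs` batch are assumed to exist already.

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, inputs):
    """Average the softmax outputs of several independently trained models."""
    probs = []
    for model in models:
        model.eval()
        with torch.no_grad():
            probs.append(F.softmax(model(inputs), dim=1))   # class probabilities per model
    avg_probs = torch.stack(probs).mean(dim=0)              # average across models
    return avg_probs.argmax(dim=1)                          # predicted class per example
```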

So what’s the problem?

Training N different models requires roughly N times the time needed to train a single model. People who don't have the luxury of multiple GPUs have to wait a long time before they can test these models, which makes experimentation much slower.

SGD Mechanics

Before I tell you about the 'novel' approach, you must first understand the nature of Stochastic Gradient Descent (SGD). SGD is greedy: it always moves in the direction of steepest descent. However, one very crucial parameter governs SGD: the learning rate.

If the learning rate is too high, SGD takes large steps and skips over very narrow crevices (minima); think of a tank not being affected by a pothole in the road.

On the other hand, if the learning rate is small, SGD falls into one of these local minima and cannot come back out of it. It is, however, possible to bring SGD back out of a local minimum by increasing the learning rate.
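To make the role of the learning rate concrete, recall the standard SGD update (a textbook formula, not something specific to the paper), where θ are the weights, α is the learning rate and L is the mini-batch loss:

\theta_{t+1} = \theta_t - \alpha \, \nabla_{\theta} L(\theta_t)

A large α means large steps that can carry SGD over (or out of) narrow minima; a small α means fine steps that settle into them.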

The trick?

The authors of the paper exploit this controllable property of SGD: falling into and climbing out of local minima. Different local minima may have very similar error rates, but the mistakes each one makes will be different from the others.

They have included a very useful diagram that explains this concept:

Figure 1.0: Left: standard SGD trying to find the single best local minimum. Right: SGD is made to fall into a local minimum, then brought back up, and the process is repeated. This way you get 3 local minima (labelled 1, 2, 3), each with a similar error rate but different error characteristics.

What is being ensembled, a.k.a. the snapshot?

The authors use the fact that different local minima have different 'viewpoints' on their predictions to create multiple models. Every time SGD reaches a local minimum, a snapshot of the model is saved, and these snapshots form the final ensemble of networks.
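In code, a snapshot is just a copy of the current weights. Below is a minimal sketch with names of my own choosing; the saved snapshots can later be loaded back into the network one at a time and combined with the same averaging shown in the `ensemble_predict` sketch above.

```python
import copy

snapshots = []   # will end up holding M copies of the weights

def take_snapshot(model):
    """Store a copy of the current weights; called whenever a learning-rate cycle bottoms out."""
    snapshots.append(copy.deepcopy(model.state_dict()))

def restore_snapshot(model, i):
    """Load the i-th snapshot back into a model of the same architecture."""
    model.load_state_dict(snapshots[i])
    return model
```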

Cyclic Cosine Annealing

Instead of manually deciding when to dive into a local minimum and when to jump out of it, the authors used a function to automate the process.

They used Learning Rate Annealing with the following function:

\alpha(t) = \frac{\alpha_0}{2}\left(\cos\left(\frac{\pi \cdot \mathrm{mod}(t-1,\ \lceil T/M \rceil)}{\lceil T/M \rceil}\right) + 1\right)

Simplified

The formula may look complicated, but it's quite simple: within each cycle it is a monotonically decreasing function. α(t) is the learning rate at iteration t, and α0 is the initial learning rate. T is the total number of training iterations you want to use (the number of mini-batch updates per epoch multiplied by the number of epochs). M is the number of snapshots you want in your ensemble.
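As a sanity check, here is a direct Python translation of that schedule. It is my own sketch of the formula, not the authors' code.

```python
import math

def cyclic_cosine_lr(t, alpha0, T, M):
    """Learning rate at iteration t (1-indexed) under cyclic cosine annealing."""
    cycle_len = math.ceil(T / M)   # iterations per cycle
    return (alpha0 / 2.0) * (math.cos(math.pi * ((t - 1) % cycle_len) / cycle_len) + 1.0)
```

With alpha0 = 0.1, M = 6 and a 300-epoch budget, the learning rate falls from 0.1 to roughly zero over each of the six cycles and then jumps straight back to 0.1, producing the sawtooth curve in Figure 1.1 below.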

 

Figure 1.1: M = 6 and a budget of 300 epochs. The vertical dotted lines indicate a model snapshot. After 300 epochs, a total of 6 models have been added to the ensemble.

Notice how the loss falls rapidly just before each snapshot; this is because the learning rate is decreasing continuously. After each snapshot, the learning rate is reset to its starting value (the authors used 0.1). This kicks the gradient path out of the local minimum, and the search for a new local minimum begins again.
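Putting the pieces together, here is a hedged sketch of such a training loop. It uses PyTorch's built-in CosineAnnealingWarmRestarts scheduler, which implements this kind of warm-restart cosine schedule; the model, data loader and loss function are assumed to be defined elsewhere.

```python
import copy
import torch

def train_snapshot_ensemble(model, loader, loss_fn, epochs=300, cycles=6, lr0=0.1):
    """Train one network but collect `cycles` snapshots along the way (sketch only)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0.9)
    iters_per_cycle = (epochs // cycles) * len(loader)   # roughly T / M in the paper's notation
    # Warm-restart cosine schedule: the LR decays towards zero within a cycle, then jumps back to lr0
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=iters_per_cycle)
    snapshots, step = [], 0
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
            scheduler.step()                              # advance the schedule every iteration
            step += 1
            if step % iters_per_cycle == 0:               # cycle finished: LR is at its minimum
                snapshots.append(copy.deepcopy(model.state_dict()))
    return snapshots
```

With epochs = 300 and cycles = 6 this mirrors the setup in Figure 1.1: six snapshots, each taken just as the learning rate bottoms out.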

Show me the numbers

I have included the numbers the authors used to demonstrate the effectiveness of their method.


Figure 1.2: Error rates (%) on CIFAR-10, CIFAR-100, SVHN and Tiny ImageNet. Blue indicates the authors' work, and bold indicates the best error rate in each category.

Conclusion

This is a useful strategy for getting a marginal boost in accuracy at no additional training cost. The paper also explores how varying parameters such as M and T affects performance.

Original. Reposted with permission.

Bio: Harshvardhan Gupta writes at HackerNoon.
