
**Train your Deep Learning model faster and sharper: Snapshot Ensembling, M models for the cost of 1** (KDnuggets, August 2017)

We explain a novel Snapshot Ensembling method for increasing accuracy of Deep Learning models while also reducing training time.

Deep neural networks have many learnable parameters that are used to make inferences. This often poses two problems: the model may not make very accurate predictions, and it takes a long time to train. This post talks about increasing accuracy while also reducing training time using two novel techniques.

The papers can be found here (**Snapshot ensembles**) and here (**FreezeOut**).

*This article assumes some familiarity with neural networks, including aspects like SGD, minima, optimisation, etc.*

Editor: this post describes Snapshot ensembles, and here is the second part, which explains FreezeOut.

Ensemble models are a **group of models** that work collectively to get the prediction. The idea is simple: **train several models** using different hyperparameters, and **average the predictions** from all these models. This technique gives a **great boost in accuracy** because it does not rely on a single model for prediction. Many winning entries in high-profile Machine Learning competitions have used ensembles.
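To make the averaging step concrete, here is a minimal sketch. It assumes each model outputs class probabilities; the function name `ensemble_predict` and the three probability arrays are made up for illustration and do not come from the paper.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class probabilities from several models and pick the top class."""
    # prob_list: list of arrays, each of shape (n_samples, n_classes)
    stacked = np.stack(prob_list, axis=0)   # (n_models, n_samples, n_classes)
    mean_probs = stacked.mean(axis=0)       # average over the model axis
    return mean_probs, mean_probs.argmax(axis=1)

# Hypothetical softmax outputs of three models on two samples
model_a = np.array([[0.6, 0.4], [0.3, 0.7]])
model_b = np.array([[0.4, 0.6], [0.2, 0.8]])
model_c = np.array([[0.7, 0.3], [0.6, 0.4]])

mean_probs, labels = ensemble_predict([model_a, model_b, model_c])
print(labels)
```

Note that model_b disagrees with the other two on the first sample and model_c is less sure on the second, yet the averaged prediction follows the majority in both cases.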

Training N different models requires N times the time needed to train a single model. Most people who don’t have the luxury of multiple GPUs will have to wait a long time before they can test these models, which makes experimenting much slower.

Before I tell you about the ‘novel’ approach, you must first understand the nature of Stochastic Gradient Descent (SGD). SGD is greedy: it will look for the steepest descent. However, there is one very crucial **parameter** that **governs** SGD: **the learning rate**.

If the **learning rate** is too high, SGD will **ignore** very **narrow crevices** (minima) and take large steps (think of a tank not being affected by a pothole on the road).

On the other hand, if the learning rate is **small**, SGD will **fall** into one of these **local minima** and not be able to come out of it. It is, however, possible to bring SGD back out of a local minimum by **increasing** the **learning rate**.
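The two cases above can be seen on a toy one-dimensional loss. This is only an illustrative sketch: the loss (a wide bowl centred at 0 with a narrow crevice at x = 2), the constants `CENTER` and `WIDTH`, and both learning rates are assumptions chosen for the demonstration, and plain gradient descent stands in for SGD.

```python
import math

CENTER, WIDTH = 2.0, 0.05  # hypothetical position and width of the narrow crevice

def loss(x):
    # Wide bowl plus a narrow Gaussian-shaped crevice
    return 0.05 * x**2 - math.exp(-(x - CENTER) ** 2 / (2 * WIDTH**2))

def grad(x):
    # Derivative of loss(x)
    d = x - CENTER
    return 0.1 * x + d / WIDTH**2 * math.exp(-d**2 / (2 * WIDTH**2))

def descend(lr, steps, x=4.0):
    """Plain gradient descent from x = 4.0 with a fixed learning rate."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_large = descend(lr=1.0, steps=200)      # large steps jump over the crevice
x_small = descend(lr=0.001, steps=20000)  # small steps settle inside the crevice
print(x_large, x_small)
```

With the large learning rate the iterate steps past the crevice and ends in the wide bowl near 0; with the small one it gets trapped near x = 2.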

The authors of the paper exploit this **controllable property** of SGD falling into and climbing out of local minima. **Different local minima** may have very **similar error rates**, but the **mistakes** that they make will be **different** from **each other**.

They have included a very useful diagram that explains this concept:

**Figure 1.0: Left: standard SGD trying to find the best local minimum. Right: SGD is made to fall into a local minimum, then brought back up, and the process is repeated. This way you get 3 local minima (labelled 1, 2, 3), each with similar error rates but with different error characteristics.**

The authors use the property of local minima having different ‘viewpoints’ on their predictions to create multiple models. Every time SGD reaches a **local minimum**, a **snapshot** of that model is **saved**, which will be part of the final ensemble of networks.

Instead of manually trying to figure out when to dive into a local minimum or when to jump out of it, the authors used a function to automate this process.

They used Learning Rate Annealing with the following function:
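The formula image is missing from this copy of the post. For reference, the cyclic cosine annealing schedule as given in the Snapshot Ensembles paper is reproduced below; it is taken from the paper rather than from this post, and t (the current iteration, counted from 1) is a symbol the post itself does not define.

$$
\alpha(t) = \frac{\alpha_0}{2}\left(\cos\left(\frac{\pi \,\operatorname{mod}\left(t-1,\ \lceil T/M \rceil\right)}{\lceil T/M \rceil}\right) + 1\right)
$$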

The formula may look complicated, but it's quite simple. They used a function that **decreases** monotonically within each cycle and then restarts. *α* here is the new learning rate, and α0 is the initial learning rate. **T** is the total **number** of training **iterations** you want to use (T should equal the number of mini-batches per epoch times the number of epochs). **M** is the **number** of **snapshots** you want in your ensemble.
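As a sketch of how such a schedule can be computed, the function below implements a cyclic cosine learning rate in terms of α0, T and M. The function and argument names (`snapshot_lr`, `total_iters`, `num_cycles`, `lr_max`) are illustrative choices, not the authors' code, and iterations are assumed to be counted from 1.

```python
import math

def snapshot_lr(t, total_iters, num_cycles, lr_max=0.1):
    """Cyclic cosine learning rate for 1-based iteration t.

    total_iters plays the role of T, num_cycles of M, lr_max of alpha_0.
    """
    cycle_len = math.ceil(total_iters / num_cycles)  # iterations per cycle
    pos = (t - 1) % cycle_len                        # position inside the current cycle
    return lr_max / 2 * (math.cos(math.pi * pos / cycle_len) + 1)

# Example: T = 300 iterations, M = 6 cycles of 50 iterations each
rates = [snapshot_lr(t, 300, 6) for t in range(1, 301)]
print(rates[0], rates[49], rates[50])
```

In this example the rate starts at 0.1, shrinks to almost zero by iteration 50, and jumps back to 0.1 at iteration 51, which is where the next cycle begins.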

**Figure 1.1:** M = 6 and budget = 300 epochs. The vertical dotted lines indicate model snapshots. After 300 epochs, a total of 6 models had been added to the ensemble.

Notice how the loss falls rapidly just before each snapshot. This is because the learning rate **decreases** **continuously**. After each snapshot, the learning rate is **restarted** (they used a value of 0.1). This pushes the optimisation path out of the local minimum, and the search for a new local minimum begins.
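The overall loop (anneal, save a snapshot, restart, then average the snapshots at test time) can be sketched on a toy problem. Everything here is an assumption for illustration: the model is a small logistic regression rather than a deep network, the data is synthetic, and the names `cyclic_lr` and `train_snapshots` are hypothetical. Because logistic regression has a convex loss, the snapshots end up close to each other, so this only shows the mechanics, not the diversity the paper relies on.

```python
import math
import numpy as np

def cyclic_lr(t, total_iters, num_cycles, lr_max):
    # Cosine annealing that restarts at the start of every cycle (t is 1-based)
    cycle_len = math.ceil(total_iters / num_cycles)
    return lr_max / 2 * (math.cos(math.pi * ((t - 1) % cycle_len) / cycle_len) + 1)

def train_snapshots(X, y, total_iters=600, num_cycles=3, lr_max=0.5, batch=16, seed=0):
    """Mini-batch SGD on logistic regression; save weights at the end of each cycle."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    cycle_len = math.ceil(total_iters / num_cycles)
    snapshots = []
    for t in range(1, total_iters + 1):
        idx = rng.integers(0, len(X), size=batch)   # sample a mini-batch
        p = 1.0 / (1.0 + np.exp(-X[idx] @ w))       # predicted probabilities
        grad = X[idx].T @ (p - y[idx]) / batch      # gradient of the log loss
        w -= cyclic_lr(t, total_iters, num_cycles, lr_max) * grad
        if t % cycle_len == 0:                      # learning rate is at its lowest
            snapshots.append(w.copy())              # save a snapshot before the restart
    return snapshots

# Synthetic, linearly separable data (hypothetical)
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

snaps = train_snapshots(X, y)
# Ensemble prediction: average the probabilities of all snapshots
probs = np.mean([1.0 / (1.0 + np.exp(-X @ w)) for w in snaps], axis=0)
accuracy = np.mean((probs > 0.5) == (y == 1))
print(len(snaps), accuracy)
```

The training budget is the same as for a single run of 600 iterations; the only extra cost is storing the three weight vectors and evaluating each of them at prediction time.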

I have included the numbers that the authors used to demonstrate the effectiveness of their method:

**Figure 1.2:** Error rates (%) on CIFAR-10, CIFAR-100, SVHN and Tiny ImageNet. Blue indicates the authors’ work, and bold indicates the best error rate for that category.

This is a useful strategy to get a marginal boost in accuracy at no additional training cost. The paper also discusses how varying parameters such as M and T affects performance.

Original. Reposted with permission.

**Bio:** Harshvardhan Gupta writes at HackerNoon.
