Must Know Tips for Deep Learning Neural Networks, Part 2
Deep learning is white hot research topic. Add some solid deep learning neural network tips and tricks from a PhD researcher.
By Xiu-Shen Wei, Nanjing University.
Editor's note: Yesterday we posted the first 4 tips & ticks of this fascinating post. Be sure to have a look before reading the second half.
5. Activation Functions
One of the crucial factors in deep networks is activation function, which brings the non-linearity into networks. Here we will introduce the details and characters of some popular activation functions and give advices later in this section.
The sigmoid non-linearity has the mathematical form
𝜎(x)=1/(1+e-x). It takes a real-valued number and “squashes” it into range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1).
Figures courtesy of Stanford CS231n.
In practice, the sigmoid non-linearity has recently fallen out of favor and it is rarely ever used. It has two major drawbacks:
Sigmoids saturate and kill gradients. A very undesirable property of the sigmoid neuron is that when the neuron's activation saturates at either tail of 0 or 1, the gradient at these regions is almost zero. Recall that during back-propagation, this (local) gradient will be multiplied to the gradient of this gate's output for the whole objective. Therefore, if the local gradient is very small, it will effectively “kill” the gradient and almost no signal will flow through the neuron to its weights and recursively to its data. Additionally, one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large then most neurons would become saturated and the network will barely learn.
Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g.,
x > 0 element wise in
f = wTx+b), then the gradient on the weights w will during back-propagation become either all be positive, or all negative (depending on the gradient of the whole expression f). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.
The tanh non-linearity squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid nonlinearity.
Rectified Linear Unit
The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function
f(x)=max(0,x), which is simply thresholded at zero.
There are several pros and cons to using the ReLUs:
1. (Pros) Compared to sigmoid/tanh neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero. Meanwhile, ReLUs does not suffer from saturating.
2. (Pros) It was found to greatly accelerate (e.g., a factor of 6 in ) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.
3. (Cons) Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e., neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.
Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function being zero when
x < 0, a leaky ReLU will instead have a small negative slope (of 0.01, or so). That is, the function computes
f(x) = αx if
x < 0 and
f(x) = x if
x ≥ 0, where α is a small constant. Some people report success with this form of activation function, but the results are not always consistent.
Nowadays, a broader class of activation functions, namely the rectified unit family, were proposed. In the following, we will talk about the variants of ReLU.
ReLU, Leaky ReLU, PReLU and RReLU. In these figures, for PReLU, αi is learned and for Leaky ReLU αi is fixed. For RReLU, αji is a random variable keeps sampling in a given range, and remains fixed in testing.
The first variant is called parametric rectified linear unit (PReLU) . In PReLU, the slopes of negative part are learned from data rather than pre-defined. He et al.  claimed that PReLU is the key factor of surpassing human-level performance on ImageNet classification task. The back-propagation and updating process of PReLU is very straightforward and similar to traditional ReLU, which is shown in Page. 43 of the slides.
The second variant is called randomized rectified linear unit (RReLU). In RReLU, the slopes of negative parts are randomized in a given range in the training, and then fixed in the testing. As mentioned in , in a recent Kaggle National Data Science Bowl (NDSB) competition, it is reported that RReLU could reduce overfitting due to its randomized nature. Moreover, suggested by the NDSB competition winner, the random αi in training is sampled from
1/U(3,8) and in test time it is fixed as its expectation, i.e.,
In , the authors evaluated classification performance of two state-of-the-art CNN architectures with different activation functions on the CIFAR-10, CIFAR-100 and NDSB data sets, which are shown in the following tables. Please note that, for these two networks, activation function is followed by each convolutional layer. And the
a in these tables actually indicates 1/α, where α is the aforementioned slopes.
From these tables, we can find the performance of ReLU is not the best for all the three data sets. For Leaky ReLU, a larger slope α will achieve better accuracy rates. PReLU is easy to overfit on small data sets (its training error is the smallest, while testing error is not satisfactory), but still outperforms ReLU. In addition, RReLU is significantly better than other activation functions on NDSB, which shows RReLU can overcome overfitting, because this data set has less training data than that of CIFAR-10/CIFAR-100. In conclusion, three types of ReLU variants all consistently outperform the original ReLU in these three data sets. And PReLU and RReLU seem better choices. Moreover, He et al. also reported similar conclusions in .