Must Know Tips for Deep Learning Neural Networks
Deep learning is a white-hot research topic. Here are some solid deep learning neural network tips and tricks from a PhD researcher.
By Xiu-Shen Wei, Nanjing University.
Deep Neural Networks, especially Convolutional Neural Networks (CNN), allow computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in visual object recognition, object detection, text recognition and many other domains such as drug discovery and genomics.
In addition, many solid papers have been published on this topic, and some high-quality open source CNN software packages have been made available. There are also well-written CNN tutorials and CNN software manuals. However, there is a lack of a recent, comprehensive summary of the details of how to implement an excellent deep convolutional neural network from scratch. Thus, we collected and summarized many implementation details for DCNNs. Here we will introduce these implementation details, i.e., tricks or tips, for building and training your own deep networks.
Introduction
We assume you already have basic knowledge of deep learning; here we will present the implementation details (tricks or tips) of Deep Neural Networks, especially CNNs for image-related tasks, in eight main aspects:
 data augmentation
 preprocessing on images
 initializations of Networks
 some tips during training
 selections of activation functions
 diverse regularizations
 some insights found from figures
 methods of ensemble multiple deep networks
Additionally, the corresponding slides are available at [slide]. If there are any problems/mistakes in these materials and slides, or there is something important/interesting you think should be added, please feel free to contact me.
1. Data Augmentation
Since deep networks need to be trained on a huge number of training images to achieve satisfactory performance, if the original image data set contains only limited training images, it is better to do data augmentation to boost performance. Indeed, data augmentation becomes a must when training a deep network.
There are many ways to do data augmentation, such as the popular horizontal flipping, random crops and color jittering. Moreover, you could try combinations of multiple different processing methods, e.g., doing rotation and random scaling at the same time. In addition, you can try to raise saturation and value (the S and V components of the HSV color space) of all pixels to a power between 0.25 and 4 (the same power for all pixels within a patch), multiply these values by a factor between 0.7 and 1.4, and add to them a value between -0.1 and 0.1. Also, you could add a value in [-0.1, 0.1] to the hue (the H component of HSV) of all pixels in the image/patch.
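The simpler geometric augmentations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full pipeline; the function name `augment` and the `crop_size` parameter are our own, and color jittering in HSV space would typically be added on top using an image library.

```python
import numpy as np

def augment(img, crop_size, rng=None):
    """Randomly flip and crop one image of shape (H, W, C).

    A minimal sketch of two common augmentations (horizontal flip
    and random crop); names and defaults here are illustrative.
    """
    if rng is None:
        rng = np.random.default_rng()
    # horizontal flip with probability 0.5
    if rng.random() < 0.5:
        img = img[:, ::-1, :]
    # random crop of size crop_size x crop_size
    h, w, _ = img.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return img[top:top + crop_size, left:left + crop_size, :]
```

At training time you would call this on every image in each mini-batch, so the network sees a slightly different version of each image in every epoch.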
Krizhevsky et al. [1] proposed fancy PCA when training the famous AlexNet in 2012. Fancy PCA alters the intensities of the RGB channels in training images. In practice, you first perform PCA on the set of RGB pixel values throughout your training images. Then, for each training image, you add the following quantity to each RGB image pixel (i.e., I_{xy}=[I_{xy}^{R},I_{xy}^{G},I_{xy}^{B}]^{T}): [p_{1},p_{2},p_{3}][α_{1}λ_{1},α_{2}λ_{2},α_{3}λ_{3}]^{T}, where p_{i} and λ_{i} are the i-th eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, respectively, and α_{i} is a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Please note that each α_{i} is drawn only once for all the pixels of a particular training image until that image is used for training again. That is to say, when the model meets the same training image again, it will randomly produce another α_{i} for data augmentation. In [1], they claimed that fancy PCA could approximately capture an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination. On classification performance, this scheme reduced the top-1 error rate by over 1% in the ImageNet 2012 competition.
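The fancy PCA recipe above can be sketched in NumPy as follows. This is a hedged illustration of the formula, assuming images are stored as a float array of shape (N, H, W, 3); the function name `fancy_pca` and the `sigma` parameter are ours, not from [1].

```python
import numpy as np

def fancy_pca(images, sigma=0.1, rng=None):
    """Fancy PCA color augmentation, sketched for an array of images
    with shape (N, H, W, 3) and float values.

    For each image, adds [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every
    pixel, where p_i, l_i are eigenvectors/eigenvalues of the 3x3
    RGB covariance and a_i ~ N(0, sigma).
    """
    if rng is None:
        rng = np.random.default_rng()
    pixels = images.reshape(-1, 3)
    cov = np.cov(pixels - pixels.mean(axis=0), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # l_i and columns p_i
    out = np.empty_like(images)
    for i, img in enumerate(images):
        alpha = rng.normal(0.0, sigma, 3)    # one draw per image
        shift = eigvecs @ (alpha * eigvals)  # a 3-vector RGB offset
        out[i] = img + shift                 # broadcast over pixels
    return out
```

Note that `alpha` is redrawn each time the function is called, matching the behavior described above: the same image gets a different color shift each epoch.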
2. PreProcessing
Now we have obtained a large number of training samples (images/crops), but please do not hurry! Actually, it is necessary to do preprocessing on these images/crops. In this section, we will introduce several approaches for preprocessing.
The first and simplest preprocessing approach is to zero-center the data, and then normalize it, which can be done with two lines of Python code as follows:
>>> X -= np.mean(X, axis = 0) # zero-center
>>> X /= np.std(X, axis = 0) # normalize
where X is the input data (NumIns×NumDim). Another form of this preprocessing normalizes each dimension so that the min and max along the dimension are -1 and 1, respectively. It only makes sense to apply this preprocessing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (and in the range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.
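The min/max variant just described can be sketched in one short function; the name `minmax_scale` is ours, chosen for illustration.

```python
import numpy as np

def minmax_scale(X):
    """Scale each dimension (column) of X, shape (NumIns, NumDim),
    to the range [-1, 1]. A minimal sketch of the min/max variant;
    assumes each column has nonzero range (max > min).
    """
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0
```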
Another preprocessing approach similar to the first one is PCA Whitening. In this process, the data is first centered as described above. Then, you can compute the covariance matrix that tells us about the correlation structure in the data:
>>> X -= np.mean(X, axis = 0) # zero-center
>>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix
After that, you decorrelate the data by projecting the original (but zerocentered) data into the eigenbasis:
>>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
>>> Xrot = np.dot(X, U) # decorrelate the data
The last transformation is whitening, which takes the data in the eigenbasis and divides every dimension by the eigenvalue to normalize the scale:
>>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide by the square roots of the eigenvalues
Note that 1e-5 (or another small constant) is added to prevent division by zero. One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input. In practice this can be mitigated by stronger smoothing (i.e., increasing 1e-5 to a larger number).
Please note that we describe these preprocessing approaches here just for completeness. In practice, these transformations are not used with Convolutional Neural Networks. However, it is still very important to zero-center the data, and it is common to see normalization of every pixel as well.
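The zero-centering that is actually common with CNNs is per-pixel mean subtraction: compute the mean image over the training set once, then subtract it from every training and test image. A minimal sketch, with our own function name `subtract_mean_image`:

```python
import numpy as np

def subtract_mean_image(train, test):
    """Zero-center batches of images of shape (N, H, W, C) by
    subtracting the per-pixel mean image computed on the training
    set. The same mean must be used for the test set, so that
    train and test data pass through an identical transformation.
    """
    mean_img = train.mean(axis=0)   # shape (H, W, C)
    return train - mean_img, test - mean_img
```

A common simpler alternative (used e.g. with AlexNet-style pipelines) is subtracting a single per-channel mean rather than a full mean image.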