Three Impactful Machine Learning Topics at ICML 2016
This post discusses three particularly impactful tutorial sessions from the recent ICML 2016 conference held in New York. Check out some innovative ideas on Deep Residual Networks, Memory Networks for Language Understanding, and Non-Convex Optimization.
By Robert Dionne, init.ai.
The International Conference on Machine Learning (ICML) is the leading international academic conference in machine learning, attracting 2,000+ participants. This year it was held in New York City, and I attended on behalf of Init.ai. Three of the tutorial sessions I attended were especially impactful; anyone working on conversational apps, chatbots, or deep learning will find these topics interesting.
- Deep Residual Networks: Deep Learning Gets Way Deeper by Kaiming He (slides)
- Memory Networks for Language Understanding by Jason Weston (slides)
- Recent Advances in Non-Convex Optimization and its Implications to Learning by Anima Anandkumar (slides)
Deep Residual Networks
I’ve written before about Residual Neural Network research, but hearing Kaiming He present it was informative. In the talk, he described motivations for increasing the depth of neural networks, demonstrated obstacles to increasing depth along with initial solutions, and showed how residual networks increase accuracy with depth beyond those initial solutions. He also justified using identity mappings in both the shortcut connection and the post-addition operation. Finally, he gave empirical results showing that ResNet representations generalize to many problems.
Kaiming showed how deeper neural networks had won recent ImageNet competitions. Yet, extending them beyond a depth of about twenty layers decreases performance.
A few techniques are enough to get this far. Careful weight initialization and batch normalization enable networks to train beyond ten layers.
Weight initialization reduces vanishing and exploding behavior in the forward and backward signals. For healthy propagation, the product of all layers’ scaled variances should remain constant, so each layer’s scaled variance should equal one. For a linear activation, one can use:
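The slide’s formula, reconstructed here in the standard Xavier/Glorot form (my reconstruction, not a verbatim copy; $n_l$ is the layer’s fan-in and $w_l$ its weights):

```latex
n_l \,\mathrm{Var}[w_l] = 1
\quad\Longleftrightarrow\quad
\mathrm{Var}[w_l] = \frac{1}{n_l}
```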
From slide 19.
For a rectified-linear (ReLU) activation, one can use:
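Reconstructed in the standard form of He initialization (again a reconstruction, with $n_l$ the fan-in):

```latex
\frac{1}{2}\, n_l \,\mathrm{Var}[w_l] = 1
\quad\Longleftrightarrow\quad
\mathrm{Var}[w_l] = \frac{2}{n_l}
```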
From slide 20.
For a rectified-linear network with 22 layers, initializing with the second equation converges faster. The same network with 30 layers requires the second form to progress at all. The second form makes sense because ReLU drops half of the input space.
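A minimal numpy sketch (my own, not from the slides) of why the factor of two matters: with the ReLU-aware scaling, signal variance stays roughly constant through many ReLU layers, while the linear-activation scaling halves it at every layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

def relu(x):
    return np.maximum(0.0, x)

def forward_variance(scale, depth=30):
    # Push a batch of unit-variance signals through `depth` ReLU layers
    # whose weights are drawn with variance scale / n.
    x = rng.standard_normal((1000, n))
    for _ in range(depth):
        w = rng.standard_normal((n, n)) * np.sqrt(scale / n)
        x = relu(x @ w)
    return x.var()

print(forward_variance(2.0))  # second form: variance stays on the order of 1
print(forward_variance(1.0))  # first form: variance collapses, roughly (1/2)^30
```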
Batch normalization rescales each layer for each minibatch. It reduces the training’s sensitivity to initial weights. For each layer and minibatch, one calculates the mean and standard deviation of inputs x. Then the layer rescales its input and applies a (component-wise) linear transformation with parameters γ and β.
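The per-minibatch computation can be sketched in a few lines of numpy (a toy training-time version; it omits the running statistics a real implementation keeps for test time):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Per-feature mean and standard deviation over the minibatch (axis 0).
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    x_hat = (x - mean) / (std + eps)
    # Component-wise linear transformation with learned parameters gamma, beta.
    return gamma * x_hat + beta

x = np.random.randn(32, 8) * 3.0 + 5.0  # minibatch of 32 examples, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# y now has roughly zero mean and unit variance per feature
```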
Despite these techniques, increasing depth another order-of-magnitude decreases performance. Yet by construction, one can trivially add identity layers to get a deeper net with the same accuracy.
Residual learning bypasses this barrier and improves accuracy with more layers.
To deepen another 10x, to roughly 1,000 layers, He replaces the post-addition mapping with the identity function: traditional ResNets applied ReLU after the addition, while deeper ResNets use the identity. He showed that several otherwise-reasonable post-addition activation functions introduce multiplicative behavior and reduce performance.
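A toy numpy sketch of such a block (my own notation: F is the residual branch, and the post-addition mapping is the identity). Setting the branch weights to zero makes the whole block an identity, which is exactly the construction that lets one add layers without losing accuracy:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    # Residual branch F(x) = W2 . relu(W1 . x); the shortcut adds x back,
    # and nothing is applied after the addition (identity post-add mapping).
    return w2 @ relu(w1 @ x) + x

n = 16
x = np.random.randn(n)
w1 = np.random.randn(n, n) * np.sqrt(2.0 / n)  # He initialization for the branch
w2 = np.zeros((n, n))    # a zero residual branch turns the block into an identity
assert np.allclose(residual_block(x, w1, w2), x)
```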
The identity activation smoothly propagates the signal from any earlier layer l to the L-th layer:
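Reconstructed in the notation of He et al.’s identity-mappings paper (with $\mathcal{F}$ the residual branch and $W_i$ its weights):

```latex
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)
```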
Similarly, it smoothly propagates error from all later layers L to the l-th layer:
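Reconstructed in the same notation (with $\mathcal{E}$ denoting the loss):

```latex
\frac{\partial \mathcal{E}}{\partial x_l}
= \frac{\partial \mathcal{E}}{\partial x_L}
\left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i) \right)
```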
To conclude, Kaiming showed results of transferring ResNet features learned on image classification. Using ResNet features for localization, detection, and segmentation tasks improves accuracy by 8.5%. Human pose estimation and depth estimation also transfer well, and ResNets show promise in image generation, natural language processing, speech recognition, and advertising tasks.
Here are two implementations Kaiming highlighted: