5 More arXiv Deep Learning Papers, Explained

Top recent deep learning papers on arXiv are presented, summarized, and explained with the help of a leading researcher in the field.


3. Speed Learning on the Fly

Authors: Pierre-Yves Massé, Yann Ollivier
Date posted to arXiv: 8 Nov 2015

Abstract (excerpt):

Here we propose to adapt the step size by performing a gradient descent on the step size itself, viewing the whole performance of the learning trajectory as a function of step size. Importantly, this adaptation can be computed online at little cost, without having to iterate backward passes over the full data.

Hugo's Two Cents (excerpt):

I think the authors are right on the money as to the challenges posed by online learning. I think these challenges are likely to be greater in the context of training neural networks online, for which few satisfactory solutions exist right now. So this is a direction of research I'm particularly excited about.

At this point, the experiments consider fairly simple learning scenarios, but I don't see any obstacle in applying the same method to neural networks. One interesting observation is that the results are fairly robust to variations of "the learning rate of the learning rate", compared to varying and fixing the learning rate itself.

Finally, I haven't had time to entirely digest one of their theoretical results, suggesting that their approximation actually corresponds to an exact gradient taken "alongside the effective trajectory" of gradient descent. However, that result seems quite interesting and would deserve more attention.
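To make the idea of "gradient descent on the step size itself" more concrete, here is a minimal sketch in Python. It is not the authors' algorithm: it swaps in a simpler one-step approximation (often called hypergradient descent), in which the derivative of the current loss with respect to the step size is taken to be minus the dot product of the current and previous gradients. The function names, the toy least-squares problem, and the parameters `eta0` and `meta_rate` (the "learning rate of the learning rate") are all assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def online_sgd_adaptive_step(samples, dim, eta0=0.05, meta_rate=0.01):
    """Online least squares where the step size eta is itself learned.

    Hypothetical helper: returns the final weights and the step-size history.
    """
    w = [0.0] * dim
    log_eta = math.log(eta0)
    prev_grad = [0.0] * dim
    eta_history = []
    for x, y in samples:
        residual = dot(w, x) - y
        grad = [residual * xi for xi in x]  # gradient of 0.5 * residual**2

        # Since w_t = w_{t-1} - eta * prev_grad, a one-step approximation gives
        # d(loss_t)/d(eta) ~= -(grad . prev_grad). We descend on log(eta) so
        # that eta stays positive: d(loss_t)/d(log eta) = eta * d(loss_t)/d(eta).
        eta = math.exp(log_eta)
        log_eta += meta_rate * eta * dot(grad, prev_grad)
        eta = math.exp(log_eta)

        # Ordinary SGD step using the freshly adapted step size.
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
        prev_grad = grad
        eta_history.append(eta)
    return w, eta_history


def make_samples(true_w, n, seed=0):
    """Noise-free toy data stream: y = true_w . x with x uniform in [-1, 1]."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        samples.append((x, dot(true_w, x)))
    return samples


if __name__ == "__main__":
    true_w = [3.0, -2.0, 1.0]
    w, etas = online_sgd_adaptive_step(make_samples(true_w, 3000), dim=3)
    print("learned weights:", w)
    print("first and last step size:", etas[0], etas[-1])
```

When consecutive gradients point the same way, the step size grows; when they oppose each other (a sign of overshooting), it shrinks. The update costs one extra dot product per sample, which is in the spirit of the "online at little cost" claim in the abstract, although the paper's own update rule is more elaborate than this sketch.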

4. Spatial Transformer Networks

Authors: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu
Date posted to arXiv: 5 Jun 2015

Abstract (excerpt):

In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process.

Hugo's Two Cents (excerpt):

While the work on DRAW (http://arxiv.org/abs/1502.04623) previously proposed a similar approach to learning transformations on images, this work goes significantly beyond DRAW and generalizes the approach to a much richer family of transformations. I also really like the idea of applying the spatial transformer modules within a CNN, something that wasn't in the DRAW paper.

I really don't have much negative to say about this work; it's really solid!

The only thing that comes to mind is that, in the CUB-200-2011 experiment, the authors used ImageNet pre-trained Inception networks to initialise their models. The only reason it's worth mentioning is that the test set of the CUB-200-2011 dataset actually contains images from the ImageNet training set. But fortunately, there are very few of those, so this doesn't change the overall analysis of the results. Still, I do find it interesting that, with such forms of transfer learning becoming increasingly common, it appears that we, as deep learning researchers, will need to start paying much more attention to such considerations in the future than we used to.
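As a rough illustration of what a spatial transformer does at sampling time, the sketch below builds an affine sampling grid and reads the input with bilinear interpolation, written in plain Python for readability. It makes several assumptions: the 2x3 affine matrix `theta` is passed in by hand (in the paper it is predicted from the feature map by a small localisation network), the input is a single-channel image stored as nested lists, coordinates are normalised to [-1, 1], and locations outside the image contribute zero. The function names are hypothetical.

```python
import math


def affine_grid(theta, height, width):
    """For each output pixel, compute the source location to sample from.

    theta is a 2x3 affine matrix acting on normalised coordinates in [-1, 1].
    """
    grid = []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1) if height > 1 else 0.0
        row = []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1) if width > 1 else 0.0
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sample `image` at the grid locations with a bilinear kernel."""
    h, w = len(image), len(image[0])
    output = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            # Convert normalised coordinates back to pixel coordinates.
            px = (xs + 1.0) * (w - 1) / 2.0
            py = (ys + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            value = 0.0
            # Only the four neighbouring pixels have non-zero weight.
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= xx < w and 0 <= yy < h:
                        weight = (max(0.0, 1.0 - abs(px - xx))
                                  * max(0.0, 1.0 - abs(py - yy)))
                        value += weight * image[yy][xx]
            out_row.append(value)
        output.append(out_row)
    return output


def spatial_transform(image, theta):
    """Apply the affine warp `theta` to `image`, keeping the same size."""
    grid = affine_grid(theta, len(image), len(image[0]))
    return bilinear_sample(image, grid)


if __name__ == "__main__":
    image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    flip = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # mirror left-right
    for out_row in spatial_transform(image, flip):
        print(out_row)
```

Because each output value is a weighted sum whose weights depend smoothly (almost everywhere) on the sampling coordinates, gradients can flow back both to the input and to `theta`. That property is what lets the module be trained with ordinary backpropagation; an automatic differentiation framework would handle it in practice, and this sketch covers the forward pass only.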

5. Clustering is Efficient for Approximate Maximum Inner Product Search

Authors: Alex Auvolat, Sarath Chandar, Pascal Vincent, Hugo Larochelle, Yoshua Bengio
Date posted to arXiv: 21 Jul 2015

Abstract (excerpt):

Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another extremely simple approach for solving approximate MIPS, based on variants of the k-means clustering algorithm.

Hugo's Two Cents (excerpt):

Update 2015/11/23: Since I first wrote this note, I became involved in the next iterations of this work, which became v2 of the arXiv manuscript. The notes below were made based on v1.

(Editor's note: link to version 1)

Since inner products are one of the main units of computation in neural networks, I'm very interested in MIPS as I suspect it could play an important role in scaling up neural networks. One example mentioned in the paper is that of approximating computations at the output layer of a neural network language model, corresponding to a softmax over a large number of units (as many as words in the vocabulary).

I find the combination of the "MIPS to MCSS" transformation with spherical clustering clever, cute and simple. Based on how good the results are compared to hashing, I find this direction of research quite compelling.
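The "MIPS to MCSS" idea can be sketched in a few lines. The version below uses one commonly described reduction, which is assumed here to match what the paper refers to: every database vector gets one extra coordinate that pads it to a common norm, and the query gets an extra zero. After that, the vector with the largest inner product is also the one with the largest cosine similarity, so a cosine-based method such as spherical k-means can be used. The function names and toy vectors are illustrative, and the brute-force searches stand in for the clustering-based search, which is not shown.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def augment_database(vectors):
    """Append one coordinate so every vector has the same norm (the max norm)."""
    max_sq_norm = max(dot(v, v) for v in vectors)
    return [list(v) + [math.sqrt(max(0.0, max_sq_norm - dot(v, v)))]
            for v in vectors]


def augment_query(query):
    """Append a zero so the query's inner products are unchanged."""
    return list(query) + [0.0]


def argmax_by(score, query, vectors):
    """Brute-force search: index of the vector with the highest score."""
    return max(range(len(vectors)), key=lambda i: score(query, vectors[i]))


if __name__ == "__main__":
    database = [[1.0, 0.0], [0.5, 0.5], [3.0, -1.0], [0.0, 2.0]]
    query = [1.0, 2.0]

    mips = argmax_by(dot, query, database)
    naive_mcss = argmax_by(cosine, query, database)
    reduced_mcss = argmax_by(cosine, augment_query(query),
                             augment_database(database))
    print("MIPS winner:", mips)
    print("cosine winner on raw vectors:", naive_mcss)
    print("cosine winner after the reduction:", reduced_mcss)
```

On the raw vectors, cosine similarity can favour a short vector that is well aligned with the query over a longer one with a larger inner product. Once all database vectors share the same norm, the cosine score is just the inner product divided by a constant, so the two rankings agree.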

I would like to thank Dr. Larochelle, not only for the fantastic summaries and insights that he has been producing for several months at this point, but also for being gracious enough to allow us to reproduce extended excerpts in this and the previous article. I hope that these notes, along with the original papers themselves, provide you with some additional comprehension of the often-difficult concepts that go along with deep learning research.