# Top 5 arXiv Deep Learning Papers, Explained

Top deep learning papers on arXiv are presented, summarized, and explained with the help of a leading researcher in the field.

arXiv, maintained by Cornell University, is a popular open-access repository of academic preprints. It is an outlet for cutting-edge research in numerous scientific fields, including machine learning. Mirroring the general trend in academia, much of the recently posted machine learning research is related to deep learning.

Hugo Larochelle, PhD, is a Université de Sherbrooke machine learning professor (on leave), Twitter research scientist, noted neural network researcher, and deep learning aficionado. Since late summer 2015, he has been drafting and publicly sharing notes on arXiv machine learning papers that he has taken an interest in. At the time of this writing he has shared notes on 10 papers.

A selection of 5 arXiv machine learning papers that Hugo has read and shared notes on follows. To help us better understand their content, each paper is presented with an overview of its abstract and an excerpt from Hugo's notes. The hope is that having top deep learning papers explained by a noted expert in the field will make some of the more complex aspects of the science more approachable.

**1. Training recurrent networks online without backtracking**

*Authors*: Yann Ollivier, Guillaume Charpiat

*Date posted to arXiv*: 28 Jul 2015

Abstract (excerpt): We introduce the "NoBackTrack" algorithm to train the parameters of dynamical systems such as recurrent neural networks. This algorithm works in an online, memoryless setting, thus requiring no backpropagation through time, and is scalable, avoiding the large computational and memory cost of maintaining the full gradient of the current state with respect to the parameters. [ ... ] Preliminary tests on a simple task show that the stochastic approximation of the gradient introduced in the algorithm does not seem to introduce too much noise in the trajectory, compared to maintaining the full gradient, and confirm the good performance and scalability of the Kalman-like version of NoBackTrack.

Hugo's two cents (excerpt): Online training of RNNs is a big, unsolved problem.

The current approach people use is to truncate backprop to only a few steps in the past, which is more of a heuristic.

This paper makes progress towards a more principled approach. I really like the "rank-one trick" of Equation 7, really cute! And it is quite central to this method too, so good job on connecting those dots!

The authors present this work as being preliminary, and indeed they do not compare with truncated backprop. I really hope they do in a future version of this work. Also, I don't think I buy their argument that the "theory of stochastic gradient descent applies".
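To give a feel for the "rank-one trick" Hugo mentions, here is a minimal sketch, assuming the trick amounts to the general random-sign identity: a sum of outer products can be replaced by a single outer product of randomly signed sums, and the replacement is correct on average. This is not a reproduction of the paper's Equation 7, which may differ in details such as scaling. The function names below are hypothetical and chosen only for illustration.

```python
import itertools


def outer(v, w):
    """Outer product of two vectors, returned as a nested list."""
    return [[vi * wj for wj in w] for vi in v]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def exact_sum(vs, ws):
    """Full sum of outer products, sum_i v_i (x) w_i (rank can be up to n)."""
    total = [[0.0] * len(ws[0]) for _ in vs[0]]
    for v, w in zip(vs, ws):
        total = add(total, outer(v, w))
    return total


def rank_one_estimate(vs, ws, signs):
    """Single outer product (sum_i s_i v_i) (x) (sum_j s_j w_j), signs in {-1, +1}."""
    v_mix = [sum(s * v[k] for s, v in zip(signs, vs)) for k in range(len(vs[0]))]
    w_mix = [sum(s * w[k] for s, w in zip(signs, ws)) for k in range(len(ws[0]))]
    return outer(v_mix, w_mix)


def average_over_all_signs(vs, ws):
    """Average the rank-one estimate over every possible sign assignment.

    Cross terms s_i * s_j with i != j cancel in the average, so the result
    equals exact_sum(vs, ws): the estimate is unbiased.
    """
    combos = list(itertools.product([-1, 1], repeat=len(vs)))
    total = [[0.0] * len(ws[0]) for _ in vs[0]]
    for signs in combos:
        total = add(total, rank_one_estimate(vs, ws, signs))
    return [[x / len(combos) for x in row] for row in total]
```

In an online setting one would draw a single random sign vector per step rather than averaging over all of them. The point of the sketch is only that a single rank-one object, which is cheap to store, can stand in for the full sum without introducing bias.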

**2. Semi-Supervised Learning with Ladder Network**

*Authors*: Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, Tapani Raiko

*Date posted to arXiv*: 9 Jul 2015

Abstract: We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pretraining. Our work builds on top of the Ladder network proposed by Valpola (2015) which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in various tasks: MNIST and CIFAR-10 classification in a semi-supervised setting and permutation invariant MNIST in both semi-supervised and full-labels setting.

Hugo's two cents (excerpt): What I find most exciting about this paper is its performance. On MNIST, with only 100 labeled examples, it achieves 1.13% error! That is essentially the performance of stacked denoising autoencoders, trained on the entire training set (though that was before ReLUs and batch normalization, which this paper uses)! This confirms a current line of thought in Deep Learning (DL) that, while recent progress in DL applied on large labeled datasets does not rely on any unsupervised learning (unlike at the "beginning" of DL in the mid 2000s), unsupervised learning might instead be crucial for success in low-labeled data regime, in the semi-supervised setting.

Unfortunately, there is one little issue in the experiments, disclosed by the authors: while they used few labeled examples for training, model selection did use all 10k labels in the validation set. This is of course unrealistic.
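The abstract's idea of minimizing "the sum of supervised and unsupervised cost functions" can be sketched in a few lines. This is a simplified illustration, not the Ladder network itself: it assumes a cross-entropy cost on the labeled examples and a squared reconstruction error on all examples. The `weight` parameter and the function names are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability the model assigns to the correct class."""
    return -math.log(probs[label])


def reconstruction_cost(clean, reconstructed):
    """Mean squared error between a clean vector and its reconstruction."""
    return sum((c - r) ** 2 for c, r in zip(clean, reconstructed)) / len(clean)


def combined_cost(labeled, pairs, weight=1.0):
    """Supervised cost plus weighted unsupervised cost.

    labeled: list of (predicted_probs, label) for the few labeled examples.
    pairs:   list of (clean_vector, reconstructed_vector) for all examples,
             labeled or not, since reconstruction needs no labels.
    weight:  assumed trade-off between the two terms (hypothetical).
    """
    supervised = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)
    unsupervised = sum(reconstruction_cost(c, r) for c, r in pairs) / len(pairs)
    return supervised + weight * unsupervised
```

Because both terms feed one scalar, a single backpropagation pass can train on labeled and unlabeled data together, which is what lets the approach skip layer-wise pretraining.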
