# 5 More arXiv Deep Learning Papers, Explained

Top recent deep learning papers on arXiv are presented, summarized, and explained with the help of a leading researcher in the field.

arXiv, maintained by Cornell University, is a popular open-access repository for academic paper preprints. It is an outlet for cutting-edge research in numerous scientific fields, including machine learning. Mirroring the current general trend in academia, much of the recently posted machine learning research is deep learning related.

Hugo Larochelle, PhD, is a Université de Sherbrooke machine learning professor (on leave), Twitter research scientist, noted neural network researcher, and deep learning aficionado. Since late summer 2015, he has been drafting and publicly sharing notes on arXiv machine learning papers that he has taken an interest in.

A previous KDnuggets article outlined and explained a selection of 5 arXiv machine learning papers that Hugo has read and shared notes on. In an effort to help us better understand new research, this article will present and summarize 5 additional arXiv papers, and will share excerpts from Hugo's notes in order to provide some additional perspective and critique. Links to all original papers, abstracts, and explanatory notes are also included. It is hoped that having top deep learning papers explained by a noted expert in the field will make some of the more complex aspects of the science more approachable.

**1. Infinite Dimensional Word Embeddings**

Authors: Eric Nalisnick, Sachin Ravi

Date posted to arXiv: 17 Nov 2015

Abstract (excerpt):

We describe a method for learning word embeddings with stochastic dimensionality. Our Infinite Skip-Gram (iSG) model specifies an energy-based joint distribution over a word vector, a context vector, and their dimensionality. By employing the same techniques used to make the Infinite Restricted Boltzmann Machine (Cote & Larochelle, 2015) tractable, we define vector dimensionality over a countably infinite domain, allowing vectors to grow as needed during training.
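To make the "countably infinite domain" idea more concrete, here is a minimal sketch that is not taken from the paper. It assumes a simplified, hypothetical energy E(w, c, z) = z log a - sum_{i<=z} w_i c_i with a > 1, and it assumes that dimensions beyond the current vector length are zero. Under those assumptions the infinitely many unused dimensions contribute a geometric series, so the distribution over the dimensionality z can be normalized exactly. The function name and the parameter `a` are illustrative only, and the paper's actual energy function and training approximations may differ.

```python
import math


def dimensionality_posterior(w, c, a=2.0):
    """Return (probs, tail) under the simplified energy described above.

    probs[z - 1] is p(z | w, c) for z = 1..len(w); tail is the total
    probability of all z > len(w), where both vectors are treated as zero.
    """
    if a <= 1.0:
        raise ValueError("a must exceed 1 so the tail series converges")
    l = len(w)

    # Negative energy for z = 1..l: running dot product minus z * log(a).
    neg_energy = []
    running = 0.0
    for z in range(1, l + 1):
        running += w[z - 1] * c[z - 1]
        neg_energy.append(running - z * math.log(a))

    # For z > l the dot product stops growing, so the terms form a geometric
    # series: sum_{z > l} a^(-z) * e^(S_l) = e^(S_l) * a^(-l) / (a - 1).
    tail_log = running - l * math.log(a) - math.log(a - 1.0)

    # Subtract the maximum before exponentiating for numerical stability.
    m = max(neg_energy + [tail_log])
    weights = [math.exp(e - m) for e in neg_energy]
    tail = math.exp(tail_log - m)
    total = sum(weights) + tail
    return [x / total for x in weights], tail / total
```

Calling `dimensionality_posterior([1.0, 2.0, 0.0], [1.0, 2.0, 0.0])` returns one probability per currently used dimension plus the mass left for growing the vector. In this toy setting, a large tail mass is the signal that more dimensions are wanted.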

Hugo's Two Cents (excerpt):

This is a quite original use of our "infinite dimensions" trick we introduced in the iRBM. It wasn't entirely "plug and play" either, and the authors had to be smart in the approximations they proposed for training the iSG.

The qualitative results showing how the conditional on the number of dimensions contains information about polysemy are really neat! One assumption behind distributed word embeddings is that they should be able to represent the multiple meanings of words using different dimensions, so it's nice to see that this is exactly what is being learned here.

I think the only thing missing in this paper is a comparison with regular skipgram, and perhaps other word embedding methods, on a specific task or on a word similarity task. In v2 of this paper, the authors do mention they are working on such results, so I'm looking forward to seeing those!

**2. Gradient-based Hyperparameter Optimization through Reversible Learning**

Authors: Dougal Maclaurin, David Duvenaud, Ryan P. Adams

Date posted to arXiv: 11 Feb 2015

Abstract (excerpt):

We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.

Hugo's Two Cents (excerpt):

This is one of my favorite papers of this year. While the method of unrolling several steps of gradient descent (100 iterations in the paper) makes it somewhat impractical for large networks (which is probably why they considered 3-layer networks with only 50 hidden units per layer), it provides an incredibly interesting window on what are good hyper-parameter choices for neural networks. Note that, to substantially reduce the memory requirements of the method, the authors had to be quite creative and smart about how to encode changes in the network's weights.
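As an illustration of what "unrolling" and "reversible" mean here, the following is a minimal sketch, not the authors' code. It trains a single scalar weight with momentum gradient descent on a hypothetical quadratic training loss, then walks the training run backwards: each step is undone to recover the earlier weight and velocity instead of storing them, while the gradient of the validation loss with respect to the step size and momentum is accumulated along the way. All constants and function names are assumptions made for the example. The sketch also simply trusts floating-point arithmetic on a short run, whereas the paper needs a more careful encoding because dividing by the momentum loses information at finite precision.

```python
CURVATURE = 2.0      # hypothetical training loss: 0.5 * CURVATURE * (w - TRAIN_TARGET)^2
TRAIN_TARGET = 3.0
VAL_TARGET = 2.5     # hypothetical validation loss: 0.5 * (w - VAL_TARGET)^2


def train_grad(w):
    return CURVATURE * (w - TRAIN_TARGET)


def val_loss(w):
    return 0.5 * (w - VAL_TARGET) ** 2


def forward(w, v, lr, gamma, steps):
    """Momentum gradient descent; returns the final weight and velocity."""
    for _ in range(steps):
        v = gamma * v - (1.0 - gamma) * train_grad(w)
        w = w + lr * v
    return w, v


def hypergradients(w0, lr, gamma, steps):
    """Gradient of the final validation loss w.r.t. lr and gamma.

    Only the final (w, v) is kept; earlier states are recovered by
    running the update rule in reverse.
    """
    w, v = forward(w0, 0.0, lr, gamma, steps)
    loss = val_loss(w)
    d_w = w - VAL_TARGET          # adjoint of the final weight
    d_v = d_lr = d_gamma = 0.0
    for _ in range(steps):
        d_lr += d_w * v                       # from w_next = w + lr * v_next
        w = w - lr * v                        # undo the weight update
        g = train_grad(w)
        v = (v + (1.0 - gamma) * g) / gamma   # undo the velocity update
        d_v += lr * d_w
        d_gamma += d_v * (v + g)              # from v_next = gamma*v - (1-gamma)*g
        d_w -= (1.0 - gamma) * d_v * CURVATURE  # CURVATURE is the Hessian here
        d_v *= gamma
    return loss, d_lr, d_gamma, w             # w should be back near w0
```

Calling `hypergradients(0.0, 0.1, 0.9, 20)` returns the validation loss, the two hypergradients, and the recovered starting weight. Comparing the hypergradients against finite differences of `val_loss(forward(...)[0])` is a simple sanity check on the reverse pass.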

There are tons of interesting experiments, which I encourage the reader to go check out (see section 3).

The experiment on "training the training set", i.e. generating the 10 examples (one per class) that would minimize the validation set loss of a network trained on these examples is a pretty cool idea (it essentially learns prototypical images of the digits from 0 to 9 on MNIST).
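The same idea can be shown on a deliberately tiny problem. The sketch below is an illustration under stated assumptions, not the paper's MNIST experiment: the "training set" is a single learnable scalar `x`, the model is a single weight fitted to it by plain gradient descent, and the outer loop adjusts `x` to reduce a squared-error validation loss. For brevity it tracks the sensitivity of the trained weight to `x` forward through training, rather than using the reverse-mode procedure from the paper. All names and constants are hypothetical.

```python
def train(x, lr=0.2, steps=15, w0=0.0):
    """Inner loop: fit w to the synthetic example x, tracking dw/dx."""
    w, dw_dx = w0, 0.0
    for _ in range(steps):
        # Gradient step on the training loss 0.5 * (w - x)^2.
        w = w - lr * (w - x)
        # Differentiate the update above with respect to x.
        dw_dx = (1.0 - lr) * dw_dx + lr
    return w, dw_dx


def learn_training_example(val_targets, x=0.0, outer_lr=0.5, outer_steps=200):
    """Outer loop: move x so the trained weight does well on validation."""
    for _ in range(outer_steps):
        w, dw_dx = train(x)
        # Derivative of the mean of 0.5 * (w - y)^2 over validation targets.
        d_val_dw = sum(w - y for y in val_targets) / len(val_targets)
        # Chain rule: d(val loss)/dx = d(val loss)/dw * dw/dx.
        x -= outer_lr * d_val_dw * dw_dx
    return x
```

With `val_targets=[1.0, 2.0, 3.0]`, the learned `x` ends up slightly above the validation mean. It has to compensate for the fact that 15 training steps do not fully reach the example, which is a small-scale analogue of learning a prototype that is tuned to the training procedure rather than copied from the data.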

Note that approaches like the one in this paper make tools for automatic differentiation incredibly valuable. Python autograd, the authors' automatic differentiation Python library (https://github.com/HIPS/autograd), which inspired our own Torch autograd (https://github.com/twitter/torch-autograd), was in fact developed in the context of this paper.