Top 5 arXiv Deep Learning Papers, Explained

Top deep learning papers on arXiv are presented, summarized, and explained with the help of a leading researcher in the field.

3. Towards Neural Network-based Reasoning
Authors: Baolin Peng, Zhengdong Lu, Hang Li, Kam-Fai Wong
Date posted to arXiv: 22 Aug 2015

Abstract (excerpt): We propose Neural Reasoner, a framework for neural network-based reasoning over natural language sentences. Given a question, Neural Reasoner can infer over multiple supporting facts and find an answer to the question in specific forms. Neural Reasoner has 1) a specific interaction-pooling mechanism, allowing it to examine multiple facts, and 2) a deep architecture, allowing it to model the complicated logical relations in reasoning tasks. Assuming no particular structure exists in the question and facts, Neural Reasoner is able to accommodate different types of reasoning and different forms of language expressions. [ ... ] Our empirical studies show that Neural Reasoner can outperform existing neural reasoning systems with remarkable margins on two difficult artificial tasks (Positional Reasoning and Path Finding).

Hugo's two cents (excerpt): The most interesting aspect of this paper to me is probably the demonstration that the use of an auxiliary task such as "original", which is unsupervised, can substantially improve the performance, again for the path finding task. That is, to me, probably the most exciting direction of future research that this paper highlights as promising.

I also liked how the model is presented. It didn't take me much time to understand the model, and I actually found it easier to absorb than the Memory Network model, despite both being very similar. I think this model is indeed a bit simpler than Memory Networks, which is a good thing. It also suggests a different approach to the problem, one where the facts representations are also updated during forward propagation, not just the question's representation.

4. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Authors: Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer
Date posted to arXiv: 9 Jun 2015

Abstract (excerpt): Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning. The current approach to training them consists of maximizing the likelihood of each token in the sequence given the current (recurrent) state and the previous token. At inference, the unknown previous token is then replaced by a token generated by the model itself. This discrepancy between training and inference can yield errors that can accumulate quickly along the generated sequence. We propose a curriculum learning strategy to gently change the training process from a fully guided scheme using the true previous token, towards a less guided scheme which mostly uses the generated token instead. Experiments on several sequence prediction tasks show that this approach yields significant improvements.

Hugo's two cents (excerpt): Big fan of this paper. It both identifies an important flaw in how sequential prediction models are currently trained and, most importantly, suggests a solution that is simple yet effective. I also believe that this approach played a non-negligible role in Google's winner system for image caption generation, in the Microsoft COCO competition.

My alternative interpretation of why Scheduled Sampling helps is that ML training does not inform the model about the relative quality of the errors it can make. In terms of ML, it is as bad to put high probability on an output sequence that has just 1 token that's wrong, than it is to put the same amount of probability on a sequence that has all tokens wrong. Yet, say for image caption generation, outputting a sentence that is one word away from the ground truth is clearly preferable from making a mistake on a words (something that is also reflected in the performance metrics, such as BLEU).

By training the model to be robust to its own mistakes, Scheduled Sampling ensures that errors won't accumulate and makes predictions that are entirely off much less likely.

5. LSTM: A Search Space Odyssey
Authors: Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber
Date posted to arXiv: 13 Mar 2015

Abstract (excerpt): In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search and their importance was assessed using the powerful fANOVA framework. In total, we summarize the results of 5400 experimental runs (about 15 years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.

Hugo's two cents (excerpt): This was a very useful ready. I'd make it a required read for anyone that wants to start using LSTMs. First, I found the initial historical description of the developments surrounding LSTMs very interesting and clarifying. But more importantly, it presents a really useful picture of LSTMs that can both serve as a good basis for starting to use LSTMs and also an insightful (backed with data) exposition of the importance of each part in the LSTM.

The analysis based on an fANOVA (which I didn't know about until now) is quite neat. Perhaps the most surprising observation is that momentum actually doesn't seem to help that much. Investigating second order interaction between hyper-parameters was a smart thing to do (showing that tuning the learning rate and hidden layer jointly might not be that important, which is a useful insight).The illustrations in Figure 4, layout out the estimated relationship (with uncertainty) between learning rate / hidden layer size / input noise variance and performance / training time is also full of useful information.

Bio: Matthew Mayo is a computer science graduate student currently working on his thesis parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and an aspiring machine learning scientist.