Understanding Backpropagation as Applied to LSTM

Backpropagation is one of those topics that seem to confuse many people once they move past feed-forward neural networks and on to convolutional and recurrent neural networks. This article gives you an overall process for understanding backpropagation by laying out its underlying principles.

By Abraham Kang & Jae Duk Seo

With special thanks to Kunal Patel and Mohammad-Mahdi Moazzami for reviewing this paper and providing feedback.

Editor's note: This is an excerpt from a very thorough and informative tutorial that the authors have made available to KDnuggets. While too lengthy to post the entire paper directly on our site, if you like what you see below and are interested in reading the entire tutorial, you can find the PDF here.

Backpropagation is one of those topics that seem to confuse many people (except in straightforward cases such as feed-forward neural networks). If you Google the topic, you get a number of articles that make a valiant effort to explain it, but the notation and pictures are sometimes confusing because of the way they orient the inputs and outputs, or because they name the same variables differently. Even details such as how forking within an LSTM is represented are assumed and omitted from the discussion. See, for example, https://towardsdatascience.com/back-to-basics-deriving-back-propagation-on-simple-rnn-lstm-feat-aidan-gomez-c7f286ba973d, https://medium.com/@aidangomez/let-s-do-this-f9b699de31d9, http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/, https://arxiv.org/abs/1610.02583, https://machinelearningmastery.com/gentle-introduction-backpropagation-time/, and https://www.coursera.org/lecture/nlp-sequence-models/backpropagation-through-time-bc7ED.

We are going to explain backpropagation through an LSTM RNN in a different way. We want to give you general principles for deciphering backpropagation through any neural network, and then apply those principles to LSTM (Long Short-Term Memory) RNNs (Recurrent Neural Networks). Let’s start with the notes we took to figure this stuff out.


Just kidding. This is just our thought process.  We will make it easier.

If you haven’t read Matt Mazur’s excellent "A Step by Step Backpropagation Example", please do so before continuing. It is still one of the best explanations of backpropagation out there, and it will make everything we talk about seem more familiar.

There are three principles that you have to understand in order to comprehend backpropagation in most neural networks: forward propagation in functional stages, the chain rule, and working in reverse through the functional stages and forks using partial derivatives.
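The three principles above can be sketched on a toy computation. This is a minimal illustration, not the authors' derivation: the stages, the fork (one value `h` feeding two branches), the constants, and all function names are assumptions chosen only to keep the arithmetic small. The backward pass multiplies local partial derivatives stage by stage (the chain rule) and sums the gradients arriving at the fork, then checks the result against a numerical estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, x, y):
    """Run the toy network forward, one functional stage at a time."""
    z = w * x                     # stage 1: input multiplied by a weight
    h = sigmoid(z)                # stage 2: activation
    u = 3.0 * h                   # fork, branch A (hypothetical)
    v = h * h                     # fork, branch B (hypothetical)
    out = u + v                   # the two branches join again
    loss = 0.5 * (out - y) ** 2   # final stage: squared-error loss
    return z, h, u, v, out, loss


def backward(w, x, y):
    """Work in reverse through the stages, applying the chain rule."""
    z, h, u, v, out, loss = forward(w, x, y)
    d_out = out - y                   # dLoss/dOut
    d_u = d_out * 1.0                 # out = u + v, so dOut/dU = 1
    d_v = d_out * 1.0                 # and dOut/dV = 1
    d_h = d_u * 3.0 + d_v * 2.0 * h   # fork: sum the gradients of both branches
    d_z = d_h * h * (1.0 - h)         # derivative of the sigmoid stage
    d_w = d_z * x                     # dZ/dW = x
    return d_w


def numeric_grad(w, x, y, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    loss_plus = forward(w + eps, x, y)[-1]
    loss_minus = forward(w - eps, x, y)[-1]
    return (loss_plus - loss_minus) / (2.0 * eps)


w, x, y = 0.5, 2.0, 1.0
analytic = backward(w, x, y)
numeric = numeric_grad(w, x, y)
print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)
```

If the gradient from either branch of the fork were left out of `d_h`, the two printed values would no longer agree, which is a quick way to catch a missed fork.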


Forward Propagation in Functional Stages

When you look at a neural network, the inputs are passed through functional stages to become outputs. Those outputs become the inputs to the next functional stage, which turns them into outputs in turn. This continues until the final output emerges at the end of the neural network. A functional stage is an operation, taken at the most granular level, that transforms some input into some output. Some examples of functional stages are the summing of inputs multiplied by weights, the convolution operation (summing over a structural subset of the input, where the patch values that constitute the weights are multiplied against a subset of the input having the same shape as the patch), passing input through an activation function, max pooling, softmax, and loss calculation. In order to work through backpropagation, you first need to be aware of all the functional stages that are part of forward propagation.
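To make the idea of functional stages concrete, here is a minimal sketch of a forward pass written so that every stage is a separate function and every stage output is recorded. The network shape, the parameter values, and the names (`forward`, `params`, `W1`, and so on) are hypothetical and chosen only for illustration; the stages themselves (weighted sum, activation, softmax, loss) are among those listed above.

```python
import math


def weighted_sum(inputs, weights, biases):
    """Stage: sum of inputs multiplied by weights, plus a bias, per unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def sigmoid(values):
    """Stage: element-wise activation function."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    """Stage: turn scores into probabilities (shifted for stability)."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Stage: loss for a single correct class."""
    return -math.log(probs[target_index])


def forward(x, params, target_index):
    """Run each stage in order and keep every intermediate output."""
    stages = []
    z1 = weighted_sum(x, params["W1"], params["b1"])
    stages.append(("weighted_sum_1", z1))
    a1 = sigmoid(z1)
    stages.append(("sigmoid", a1))
    z2 = weighted_sum(a1, params["W2"], params["b2"])
    stages.append(("weighted_sum_2", z2))
    probs = softmax(z2)
    stages.append(("softmax", probs))
    loss = cross_entropy(probs, target_index)
    stages.append(("loss", loss))
    return stages


# Hypothetical parameters for a 2-input, 2-hidden-unit, 2-class network.
params = {
    "W1": [[0.1, -0.2], [0.4, 0.3]],
    "b1": [0.0, 0.1],
    "W2": [[0.5, -0.5], [-0.3, 0.8]],
    "b2": [0.0, 0.0],
}

stages = forward([1.0, -1.0], params, target_index=0)
for name, value in stages:
    print(name, value)
```

Recording the output of each stage is a deliberate choice here: the backward pass visits the same stages in reverse order and needs those intermediate values to evaluate each partial derivative.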

Let’s look at LSTM.


To continue reading, download the PDF here.

Abraham Kang is fascinated with the nuanced details associated with machine learning algorithms, programming languages and their associated APIs. Kang has a B.S. from Cornell University. He has worked for various companies helping to drive AI, security, and development. He also worked as Principal Security Researcher for Fortify in their Software Security Research group. Prior to this, Abraham worked in application security for over 10 years. He is focused on AI/ML, application, framework, blockchain smart contracts, intelligent assistants, and mobile security and has presented his findings at Black Hat USA, DEFCON, OWASP AppSec USA, RSA USA and BSIDE.

Jae Duk Seo is a fourth-year computer scientist at Ryerson University.