Deep Learning Next Step: Transformers and Attention Mechanism

With the pervasive importance of NLP in so many of today's applications of deep learning, find out how advanced translation techniques can be further enhanced by transformers and attention mechanisms.

By Preet Gandhi.

From Alexa to Google Translate, one of the most impactful branches of Deep Learning is Natural Language Processing. Language translation has become an important necessity in this globalizing world. Advances in NLP have given rise to many neural machine translation techniques such as the Sequence-to-Sequence (Seq2Seq) models which can further be enhanced by transformers and attention mechanisms.

Seq2Seq Models

Seq2Seq models are a broad class of models that translate one sequence into another. Encoder-decoder models are a widely used subclass. The encoder takes in the input sequence (source language) and maps it to an intermediate hidden vector (in a higher-dimensional space) that encodes all the information of the source. This vector, in turn, is taken by the decoder and mapped to an output sequence (target language). The model can handle inputs of variable length. Encoders and decoders are both RNNs. Generally, an LSTM (Long Short-Term Memory) network is used because the data is sequence-dependent (the order of words matters): its input, forget, and output gates let it remember the important parts of the sequence and forget the unimportant ones, capturing long-distance dependencies. The output is the sequence that maximizes P(output sequence | input sequence). However, there is a problem with this approach: it compresses all the information of the input source sentence into a fixed-length vector (the output of the last hidden state), which is then handed to the decoder. It has been shown that this leads to a decline in performance when dealing with long sentences.
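To make the bottleneck concrete, here is a minimal pure-Python sketch. The "embeddings" and the recurrence are stand-ins, not a real LSTM; the point is only that, whatever the input length, the decoder receives a single fixed-size state vector.

```python
def encode(tokens, dim=4):
    """Fold a variable-length token sequence into one fixed-size state.
    Toy stand-in for an RNN encoder: fake hash-based embeddings and a
    simple exponential-average recurrence instead of LSTM gates."""
    state = [0.0] * dim
    for t in tokens:
        emb = [(hash((t, i)) % 100) / 100.0 for i in range(dim)]  # fake embedding
        state = [0.5 * s + 0.5 * e for s, e in zip(state, emb)]   # toy recurrence
    return state

short = encode(["hello", "world"])
long_ = encode(["one"] * 50)
# Same capacity for a 2-word and a 50-word sentence -- the bottleneck.
assert len(short) == len(long_) == 4
```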


Figure 1: Encoder and Decoder.

Attention Mechanism

An approach to solving the problem of losing relevant information in long sentences is the attention mechanism. The first word of the source sentence is probably highly correlated with the first word of the target sentence. Each time the model predicts an output word, it uses only the parts of the input where the most relevant information is concentrated, instead of the entire sentence. The encoder works as usual, but the decoder's hidden state is computed from a context vector, the previous output, and the previous hidden state. Context vectors are computed as a weighted sum of annotations generated by the encoder; there is a separate context vector for each target word. In the case of a bidirectional LSTM, these annotations are concatenations of the hidden states in the forward and backward directions. The weight of each annotation is computed by an alignment model (e.g., a feedforward network) that scores how well the inputs and the output match. The attention scores (alphas), the weights of the hidden states when computing the context vector, show how important a given annotation is in deciding the next state and generating the output word.
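The context-vector computation above can be sketched in a few lines of pure Python. The annotations and alignment scores below are made-up illustrative values; in a real model the scores would come from the learned alignment network.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def context_vector(annotations, scores):
    """Weighted sum of encoder annotations; the weights (alphas) are the
    softmaxed alignment scores for the current target word."""
    alphas = softmax(scores)
    dim = len(annotations[0])
    ctx = [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]
    return ctx, alphas

# Three encoder annotations (one per source word), 2-dimensional each.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 2.0, 0.1]          # alignment model strongly favors source word 2
ctx, alphas = context_vector(H, scores)
assert abs(sum(alphas) - 1.0) < 1e-9   # alphas form a distribution
assert alphas[1] == max(alphas)        # most weight on the best-matching word
```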


Figure 2: Attention mechanism in RNN.

Types of attention:

1) Global attention: Uses all hidden states from the encoder to compute the context vector. This is computationally costly, as all words on the source side are considered for each target word.

2) Local attention: Chooses a position in the source sentence to determine the window of words to consider.

3) Two-way attention: The same model attends to the hypothesis and premise, and both representations are concatenated. However, this model can’t differentiate that the alignment between stop words is less important than alignment between content words.

4) Self-attention: A mechanism relating different positions of a single sequence to compute its internal representation.

5) Key-value attention: Output vectors are separated into keys to calculate attention and values to encode the next-word distribution and context representation.

6) Hierarchical-nested attention: Two attention levels - first at the word level and second at the sentence level. This highlights the highly informative components of a document.
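As a concrete example of self-attention (type 4 above), here is a minimal pure-Python sketch of scaled dot-product attention where queries, keys, and values all come from the same sequence. Identity projections are assumed in place of the learned ones a real model would use.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def self_attention(X):
    """Scaled dot-product self-attention with identity projections:
    queries, keys, and values are all the rows of the same sequence X."""
    d = len(X[0])
    out = []
    for q in X:  # one query per position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        alphas = softmax(scores)
        # each output position is a weighted mix of every position's value
        out.append([sum(a * v[j] for a, v in zip(alphas, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X)
assert len(Y) == len(X) and len(Y[0]) == len(X[0])  # one new vector per position
```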

Attention also lets us interpret what the model is doing. By visualizing the attention weight matrix, we can see how the model aligns source and target words as it translates:


Figure 3: Visualizing the attention mechanism.

Attention is costly, as we need to calculate a value for each combination of input and output words. For character-level computations with sequences consisting of hundreds of tokens, the mechanism becomes expensive. Thanks to attention, the decoder captures global information rather than relying solely on one hidden state, and dependencies are learned between the inputs and outputs. The Transformer architecture extends this idea to learn intra-input and intra-output dependencies as well.


The Transformer is an encoder-decoder architecture that uses only the attention mechanism, instead of an RNN, to encode each position and to relate distant words in both the inputs and the outputs. Because it drops recurrence, it can be parallelized, which accelerates training. An RNN is sequential, so it takes ten computation steps to relate two words that are ten positions apart; with self-attention it is just one layer. The Transformer has multiple layers of self-attention in which the keys and values (vector representations of all the words in the sequence) and the queries (the vector representation of one word in the sequence) all come from the input sentence itself. The weights capture how each word of the sequence is influenced by all the other words in the sequence. This weight calculation can be run several times in parallel with different learned projections, which is called multi-head attention. Since there is no RNN, positional encodings are added to the embedded representations of the words to maintain order.
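The positional encoding idea can be sketched as follows. This is the fixed sinusoidal scheme from the original Transformer paper (even dimensions use sine, odd dimensions cosine, at geometrically spaced frequencies); learned positional encodings are another option, as noted later.

```python
import math

def positional_encoding(pos, dim):
    """Sinusoidal positional encoding for one position: even indices use
    sin, odd indices cos, with frequency falling off with the index."""
    pe = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# The encoding is added element-wise to the word embedding so that
# order information survives without any recurrence.
emb = [0.1, 0.2, 0.3, 0.4]               # toy word embedding
x = [e + p for e, p in zip(emb, positional_encoding(pos=3, dim=4))]
assert len(x) == len(emb)
assert positional_encoding(0, 4) == [0.0, 1.0, 0.0, 1.0]  # sin(0)=0, cos(0)=1
```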


Figure 4: Multi-head attention.

For multi-head attention, the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot product of the query with all the keys. This architecture uses a multiplicative attention function and computes multiple attention-weighted sums, each of which is a linear transformation of the input representation.
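A toy pure-Python sketch of multi-head attention follows. Real heads use learned projection matrices for queries, keys, and values; here a simple per-head scaling stands in for those projections, and the head outputs are concatenated position-wise as in the real architecture.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attend(Q, K, V):
    """Scaled dot-product attention: softmax(Q.K^T / sqrt(d)) applied to V."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        alphas = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(alphas, V)) for j in range(len(V[0]))])
    return out

def multi_head(X, n_heads=2):
    """Toy multi-head attention: each head applies its own (here: scalar)
    map to X before attending; head outputs are concatenated per position."""
    heads = []
    for h in range(n_heads):
        scale = 1.0 + h  # stand-in for a learned per-head projection
        Xh = [[scale * v for v in row] for row in X]
        heads.append(attend(Xh, Xh, Xh))
    return [sum((head[i] for head in heads), []) for i in range(len(X))]

X = [[1.0, 0.0], [0.0, 1.0]]
Y = multi_head(X)
assert len(Y) == 2 and len(Y[0]) == 4  # width doubled by concatenating 2 heads
```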

The encoding and decoding components are stacks of the same number of encoders and decoders, respectively. Each element of a stack has an identical structure but does not share weights with the others.

Figure 5.

Figure 6.

Each encoder has a self-attention layer followed by a feedforward layer. In the self-attention layer, the encoder aggregates information from all of the other words, generating a new representation per word informed by the entire context. The same feedforward network is applied independently to each position. At each position in the input sentence, self-attention looks at the other positions for clues that help better encode the word. The word in each position flows through its own path in the encoder, with dependencies between these paths in the self-attention layer; the feedforward layer has no such dependencies, which allows parallel execution. After each sub-layer of the encoder, there is a normalization step. A positional vector is added to each input embedding; it follows a specific pattern that helps the model know the distance between different words, or the position of each. The multi-head attention module that connects the encoder and decoder makes sure that the encoder input sequence is taken into account together with the decoder input sequence up to a given position.
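The two-sub-layer structure of an encoder can be sketched as below. Everything here is a toy stand-in: a simple averaging step plays the role of self-attention, a one-layer ReLU map plays the feedforward network (applied with the same weights at every position), and "norm" just centers each row; residual connections are omitted.

```python
def encoder_layer(X):
    """Toy encoder layer: a position-mixing step stands in for self-attention,
    then a per-position feedforward map, with a centering 'norm' after each
    sub-layer. Illustrates structure only, not the real computation."""
    def norm(rows):
        # center each position's vector (a crude stand-in for layer norm)
        return [[v - sum(r) / len(r) for v in r] for r in rows]

    # "self-attention": every position mixes in the average of all positions
    mean = [sum(r[j] for r in X) / len(X) for j in range(len(X[0]))]
    mixed = norm([[0.5 * v + 0.5 * m for v, m in zip(r, mean)] for r in X])

    # feedforward applied independently, same weights at every position
    ff = [[max(0.0, 2.0 * v + 0.1) for v in r] for r in mixed]  # toy ReLU layer
    return norm(ff)

X = [[1.0, 2.0], [3.0, 4.0]]
Y = encoder_layer(X)
assert len(Y) == len(X) and len(Y[0]) == len(X[0])  # shape is preserved
```

In the real encoder the mixing step is the self-attention described above, which is the only place where positions depend on each other; the feedforward sub-layer can therefore run in parallel across positions.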

Figure 7: Output probabilities.

The decoder has a self-attention layer at the beginning, followed by an encoder-decoder attention layer and a feedforward layer. The decoder input is shifted to the right by one position, with a start-of-sequence token as the first element, because we don't want the model to learn to copy the decoder input during training: otherwise the target word/character for position i would simply be the word/character at position i of the decoder input. Instead, we want the model to learn to predict the next word/character given the encoder sequence and the part of the decoder sequence it has already seen. Thus, by shifting the decoder input by one position, the model must predict the target word/character for position i having seen only the words/characters 1, …, i-1 of the decoder sequence. We append an end-of-sentence token to the decoder input sequence to mark its end; it is also appended to the target output sentence. The Transformer applies a mask to the input of the first multi-head attention module to avoid seeing potential 'future' sequence elements; without the mask, the multi-head attention would consider the whole decoder input sequence at each position. The output of one decoder layer is fed to the next, and position vectors are added at each step as well. The decoder output goes to a linear layer (a fully connected NN) that outputs a logits vector as wide as the vocabulary. Finally, a softmax turns the logits into probabilities, and the word with the highest probability is picked as the output of this time step.
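Both the causal mask and the final softmax step can be shown in a few lines. The scores, vocabulary, and logits below are made-up illustrative values; the mask works by setting scores for future positions to negative infinity so that softmax gives them exactly zero weight.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

# Causal mask: while decoding position i, attention may only see
# positions 0..i. Masked scores become -inf -> zero weight after softmax.
scores = [2.0, 1.0, 3.0, 0.5]   # raw attention scores over 4 decoder positions
i = 1                           # currently decoding position i
masked = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
alphas = softmax(masked)
assert alphas[2] == 0.0 and alphas[3] == 0.0  # no peeking at future positions

# Final step: the linear layer gives logits over the vocabulary;
# softmax converts them to probabilities and the argmax word is emitted.
vocab = ["the", "cat", "sat"]
logits = [0.2, 3.1, 1.0]
probs = softmax(logits)
assert vocab[probs.index(max(probs))] == "cat"
```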

Transformers are suited to sequence transduction (language translation), the classic language-analysis task of syntactic constituency parsing, coreference resolution, and different input and output modalities, such as images and video, so the potential of this architecture is huge. Its efficiency is an active research area, and could be improved by trying different positional encoding schemes (adding vs. concatenating with the word embeddings, learned vs. preset positional encodings, etc.).



Bio: Preet Gandhi was a data science student at NYU CDS and is an avid AI enthusiast.