Research Guide for Transformers

The problem with RNNs and CNNs is that they aren’t able to keep up with context and content when sentences are too long. This limitation has been solved by paying attention to the word that is currently being operated on. This guide will focus on how this problem can be addressed by Transformers with the help of deep learning.

By Derrick Mwiti, Data Scientist on October 30, 2019 in BERT, NLP, Research, Transformer, ULMFiT

comments

Transformers are a type of neural network used in neural machine translation, which mainly involves tasks that transform an input sequence to an output sequence. Such tasks include speech recognition and text-to-speech transformation, just to mention a few.

These kinds of tasks require memory—the upcoming sentence has to work with some context from the previous sentence. This is quite critical so as not to lose any important context between sentences.

Until recently, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been used to tackle this challenge. The problem with these is that they aren’t able to keep up with context and content when sentences are too long. This limitation has been solved by paying attention to the word that is currently being operated on. This guide will focus on how this problem can be addressed by Transformers with the help of deep learning.

Attention Is All You Need (2017)

Authors of this paper propose a network architecture — the Transformer — that’s solely based on attention mechanisms. The model achieves 28.4 BLEU (Bilingual Evaluation Understudy) on the WMT 2014 English to-German translation task. The Transformer’s transduction model uses self-attention to compute representations of its input and output without using convolution or sequenced-aligned RNNs.

Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an...

The majority of neural sequence transduction models have an encoder-decoder model. The Transformer uses the same model with the incorporation of self-attention. Fully connected layers are used for both the encoder and decoder. The encoder is made up of a stack of 6 identical layers, with each layer having 2 sub-layers. The first sub-layer is a multi-head self-attention mechanism, while the second is a position-wise fully connected feed-forward network. There is a residual connection around each of the two sub-layers. This is then followed by a normalization layer.

The decoder also has 6 identical layers with two sub-layers. The decoder incorporates a third sub-layer which conducts multi-head attention over the output of the encoder stack. Each of the sub-layers is also surrounded by residual connections followed by layer normalization. In order to prevent positions from attending to subsequent positions, the self-attention layer in the decoder stack is modified.
The Attention function involves mapping a query and a set of key-value pairs to an output. The query, keys, values, and output are all vectors. The weighted sum of the values forms the output. The weight that’s assigned to each value is calculated by a compatibility function of the query with the corresponding key.

Research Guide for Transformers

Attention Is All You Need (2017)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (ACL 2019)

XLNet: Generalized Autoregressive Pretraining for Language Understanding (2019)

Entity-aware ELMo: Learning Contextual Entity Representation for Entity Disambiguation (2019)

Universal Language Model Fine-tuning for Text Classification (ULMFiT) (2018)

Universal Transformers (ICLR 2019)

Conclusion

More On This Topic

Latest Posts

Top Posts