Memory Complexity with Transformers




The key innovation in Transformers is the self-attention mechanism, which computes similarity scores for every pair of positions in an input sequence. Because these scores can be evaluated in parallel for all tokens, self-attention avoids the sequential dependency of recurrent neural networks and enables Transformers to vastly outperform previous sequence models such as LSTMs.
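To make this concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention (illustrative only: real implementations are batched, multi-headed, and trained end to end). Note the scores matrix of shape (n, n), one entry per pair of tokens; that matrix is exactly where the memory problem discussed below comes from.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # each (n, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n): one score per pair of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (n, d) context vectors

rng = np.random.default_rng(0)
n, d = 8, 4                                          # 8 toy tokens, 4-dim embeddings
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # (8, 4)
```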

There are plenty of in-depth explanations elsewhere, so here I’d like to share some example questions in an interview setting.

 

What’s the problem with running a transformer model on a book with 1 million tokens? What can be a solution to this problem?

 

Here are some tips for readers’ reference:

Simply put,


If you try to run a large transformer on a sequence that long, you simply run out of memory: self-attention has to store a score for every pair of tokens.


According to the Google Research Blog (2021):


A limitation of existing Transformer models and their derivatives is that the full self-attention mechanism has computational and memory requirements that are quadratic with the input sequence length. With commonly available current hardware and model sizes, this typically limits the input sequence to roughly 512 tokens, and prevents Transformers from being directly applicable to tasks that require larger context, like question answering, document summarization or genome fragment classification.
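A quick back-of-the-envelope calculation (assuming fp32 scores and looking at just one attention head in one layer) shows why full self-attention over a 1-million-token book is impractical:

```python
# Rough memory needed just to hold one (n x n) attention-score matrix in fp32.
# Assumption: a single head and a single layer; real models multiply this by
# the number of heads and layers, so the true cost is much larger.
BYTES_PER_SCORE = 4  # fp32

for n in (512, 65_536, 1_000_000):
    gib = n * n * BYTES_PER_SCORE / 2**30
    print(f"{n:>9,} tokens -> {gib:12,.1f} GiB")

# Output:
#       512 tokens ->          0.0 GiB
#    65,536 tokens ->         16.0 GiB
# 1,000,000 tokens ->      3,725.3 GiB   (~3.6 TiB for a single head/layer)
```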


Check the explanation by Dr. Younes Bensouda Mourri from Deeplearning.ai:


To solve the memory complexity problem of transformer models:


Two ‘reforms’ were made to the Transformer to make it more memory- and compute-efficient (together they form the Reformer model): reversible layers reduce memory usage, and locality-sensitive hashing (LSH) reduces the cost of dot-product attention for large input sizes.
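As a rough illustration of the LSH idea, and not the Reformer’s exact multi-round scheme, the sketch below hashes query/key vectors into buckets using random projections; attention then only needs to be computed among tokens that land in the same bucket.

```python
import numpy as np

def lsh_buckets(vectors, n_buckets, seed=0):
    """Hash vectors into buckets with random-projection (angular) LSH.

    Vectors pointing in similar directions tend to share a bucket, so attention
    can be restricted to within-bucket pairs instead of all n^2 pairs. This is
    a simplified, single-round sketch rather than a full Reformer implementation.
    """
    rng = np.random.default_rng(seed)
    d = vectors.shape[-1]
    proj = vectors @ rng.normal(size=(d, n_buckets // 2))              # (n, n_buckets/2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)  # bucket ids

rng = np.random.default_rng(1)
qk = rng.normal(size=(16, 64))            # 16 toy query/key vectors
print(lsh_buckets(qk, n_buckets=8))       # tokens sharing a bucket attend to each other
```

Because the largest attention weights come from the most similar query/key pairs anyway, restricting attention to within-bucket pairs approximates full attention at a fraction of the memory cost.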


Of course, there are other solutions, such as Extended Transformer Construction (ETC) and other sparse attention methods. We will cover more details in a later post!


Happy practicing!


Note: there are different angles from which to answer an interview question. The author of this newsletter does not try to find a reference that answers a question exhaustively. Rather, the author would like to share some quick insights and help readers think, practice, and do further research as necessary.

Source of images / good reads:
Paper: Improving Language Models by Retrieving from Trillions of Tokens, by DeepMind (2022)
Blog: Constructing Transformers For Longer Sequences with Sparse Attention Methods, by Google (2021)

Source of video/answers: Natural Language Processing with Attention Models, by Dr. Younes Bensouda Mourri from Deeplearning.ai

 
 
Angelina Yang is a data and machine learning senior executive with more than 15 years of experience delivering advanced machine learning solutions and capabilities to increase business value in the financial services and fintech industry. Her expertise includes AI/ML/NLP/DL model development and deployment in the areas of customer experience, surveillance, conversational AI, risk and compliance, marketing, operations, pricing, and data services.

 
Original. Reposted with permission.