The Evolution of Tokenization – Byte Pair Encoding in NLP

Though we have SOTA algorithms for tokenization, it's always a good practice to understand the evolution trail and learning how have we reached here. Read this introduction to Byte Pair Encoding.

comments

By Harshit Tyagi, Data Science Instructor | Mentor | YouTuber

NLP may have been a little late to the AI epiphany but it is doing wonders with organisations like Google, OpenAI releasing state-of-the-art(SOTA) language models like BERT and GPT-2/3 respectively.

GitHub Copilot and OpenAI codex are among a few very popular applications that are in the news. As someone who has very limited exposure to NLP, I decided to take up NLP as an area of research and the next few blogs/videos will be me sharing what I learn after dissecting some important components of NLP.

NLP systems have three main components that help machines understand natural language:

Tokenization
Embedding
Model architectures

Top Deep Learning models like BERT, GPT-2, or GPT-3 all share the same components but with different architectures that distinguish one model from another.

In this newsletter(and notebook), we are going to focus on the basics of the first component of an NLP pipeline which is tokenization. An often overlooked concept but it is a field of research in itself. We have come so far ahead of the traditional NLTK tokenization process.

Though we have SOTA algorithms for tokenization, it's always a good practice to understand the evolution trail and learning how have we reached here.

So, here's what we'll cover:

What is tokenization?
Why do we need a tokenizer?
Types of tokenization - Word, Character, and Subword.
Byte Pair Encoding Algorithm - a version of which is used by most NLP models these days.

The next part of this tutorial will dive into more advanced(or enhanced version of BPE) algorithms:

Unigram Algorithm
WordPiece - BERT transformer
SentencePiece - End-to-End tokenizer system

What is Tokenization?

Tokenization is the process of representing raw text in smaller units called tokens. These tokens can then be mapped with numbers to further feed to an NLP model.

Here's an overly simplified example of what a tokenizer does:

## read the text and enumerate the tokens in the text
text = open('example.txt', 'r').read(). # read a text file

words = text.split(" ") # split the text on spaces

tokens = {v: k for k, v in enumerate(words)} # generate a word to index mapping

Here, we have simply mapped every word in the text to a numerical index. Obviously, this is a very simple example and we have not considered grammar, punctuations, compound words(like test, test-ify, test-ing, etc.).

Thus, we need a more technical and accurate definition of tokenization. To take into account every punctuation and related word, we need to start working at the character level.

There are multiple applications of tokenization. One of the use cases comes from compiler design where we need to parse computer programs to convert raw characters into keywords of a programming language.

In deep learning, tokenization is the process of converting a sequence of characters into a sequence of tokens which further needs to be converted into a sequence of numerical vectors that can be processed by a neural network.

Why do we need a Tokenizer?

The need for a tokenizer has protruded from the question "How can we make machines read?"

One of the common ways of processing textual data is by defining a set of rules in a dictionary and then looking up that fixed dictionary of rules. But this method can only go so far and we want machines to learn these rules from the text that is read by it.

Now, machines don't know any language, nor do they understand sound or phonetics. They need to be taught from scratch and in such a way that they could read any language possible.

Quite a task, right?

Humans learn a language by connecting sound to the meaning and then we learn to read and write in that language. Machines can't do that, so they need to be given the most basic units of text to start processing it.

That's where tokenization comes into play. Break down the text into smaller units called "tokens".

And there are different ways of tokenizing text which is what we'll learn now.

Tokenization policies - simple ways to tokenize

To make the deep learning model learn from the text, we need a two-step process:

Tokenize - decide the algorithm to be used to generate the tokens.
Encode the tokens to vectors

As the first step suggests, we need to decide how to convert text into small tokens. A simple and straight forward method that most of us would propose is the word-based tokens, splitting the text by space.

Problems with Word tokenizer

the risk of missing words in the training data: with word tokens, your model won't recognize the variants of words that were not part of the data on which the model was trained. So, if your model has seen foot and ball in the training data but the final text has football, the model won't be able to recognize the word and it will be treated with <UNK> token. Similarly, punctuations pose another problem, let or let's will need individual tokens and it is an inefficient solution. This will require a huge vocabulary to make sure you've every variant of the word. Even if you add a lemmatizer to solve this problem, you're adding an extra step in your processing pipeline.
Handling slang and abbreviations- another problem is the use of slang and abbreviations in texts these days such as "FOMO", "LOL", "tl;dr" etc. What do we do for these words?
What if the language doesn't use space for segmentation: for a language like Chinese, which doesn't use spaces for word separation, this tokenizer will fail completely.

After encountering these problems, researchers looked into another approach which was tokenizing all the characters.

Character-based tokenization

To resolve the problems associated with word-based tokenization, an alternative approach of character-by-character tokenization was tried.

This did solve the problem of missing words as now we are dealing with characters that can be encoded using ASCII or Unicode and it could generate embedding for any word now.

Every character, be it space or apostrophes or colons, can now be assigned a symbol to generate a sequence of vectors.

But this approach had its own cons.

Drawbacks of character-based models

Requirement of more compute: character-based models will treat each character as tokens and more tokens mean more input computations to process each token which in turn requires more compute resources. For a 5-word long sentence, you may need to process 30 tokens instead of 5 word-based tokens.
Narrows down the number of NLP tasks and applications: with long sequences of characters, only a certain type of neural network architecture can be used. This puts a limitation on the type of NLP tasks we can perform. For applications like Entity recognition or text classification, character-based encoding might turn out to be an inefficient approach.
Risk of learning incorrect semantics: working with characters could generate incorrect spellings of words. Also, with no inherent meaning, learning with characters is like learning with no meaningful semantics.

What's fascinating is that for such a seemingly simple task, mutliple algorithms are written to find the optimal tokenization policy.

After understanding the pros and cons of these tokenization methods, it makes sense to look for an approach that offers a middle route i.e. preserve the semantics with limited vocabulary that can generate all the words in the text on merging.

Subword Tokenization

With character-based models, we risk losing the semantic features of the word and with word-based tokenization, we need a very large vocabulary to encompass all the possible variations of every word.

So, the goal was to develop an algorithm that could:

Retain the semantic features of the token i.e. information per token.
tokenize without demanding a very large vocabulary with a finite set of words.

To solve this problem, we could think of breaking down the words based on a set of prefixes and suffixes. For example, we can write a rule-based system to identify subwords like "##s", "##ing", "##ify", "un##" etc., where the position of double hash denotes prefix and suffixes.

So, a word like "unhappily" is tokenized using subwords like "un##", "happ", and "##ily".

The model only learns a few subwords and then puts them together to create other words. This solves the problem of memory requirement and effort to create a large vocabulary.

Problems with this algorithm:

Some of the subwords that are created as per the defined rules may never appear in your text to tokenize and may end up occupying extra memory.
Also, for every language, we'll need to define a different set of rules to create subwords.

To alleviate this problem, in practice, most modern tokenizers have a training phase that identifies the recurring text in the input corpus and creates new subwords tokens. For rare patterns, we stick to word-based tokens.

Another important factor that plays a vital role in this process is the size of the vocabulary that is set by the user. Large vocabulary size allows for more common words to be tokenized whereas smaller vocabulary requires more subwords to be created to create every word in the text without using the <UNK> token.

Striking the balance for your application is key here.

Byte Pair Encoding(BPE)

BPE was originally a data compression algorithm that is used to find the best way to represent data by identifying the common byte pairs. It is now used in NLP to find the best representation of text using the least number of tokens.

Here's how it works:

Add an identifier(</w>) at the end of each word to identify the end of a word and then calculate the word frequency in the text.
Split the word into characters and then calculate the character frequency.
From the character tokens, for a predefined number of iterations, count the frequency of the consecutive byte pairs and merge the most frequently occurring byte pairing.
Keep iterating until you have reached the iteration limit(set by you) or if you have reached the token limit.

Let's go through each step(in code) for a sample text. For coding this, I have taken help from Lei Mao's very minimalistic blog on BPE. I encourage you to check it out!

Here's our sample text:

"There is an 80% chance of rainfall today. We are pretty sure it is going to rain."

## define the text first
text = "There is an 80% chance of rainfall today. We are pretty sure it is going to rain."

## get the word frequency and add the end of word () token ## at the end of each word
words = text.strip().split(" ")

print(f"Vocabulary size: {len(words)}")

Step 2: Split the word into characters and then calculate the character frequency.

char_freq_dict = collections.defaultdict(int)
for word, freq in word_freq_dict.items():
    chars = word.split()
    for char in chars:
        char_freq_dict[char] += freq

char_freq_dict

Step 3: Merge the most frequently occurring consecutive byte pairing.

import re

## create all possible consecutive pairs
pairs = collections.defaultdict(int)
for word, freq in word_freq_dict.items():
    chars = word.split()
    for i in range(len(chars)-1):
        pairs[chars[i], chars[i+1]] += freq

Step 4 - Iterate a number of times to find the best(in terms of frequency) pairs to encode and then concatenate them to find the subwords.

It is better at this point to turn structure our code into functions. This will require us to perform the following steps:

Find the most frequently occurring byte pairs in each iteration.
Merge these tokens.
Recalculate the character tokens frequency with the new pair encoding added.
Keep doing it until there is no more pair or you reach the end of the for a loop.

For detailed code, you should check out my Colab notebook.

Here’s a trimmed output of those 4 steps:

So as we iterate with each best pair, we merge(concatenating) the pair and you can see as we recalculate the frequency, the original character token frequency is reduced and the new paired token frequency pops up in the token dictionary.

If you look at the number of tokens created, it first increases because we create new pairings but the number starts to decrease after a number of iterations.

Here, we started from 25 tokens, went up to 31 tokens in the 14th iteration, and then came down to 16 tokens in the 50th iteration. Interesting, right?

Scope of improvement for the BPE algorithm

BPE algorithm is a greedy algorithm i.e. it tries to find the best pair in each iteration. And there are some limitations of this greedy approach.

Thus, there are pros and cons of the BPE algorithm too.

The final tokens will vary depending upon the number of iterations you have run which also causes another problem. We now can have different tokens for a single text and thus different embeddings.

To address this issue, multiple solutions were proposed but the one that stood out was a unigram language model that added subword regularization(a new method of subword segmentation) training that calculates the probability for each subword token to choose the best option using a loss function. More on this in the upcoming blogs.

Do they use BPE in BERTs or GPTs?

Models like BERT or GPT-2 use some version of the BPE or the unigram model to tokenize the input text.

BERT included a new algorithm called WordPiece which is also similar to the BPE but has an added layer of likelihood calculation to decide whether the merged token will make the final cut.

Summary

What you've learned(if at all) in this blog how a machine starts to make sense of language by breaking down the text into very small units.

Now, there are many ways to break text down and thus it becomes important to compare one approach with another.

We started off by understanding tokenization by splitting the English text by spaces but not every language is written the same way(i.e. using spaces to denote segmentation) so we looked at splitting by character to generate character tokens.

The problem with characters was the loss of semantic features from the tokens at the risk of creating incorrect word representations or embeddings.

To get the best of both worlds, subword tokenization was introduced which was more promising and then we looked at the BPE algorithm to implement subword tokenization.

More on the next steps and advanced tokenizers like WordPiece, SentencePiece and how to work with the HuggingFace tokenizer next week.

References and Notes

My post is actually an accumulation of the following papers and blogs that I encourage you to read:

Neural Machine Translation of Rare Words with Subword Units - Research paper that discusses different segmentation techniques based BPE compression algorithm.
GitHub repo on Subword NMT(Neural Machine Translation) - supporting code for the above paper.
Lei Mao’s blog on Byte Pair Encoding - I used the code in his blog to implement and understand BPE myself.
How Machines read - a blog by Cathal Horan.

If you’re looking to start in the field of data science or ML, check out my course on Foundations of Data Science & ML.

If you would like to see more of such content and you are not a subscriber, consider subscribing to my newsletter using the button below.

Bio: Harshit Tyagi is an engineer with amalgamated experience in web technologies and data science(aka full-stack data science). He has mentored over 1000 AI/Web/Data Science aspirants, and is designing data science and ML engineering learning tracks. Previously, Harshit developed data processing algorithms with research scientists at Yale, MIT, and UCLA.

Original. Reposted with permission.

Related: