Word Embeddings & Self-Supervised Learning, Explained

There are many algorithms to learn word embeddings. Here, we consider only one of them: word2vec, and only one version of word2vec called skip-gram, which works well in practice.



By Andriy Burkov, Author of The Hundred-Page Machine Learning Book

Editor's note: This is an excerpt from Chapter 10 of Andriy Burkov's recently released The Hundred-Page Machine Learning Book.

We have already discussed word embeddings in Chapter 7. Recall that word embeddings are feature vectors that represent words. They have the property that similar words have similar feature vectors. The question that you probably wanted to ask is where these word embeddings come from. The answer is (again): they are learned from data.

There are many algorithms to learn word embeddings. Here, we consider only one of them: word2vec, and only one version of word2vec called skip-gram, which works well in practice. Pretrained word2vec embeddings for many languages are available to download online.

In word embedding learning, our goal is to build a model which we can use to convert a one-hot encoding of a word into a word embedding. Let our dictionary contain 10,000 words.

The one-hot vector for each word is a 10,000-dimensional vector of all zeroes except for one dimension that contains a 1. Different words have a 1 in different dimensions.
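To make this concrete, here is a minimal sketch (my own illustration, not code from the book) of what a one-hot vector looks like; the toy dictionary and word indices are assumptions made purely for the example:

```python
import numpy as np

# Toy illustration: map each dictionary word to an index, then represent
# the word as a vector of all zeros with a single 1 at that index.
dictionary = {"book": 0, "paper": 1, "article": 2}  # the real dictionary has 10,000 words
vocab_size = 10_000

def one_hot(word, dictionary, vocab_size):
    vec = np.zeros(vocab_size)
    vec[dictionary[word]] = 1.0
    return vec

v = one_hot("book", dictionary, vocab_size)
print(v.shape, v.sum())  # (10000,) 1.0
```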

Consider a sentence: “I almost finished reading the book on machine learning.” Now, consider the same sentence from which we have removed one word, say “book.” Our sentence becomes: “I almost finished reading the · on machine learning.” Now let’s only keep the three words before the · and three words after: “finished reading the · on machine learning.” Looking at this seven-word window around the ·, if I ask you to guess what · stands for, you would probably say: “book,” “article,” or “paper.” That’s how the context words let you predict the word they surround. It’s also how the machine can learn that the words “book,” “paper,” and “article” have a similar meaning: because they share similar contexts in multiple texts.

It turns out that it works the other way around too: a word can predict the context that surrounds it. The piece “finished reading the · on machine learning” is called a skip-gram with window size 7 (3 + 1 + 3). By using the documents available on the Web, we can easily create hundreds of millions of skip-grams.

Let’s denote a skip-gram like this: [x−3, x−2, x−1, x, x+1, x+2, x+3]. In our sentence, x−3 is the one-hot vector for “finished,” x−2 corresponds to “reading,” x is the skipped word (·), x+1 is “on” and so on. A skip-gram with window size 5 will look like this: [x−2, x−1, x, x+1, x+2].
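As a rough sketch of how such skip-grams could be extracted from tokenized text (the function, tokenization, and variable names below are mine, not the book's):

```python
def skip_grams(tokens, half_window=2):
    """Return (context, center) pairs for a window of size 2*half_window + 1."""
    grams = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - half_window), min(len(tokens), i + half_window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]  # words around the skipped word
        grams.append((context, center))
    return grams

sentence = "I almost finished reading the book on machine learning".lower().split()
for context, center in skip_grams(sentence, half_window=2):
    print(center, "<-", context)
# e.g. the <- ['finished', 'reading', 'book', 'on']
```

Running this over hundreds of millions of sentences from the Web yields the training pairs the model learns from.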

The skip-gram model with window size 5 is schematically depicted in Figure 2. It is a fully-connected network, like the multilayer perceptron. The input word is the one denoted as · in the skip-gram. The neural network has to learn to predict the context words of the skip-gram given the central word.

You can see now why this kind of learning is called self-supervised: the labeled examples are extracted from unlabeled data, such as text.

Figure 2: The skip-gram model with window size 5 and the embedding layer of 300 units.

The activation function used in the output layer is softmax. The cost function is the negative log-likelihood. The embedding for a word is obtained as the output of the embedding layer when the one-hot encoding of this word is given as the input to the model.
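Here is a minimal sketch of such a network, assuming PyTorch (my own illustration, not the book's code): a 10,000-word vocabulary, a 300-unit embedding layer as in Figure 2, and a softmax output trained with the negative log-likelihood (combined in PyTorch's cross-entropy loss). The word indices in the usage lines are made up.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM = 10_000, 300

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, emb_dim=EMB_DIM):
        super().__init__()
        # Embedding layer: equivalent to multiplying a one-hot input vector
        # by a (vocab_size x emb_dim) weight matrix.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Output layer: one score per dictionary word; softmax over these
        # scores gives the probabilities of the context words.
        self.output = nn.Linear(emb_dim, vocab_size)

    def forward(self, center_word_ids):
        emb = self.embedding(center_word_ids)   # the word embedding
        return self.output(emb)                 # logits over the vocabulary

model = SkipGramModel()
loss_fn = nn.CrossEntropyLoss()  # softmax + negative log-likelihood

# One (center, context) training pair, with made-up word indices:
center = torch.tensor([42])      # index of the central word
context = torch.tensor([1337])   # index of one of its context words
loss = loss_fn(model(center), context)
loss.backward()

# The embedding of a word is the corresponding row of the embedding layer:
book_embedding = model.embedding.weight[42]  # a 300-dimensional vector
```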

Because of the large number of parameters in word2vec models, two techniques are used to make the computation more efficient: hierarchical softmax (an efficient way of computing softmax that represents the outputs of softmax as the leaves of a binary tree) and negative sampling (the idea is to update only a random sample of all outputs per iteration of gradient descent). I leave these for further reading.
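To give a flavor of the negative-sampling idea, here is a rough sketch reusing the hypothetical SkipGramModel above (again my own illustration, not the book's or the original word2vec code): instead of a softmax over all 10,000 outputs, only the true context word and a few randomly sampled "negative" words contribute to the update.

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(model, center, context, num_negatives=5, vocab_size=10_000):
    emb = model.embedding(center)             # embedding of the central word, (1, 300)
    pos_w = model.output.weight[context]      # output weights of the true context word, (1, 300)
    neg_ids = torch.randint(0, vocab_size, (num_negatives,))
    neg_w = model.output.weight[neg_ids]      # output weights of sampled negatives, (5, 300)

    # Push the score of the true context word up ...
    pos_loss = F.logsigmoid((emb * pos_w).sum(dim=1)).sum()
    # ... and the scores of the randomly sampled words down.
    neg_loss = F.logsigmoid(-(neg_w @ emb.squeeze(0))).sum()
    return -(pos_loss + neg_loss)

# Usage with the SkipGramModel sketched above (made-up word indices):
loss = negative_sampling_loss(model, torch.tensor([42]), torch.tensor([1337]))
loss.backward()
```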

 
Bio: Andriy Burkov (@burkov) is the author of The Hundred-Page Machine Learning Book, and Machine Learning Team Leader at Gartner.
