By Maarten Grootendorst, Data Scientist
When a product owner approaches me to do some NLP-based analyses, I am typically asked the following question:
"Which topic can frequently be found in these documents?"
Without any categories or labels, I am forced to look into unsupervised techniques to extract these topics, namely topic modeling.
Although topic models such as LDA and NMF have proven to be good starting points, I always felt it took quite some effort through hyperparameter tuning to create meaningful topics.
Moreover, I wanted to use transformer-based models such as BERT as they have shown amazing results in various NLP tasks over the last few years. Pre-trained models are especially helpful as they are supposed to contain more accurate representations of words and sentences.
A few weeks ago I saw this great project named Top2Vec* which leveraged document- and word embeddings to create topics that were easily interpretable. I started looking at the code to generalize Top2Vec such that it could be used with pre-trained transformer models.
The great advantage of Doc2Vec, which Top2Vec builds on, is that the resulting document and word embeddings are jointly embedded in the same space, which allows document embeddings to be represented by nearby word embeddings. Unfortunately, generalizing this proved to be difficult as BERT embeddings are token-based and do not necessarily occupy the same space**.
Instead, I decided to come up with a different algorithm that could use BERT and 🤗 transformers embeddings. The result is BERTopic, an algorithm for generating topics using state-of-the-art embeddings.
The main topic of this article will not be the use of BERTopic but a tutorial on how to use BERT to create your own topic model.
PAPER*: Angelov, D. (2020). Top2Vec: Distributed Representations of Topics. arXiv preprint arXiv:2008.09470.
NOTE**: Although you could have them occupy the same space, the resulting size of the word embeddings is quite large due to the contextual nature of BERT. Moreover, there is a chance that the resulting sentence- or document embeddings will degrade in quality.
1. Data & Packages
For this example, we use the famous 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts on 20 topics. Using Scikit-Learn, we can quickly download and prepare the data:
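A minimal sketch of this loading step, assuming scikit-learn's `fetch_20newsgroups` loader. The helper name `load_newsgroup_posts` is hypothetical, and the first call downloads the dataset, so it needs network access.

```python
def load_newsgroup_posts(subset="all"):
    """Return the raw 20 Newsgroups posts as a list of strings."""
    # Imported lazily so that defining the helper does not require scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    # subset can be "train", "test", or "all"; the first call downloads the data.
    return fetch_20newsgroups(subset=subset).data
```

Calling `docs = load_newsgroup_posts()` would then give one string per post.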
If you want to speed up training, you can select the "train" subset, as it will decrease the number of posts you extract.
NOTE: If you want to apply topic modeling not on the entire document but on the paragraph level, I would suggest splitting your data before creating the embeddings.
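One way to do that split, as a rough sketch: it assumes paragraphs are separated by blank lines, which may not hold for every corpus. The helper name `split_into_paragraphs` is hypothetical. It also records which document each paragraph came from, so topics can be mapped back later.

```python
import re


def split_into_paragraphs(docs):
    """Split each document on blank lines and remember its source document."""
    paragraphs, doc_ids = [], []
    for doc_id, doc in enumerate(docs):
        # A paragraph boundary is assumed to be one or more blank lines.
        for chunk in re.split(r"\n\s*\n", doc):
            chunk = chunk.strip()
            if chunk:  # skip empty fragments
                paragraphs.append(chunk)
                doc_ids.append(doc_id)
    return paragraphs, doc_ids


paragraphs, doc_ids = split_into_paragraphs(
    ["First post.\n\nSecond paragraph.", "Only one paragraph."]
)
print(paragraphs)
print(doc_ids)
```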
The very first step is to convert the documents to numerical data. We use BERT for this purpose as it extracts different embeddings based on the context of the word. On top of that, many pre-trained models are available and ready to be used.
How you generate the BERT embeddings for a document is up to you. However, I prefer to use the sentence-transformers package as the resulting embeddings have been shown to be of high quality and typically work quite well for document-level embeddings.
Install the package with pip install sentence-transformers before generating the document embeddings. If you run into issues installing this package, then it is worth installing PyTorch first.
Then, run the following code to transform your documents into 768-dimensional vectors:
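A minimal sketch of the embedding step with sentence-transformers. The checkpoint name `distilbert-base-nli-mean-tokens` is an assumption (the text only says that DistilBERT is used), and the helper name `embed_documents` is hypothetical. The model is downloaded on first use.

```python
def embed_documents(docs, model_name="distilbert-base-nli-mean-tokens"):
    """Encode each document into one dense vector (a NumPy array, one row per document)."""
    # Imported lazily so that defining the helper does not require the package.
    from sentence_transformers import SentenceTransformer

    # model_name is an assumed DistilBERT checkpoint; any model supported by
    # sentence-transformers can be passed instead.
    model = SentenceTransformer(model_name)
    return model.encode(docs, show_progress_bar=True)
```

The returned array has one row per input document, which is what the later dimensionality reduction step expects.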
We are using DistilBERT as it gives a nice balance between speed and performance. The package has several multilingual models available for you to use.
NOTE: Since transformer models have a token limit, you might run into some errors when inputting large documents. In that case, you could consider splitting documents into paragraphs.
We want to make sure that documents with similar topics are clustered together such that we can find the topics within these clusters. Before doing so, we first need to lower the dimensionality of the embeddings as many clustering algorithms handle high dimensionality poorly.
Out of the few dimensionality reduction algorithms, UMAP is arguably the best performing as it keeps a significant portion of the high-dimensional local structure in lower dimensionality.
Install the package with pip install umap-learn before we lower the dimensionality of the document embeddings. We reduce the dimensionality to 5 while keeping the size of the local neighborhood at 15. You can play around with these values to optimize for your topic creation. Note that too low a dimensionality results in a loss of information, while too high a dimensionality results in poorer clustering results.
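A minimal sketch of the reduction step with umap-learn, using the 5 components and neighborhood size of 15 mentioned above. The cosine metric is an assumption that is common for sentence embeddings, and the helper name `reduce_dimensionality` is hypothetical.

```python
def reduce_dimensionality(embeddings, n_neighbors=15, n_components=5, metric="cosine"):
    """Project document embeddings to a low-dimensional space with UMAP."""
    # Imported lazily so that defining the helper does not require umap-learn.
    import umap

    reducer = umap.UMAP(
        n_neighbors=n_neighbors,    # size of the local neighborhood
        n_components=n_components,  # target dimensionality
        metric=metric,              # assumed distance; not stated in the text
    )
    return reducer.fit_transform(embeddings)
```

Lowering `n_components` makes clustering easier but discards more information, so it is worth trying a few values.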
After having reduced the dimensionality of the document embeddings to 5, we can cluster the documents with HDBSCAN. HDBSCAN is a density-based algorithm that works quite well with UMAP since UMAP maintains a lot of local structure even in lower-dimensional space. Moreover, HDBSCAN does not force every data point into a cluster; points that do not fit any cluster are treated as outliers.
Install the package with pip install hdbscan, then create the clusters:
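A minimal sketch of the clustering step with the hdbscan package. The value `min_cluster_size=15`, the Euclidean metric and the "eom" selection method are assumptions, and the helper name `cluster_documents` is hypothetical.

```python
def cluster_documents(reduced_embeddings, min_cluster_size=15):
    """Cluster reduced embeddings with HDBSCAN; the label -1 marks outliers."""
    # Imported lazily so that defining the helper does not require hdbscan.
    import hdbscan

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,  # assumed value; tune per dataset
        metric="euclidean",
        cluster_selection_method="eom",
    )
    # fit_predict returns one integer label per document.
    return clusterer.fit_predict(reduced_embeddings)
```

A larger `min_cluster_size` tends to produce fewer, broader clusters.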
Great! We now have clustered similar documents together which should represent the topics that they consist of. To visualize the resulting clusters we can further reduce the dimensionality to 2 and visualize the outliers as grey points:
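A rough sketch of such a plot with matplotlib. It assumes `points_2d` already holds a 2-component reduction of the embeddings and `labels` holds the cluster labels, with -1 marking outliers. Both helper names and the output path are hypothetical.

```python
import numpy as np


def split_outliers(points_2d, labels):
    """Separate outlier points (label -1) from clustered points."""
    points_2d = np.asarray(points_2d)
    labels = np.asarray(labels)
    is_outlier = labels == -1
    return points_2d[is_outlier], points_2d[~is_outlier], labels[~is_outlier]


def plot_clusters(points_2d, labels, path="clusters.png"):
    """Save a scatter plot with outliers in grey and clusters in color."""
    # Imported lazily so that split_outliers works without matplotlib.
    import matplotlib.pyplot as plt

    outliers, clustered, cluster_labels = split_outliers(points_2d, labels)
    fig, ax = plt.subplots(figsize=(10, 7))
    ax.scatter(outliers[:, 0], outliers[:, 1], color="#BDBDBD", s=2)
    ax.scatter(clustered[:, 0], clustered[:, 1], c=cluster_labels, cmap="hsv_r", s=2)
    fig.savefig(path)
    plt.close(fig)
```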
It is difficult to visualize the individual clusters due to the number of topics generated (~55). However, we can see that even in 2-dimensional space some local structure is kept.
NOTE: You could skip the dimensionality reduction step if you use a clustering algorithm that can handle high dimensionality like a cosine-based k-Means.
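One common approximation of a cosine-based k-Means, sketched under assumptions: L2-normalize the embeddings and run scikit-learn's standard k-Means on the unit vectors, since Euclidean distance between unit vectors is monotonic in cosine distance. This is not a dedicated spherical k-Means implementation, and the helper name `cosine_kmeans` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def cosine_kmeans(embeddings, n_clusters, random_state=42):
    """Approximate cosine-based k-Means by clustering L2-normalized vectors."""
    unit_vectors = normalize(np.asarray(embeddings, dtype=float))
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return model.fit_predict(unit_vectors)


# Two toy directions: vectors along the x-axis and vectors along the y-axis.
toy_labels = cosine_kmeans([[1, 0], [10, 0.1], [0, 1], [0.2, 8]], n_clusters=2)
print(toy_labels)
```

Unlike HDBSCAN, this requires choosing the number of clusters up front and assigns every document to a cluster.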
4. Topic Creation
What we want to know from the clusters that we generated is: what makes one cluster, based on its content, different from another?
How can we derive topics from clustered documents?
To solve this, I came up with a class-based variant of TF-IDF (c-TF-IDF) that would allow me to extract what makes each set of documents unique compared to the others.
The intuition behind the method is as follows. When you apply TF-IDF as usual on a set of documents, what you are basically doing is comparing the importance of words between documents.
What if we instead treat all documents in a single category (e.g., a cluster) as a single document and then apply TF-IDF? The result would be a very long document per category and the resulting TF-IDF score would demonstrate the important words in a topic.
To create this class-based TF-IDF score, we need to first create a single document for each cluster of documents:
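A minimal sketch of this joining step in plain Python, assuming `docs` and `labels` are aligned lists of documents and cluster labels. The helper name `join_documents_per_topic` is hypothetical.

```python
from collections import defaultdict


def join_documents_per_topic(docs, labels):
    """Concatenate all documents that share a cluster label into one string."""
    grouped = defaultdict(list)
    for doc, label in zip(docs, labels):
        grouped[int(label)].append(doc)
    # Sort by topic id so the output order is stable.
    return {topic: " ".join(texts) for topic, texts in sorted(grouped.items())}


joined = join_documents_per_topic(["a b", "c", "d e"], [0, 1, 0])
print(joined)
```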
Then, we apply the class-based TF-IDF:
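Written out from the definitions that follow, the score for word t in class i can be expressed as below. The logarithm on the inverse-frequency term follows the usual TF-IDF convention and is an assumption here, as is the symbol n for the number of classes.

```latex
\text{c-TF-IDF}_{t,i} \;=\; \frac{t_i}{w_i} \times \log\!\left(\frac{m}{\sum_{j=1}^{n} t_j}\right)
```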
In this score, the frequency of each word t is extracted for each class i and divided by the total number of words w in that class. This action can be seen as a form of regularization of frequent words in the class. Next, the total, unjoined, number of documents m is divided by the total frequency of word t across all classes.
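The computation described above can be sketched with scikit-learn's `CountVectorizer` and NumPy. This is an illustration under assumptions rather than the reference implementation: the logarithm on the inverse-frequency term, the English stop-word list and the function name `c_tf_idf` are all choices made for this sketch.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def c_tf_idf(joined_docs, n_original_docs, ngram_range=(1, 1)):
    """Return a (n_classes, n_words) score matrix and the fitted vectorizer."""
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    counts = vectorizer.fit_transform(joined_docs).toarray()

    # Term frequency: word count in the class divided by all words in that class.
    words_per_class = np.maximum(counts.sum(axis=1, keepdims=True), 1)
    tf = counts / words_per_class

    # Inverse frequency: unjoined document count over the word's total frequency.
    total_per_word = counts.sum(axis=0)
    idf = np.log(n_original_docs / total_per_word)

    return tf * idf, vectorizer


scores, vectorizer = c_tf_idf(
    ["hockey game team hockey", "space nasa orbit space"], n_original_docs=6
)
print(scores.shape)
```

Each row of `scores` belongs to one joined class document, and `vectorizer.vocabulary_` maps each word to its column.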
Now, we have a single importance value for each word in a cluster which can be used to create the topic. If we take, for example, the top 10 most important words in each cluster, then we would get a good representation of a cluster, and thereby a topic.
In order to create a topic representation, we take the top 20 words per topic based on their c-TF-IDF scores. The higher the score, the more representative it should be of its topic as the score is a proxy of information density.
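A minimal sketch of that selection, assuming a score matrix with one row per topic and one column per word. The function name `extract_top_n_words` and its argument layout are hypothetical.

```python
import numpy as np


def extract_top_n_words(scores, words, topic_ids, n=20):
    """Map each topic id to its n highest-scoring (word, score) pairs."""
    scores = np.asarray(scores, dtype=float)
    top_n_words = {}
    for row, topic in enumerate(topic_ids):
        # Column indices sorted from highest to lowest score.
        best = np.argsort(scores[row])[::-1][:n]
        top_n_words[topic] = [(words[j], float(scores[row, j])) for j in best]
    return top_n_words


top_words = extract_top_n_words(
    scores=[[0.1, 0.5, 0.0], [0.3, 0.0, 0.9]],
    words=["a", "b", "c"],
    topic_ids=[-1, 0],
    n=2,
)
print(top_words)
```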
We can use topic_sizes to view how frequent certain topics are:
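One simple way to build such a size table from the cluster labels, sketched with the standard library. The helper name `extract_topic_sizes` is hypothetical, and it returns a list of pairs instead of a data frame.

```python
from collections import Counter


def extract_topic_sizes(labels):
    """Return (topic, size) pairs sorted from the largest topic to the smallest."""
    return Counter(int(label) for label in labels).most_common()


topic_sizes = extract_topic_sizes([0, -1, 0, 1, 0, -1])
print(topic_sizes)
```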
The topic name -1 refers to all documents that did not have any topics assigned. The great thing about HDBSCAN is that not all documents are forced towards a certain cluster. If no cluster could be found, then it is simply an outlier.
We can see that topics 7, 43, 12, and 41 are the largest clusters that we could create. To view the words belonging to those topics, we can simply use the dictionary top_n_words to access these topics:
Looking at the four largest topics, I would say that these seem to represent easily interpretable topics nicely!
I can see sports, computers, space, and religion as clear topics that were extracted from the data.
5. Topic Reduction
There is a chance that, depending on the dataset, you will get hundreds of topics! You can tweak the parameters of HDBSCAN to get fewer topics through its min_cluster_size parameter, but it does not allow you to specify the exact number of clusters.
A nifty trick that Top2Vec uses is to reduce the number of topics by merging the topic vectors that are most similar to each other.
We can use a similar technique by comparing the c-TF-IDF vectors among topics, merging the most similar ones, and finally re-calculating the c-TF-IDF vectors to update the representation of our topics:
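A sketch of a single merge step with NumPy, under stated assumptions: topics are compared by cosine similarity of their c-TF-IDF vectors, the outlier topic -1 is never merged or merged into, and at least two real topics exist. The function name `merge_smallest_topic` is hypothetical. Re-calculating the c-TF-IDF vectors after the merge is left to the caller.

```python
import numpy as np


def merge_smallest_topic(labels, topic_ids, topic_vectors):
    """Relabel the smallest topic as its most similar topic; return the new labels."""
    labels = np.asarray(labels).copy()
    topic_ids = list(topic_ids)
    vectors = np.asarray(topic_vectors, dtype=float)

    candidates = [t for t in topic_ids if t != -1]
    if len(candidates) < 2:
        raise ValueError("need at least two non-outlier topics to merge")

    # Cosine similarity between all pairs of topic vectors.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)
    similarity = unit @ unit.T
    np.fill_diagonal(similarity, -np.inf)  # a topic cannot merge with itself

    # Pick the least common topic and exclude the outlier topic as a target.
    smallest = min(candidates, key=lambda t: int((labels == t).sum()))
    row = topic_ids.index(smallest)
    if -1 in topic_ids:
        similarity[row, topic_ids.index(-1)] = -np.inf

    target = topic_ids[int(np.argmax(similarity[row]))]
    labels[labels == smallest] = target
    return labels, smallest, target


new_labels, merged, into = merge_smallest_topic(
    labels=[0, 0, 0, 1, 1, 1, 2],
    topic_ids=[0, 1, 2],
    topic_vectors=[[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]],
)
print(merged, "->", into)
```

Repeating the call, with freshly computed topic vectors each time, reduces the number of topics one by one.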
Above, we took the least common topic and merged it with the most similar topic. By repeating this 19 more times we reduced the number of topics from 56 to 36!
NOTE: We can skip the re-calculation part of this pipeline to speed up the topic reduction step. However, it is more accurate to re-calculate the c-TF-IDF vectors as that would better represent the newly generated content of the topics. You can play around with this by, for example, updating every n steps to both speed up the process and still have good topic representations.
TIP: You can use the method described in this article (or simply use BERTopic) to also create sentence-level embeddings. The main advantage of this is the possibility to view the distribution of topics within a single document.
Thank you for reading!
Bio: Maarten Grootendorst is a Data Scientist, mostly working with ML and NLP, with a background in Organizational and Clinical Psychology. Maarten's path to this point has not been conventional, transitioning from psychology to data science, but has left him with a strong desire to create data-driven solutions that make the world a slightly better place.
Original. Reposted with permission.