Multilabel Document Categorization: A Step-by-Step Example
This detailed guide explores a two-stage approach, combining unsupervised and supervised learning with LDA and BERT, to develop a domain-specific document categorizer from unlabeled documents.
By Saurabh Sharma, Machine Learning Engineer.
In a real-world scenario, documents that we encounter usually cover more than one topic. A topic is something that describes the meaning of a document concisely. For instance, let’s take one review from a garage service website — “The process of booking was simple, and the tire I bought was a good price. The only issue was that service took 35 minutes from arriving at the depot to leaving, which I felt was too long for a pre-arranged appointment.”
In this review, there are several underlying themes, such as "ease_of_booking", "tyre_price", and "service_duration", that we can call "topics." It has long been a challenging task for researchers to extract such topic clusters from a raw, unstructured set of documents. I propose a two-step approach using LDA and BERT to build a domain-specific document categorizer that assigns each document to a set of topic clusters, starting from a raw, unlabelled document dataset.
My approach involves two main subtasks:
- Unsupervised learning using LDA (Latent Dirichlet Allocation) to mine a set of topics from an unlabelled document dataset.
- Supervised learning using BERT to build a multi-topic document categorizer.
I scraped reviews from an online garage booking service website for this task and stored them in a CSV file.
Created by the author.
Unsupervised learning using LDA
The following pre-processing steps were performed before training LDA on the garage booking service reviews.
This module is the initial and crucial phase of pre-processing, where the text data is cleaned by removing punctuation, stop words, and non-ASCII characters. Finally, the list of lemmatized tokens for each review is returned as output. Loading the spaCy model is the first step of this module.
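The pre-processing code itself is not reproduced in this version of the article; a minimal sketch of such a cleaning module (assuming the `en_core_web_sm` spaCy model, loaded lazily, and function names of my own choosing) might look like:

```python
import re
import string

import spacy

_nlp = None

def get_nlp():
    """Load the spaCy model once; the model name is an assumption."""
    global _nlp
    if _nlp is None:
        _nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    return _nlp

def clean_text(text):
    """Strip non-ASCII characters and punctuation, collapse whitespace."""
    text = text.encode("ascii", errors="ignore").decode()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip().lower()

def preprocess_review(text):
    """Return the list of lemmatized, stop-word-free tokens for one review."""
    doc = get_nlp()(clean_text(text))
    return [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_space]
```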
Feature engineering involved the following two crucial modules:
1. Getting bigram and trigram tokens for the list of review documents, and then deciding whether to proceed with the bigram tokens or the trigram tokens.
Bigrams and trigrams can surface information present in review sentences in the form of phrases. The "scoring" hyperparameter of the phrase model provided by gensim thus becomes very important, as it decides which algorithm is used to join words into candidate bigrams or trigrams.
I used the normalised point-wise mutual information (“npmi” ) scoring as described here.
The main difference between the default scoring hyperparameter, "original_scorer", and "npmi" lies in the formula used to evaluate the co-occurrence of words: the former uses raw frequency counts, while the latter uses probabilities.
Load the gensim Phraser model for bigram and trigram:
Getting bigram and trigrams:
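Applying the fitted phrasers to every review might then look like this (the phraser objects come from the previous step; names are my own):

```python
def get_bigrams_and_trigrams(token_lists, bigram_phraser, trigram_phraser):
    """Apply the fitted phrase models to every tokenized review."""
    bigram_tokens = [bigram_phraser[tokens] for tokens in token_lists]
    # Trigram phraser expects bigram-joined input.
    trigram_tokens = [trigram_phraser[bigram_phraser[tokens]]
                      for tokens in token_lists]
    return bigram_tokens, trigram_tokens
```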
Bigram and trigram visualization by frequency of occurrence
To decide between the list of bigram tokens and the list of trigram tokens for LDA, I analyzed the frequency distribution of the top 20 bigrams and top 20 trigrams in the garage review dataset.
The Phraser model of gensim uses underscores ("_") to form bigrams and trigrams, e.g., "Easy_Service" (a bigram) and "Ease_of_use" (a trigram). The code snippet given below checks whether a word is a bigram or a trigram by making use of regular expressions.
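A sketch of such a check, counting how often each bigram and trigram phrase occurs (function and variable names are my own):

```python
import re
from collections import Counter

# Exactly one underscore -> bigram; exactly two -> trigram.
BIGRAM_RE = re.compile(r"^[^_]+_[^_]+$")
TRIGRAM_RE = re.compile(r"^[^_]+_[^_]+_[^_]+$")

def count_ngram_phrases(token_lists):
    """Return frequency dictionaries for bigram and trigram phrases."""
    bigram_freq, trigram_freq = Counter(), Counter()
    for tokens in token_lists:
        for token in tokens:
            if BIGRAM_RE.match(token):
                bigram_freq[token] += 1
            elif TRIGRAM_RE.match(token):
                trigram_freq[token] += 1
    return dict(bigram_freq), dict(trigram_freq)
```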
The above code snippet helped in getting dictionaries of bigrams and trigrams, with values in the dictionaries being the frequency of occurrence.
Bigram frequency distribution plot:
Created by the author.
Trigram frequency distribution plot:
Created by the author.
Visualizing the top 20 bigrams and top 20 trigrams by frequency of occurrence showed that most of the information carried by trigram phrases was already conveyed by bigrams, so I chose to proceed with the bigram token lists rather than the trigram ones.
2. Retrieving noun and verb chunks from the list of tokens (bigrams) for each review document.
Noun and verb chunks carry the most useful information for conveying the meaning of each document in summarized form. In most cases, noun chunks alone are enough to capture insights, but here information like "delivery", "wait", etc. was lost when I kept only noun chunks. Thus, I used both noun and verb chunks to capture as many insights as possible, keeping only tokens tagged as nouns, adjectives, proper nouns, verbs, and adverbs.
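The function `return_noun_and_verb_chunks` referenced later is not shown in this version of the article; the sketch below is based on the description above and filters any spaCy-parsed document (or any sequence of token-like objects exposing `pos_` and `lemma_` attributes) down to the five retained part-of-speech tags:

```python
# Universal POS tags kept: noun, adjective, proper noun, verb, adverb.
KEEP_POS = {"NOUN", "ADJ", "PROPN", "VERB", "ADV"}

def return_noun_and_verb_chunks(doc):
    """Keep only the lemmas of tokens whose POS tag is in KEEP_POS."""
    return [token.lemma_ for token in doc if token.pos_ in KEEP_POS]
```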
Corpus creation for LDA
A dictionary was made using a list returned from the function “return_noun_and_verb_chunks” defined in the code snippet presented above.
This dictionary was then used to obtain a Bag of Words representation for each document in the corpus as this representation is required as a parameter by the gensim LDA model.
Grid search for finding the optimal number of topics
The LDA model requires an integer value for the hyperparameter "num_topics" (i.e., the number of topics). This is the most crucial hyperparameter, as it decides how many latent topics are extracted from the training corpus (in our case, "bow_corpus"). I plotted coherence against the number of topics from the grid-search results and chose the value where coherence was highest just before it dropped.
y-axis: Coherence, and x-axis: Number of topics. Created by the author.
Thus, I chose the number of topics = 36 (as it had the highest coherence before the drop).
Fitting data to LDA model
There are a few other hyperparameters that I tuned while fitting the data to the LDA model:
- Chunksize: This hyperparameter controls how many documents are processed per training chunk and can make training faster (provided you have sufficient memory). I used a chunk size of 4096 (a power of 2, for efficient memory allocation).
- Eta: This hyperparameter is the prior on per-topic word weights. The smaller the value of eta, the sparser each topic becomes, i.e., a topic concentrates its weight on a few specific words or bigram phrases, so the weight of those words for that topic becomes much higher than their weight in other topics. I chose eta = 0.1, as it returned the best coherence.
After fitting the data into the LDA model, I viewed the top 10 words present in each topic:
Each topic returned by LDA is identified only by an integer. Inspecting the weights of the words/phrases in each topic, along with the information captured by the documents containing those words/phrases, gave me enough insight to assign a human-readable label to each topic number. Note that some topic numbers captured the same information, so I assigned the same cluster label to several of the 36 topics.
The resulting mapping between the LDA topic numbers and the human-readable labels assigned to them looked like:
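An illustrative, hypothetical excerpt of such a mapping, using label names from the review example earlier; the real dictionary covered all 36 topic numbers:

```python
# Illustrative only -- the real mapping covered all 36 topic numbers,
# and several topic numbers shared the same cluster label.
topic_equivalent_cluster_dictionary = {
    0: "ease_of_booking",
    1: "tyre_price",
    2: "service_duration",
    3: "tyre_price",  # duplicate topic mapped to the same cluster
    # ... remaining topic numbers omitted
}
```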
Each document in the corpus was assigned a topic number on the basis of the probability distribution of topics within the document: the topic with the maximum probability was assigned to the document.
Converting the topic number obtained for each document into a topic cluster using “topic_equivalent_cluster_dictionary” formed above:
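Both steps can be sketched as follows (the `cluster_dict` argument is the `topic_equivalent_cluster_dictionary` described above; function names are my own):

```python
def dominant_topic(topic_distribution):
    """Pick the topic id with maximum probability from an LDA distribution,
    e.g. the (topic_id, probability) pairs returned by
    lda_model.get_document_topics(bow_doc)."""
    return max(topic_distribution, key=lambda pair: pair[1])[0]

def label_documents(lda_model, bow_corpus, cluster_dict):
    """Map every document to the cluster label of its dominant topic."""
    labels = []
    for bow_doc in bow_corpus:
        topic_id = dominant_topic(lda_model.get_document_topics(bow_doc))
        labels.append(cluster_dict[topic_id])
    return labels
```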
The above code snippets completed the first of the two subtasks, returning topic clusters for each document. Training a supervised model will now make it feasible to obtain a topic cluster for any new, unseen document (provided the document comes from the same domain).
Supervised learning using BERT
Encoding each word present in the document with BERT embeddings provided rich contextual information for each labeled document in the dataset. These embeddings were then passed on to a downstream document classification task.
I used the original documents (without pre-processing the documents through spacy) and stored the labels (topic clusters) obtained from LDA for each corresponding document in a separate column in a CSV.
The only pre-processing step I performed for this task was removing non-ASCII characters, such as emojis, from the document dataset.
I then assigned the loss function, evaluation metric, and optimization parameters.
Since the labels were English word phrases, I one-hot encoded them using the label binarizer provided by sklearn before fitting data into the model:
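A sketch with scikit-learn's `LabelBinarizer` (the label values here are illustrative):

```python
from sklearn.preprocessing import LabelBinarizer

# Cluster labels produced by the LDA stage (illustrative values).
labels = ["tyre_price", "ease_of_booking", "service_duration", "tyre_price"]

binarizer = LabelBinarizer()
y = binarizer.fit_transform(labels)  # one-hot matrix, one column per cluster
```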
Fitting the model with a validation split of 15% in order to monitor model’s performance:
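The training code is not shown in this version of the article; the minimal Keras sketch below assumes the documents have already been encoded as fixed-size BERT embeddings `X` (e.g. 768-dimensional pooled vectors) with one-hot labels `y` from the previous step. The head architecture, loss, optimizer, and batch size are all assumptions, not the author's exact configuration.

```python
import tensorflow as tf

def build_classifier(embedding_dim, num_classes):
    """A small classification head on top of precomputed BERT embeddings."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(embedding_dim,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam", metrics=["accuracy"])
    return model

# X: (num_docs, 768) BERT document embeddings; y: one-hot cluster labels
# model = build_classifier(embedding_dim=768, num_classes=y.shape[1])
# model.fit(X, y, epochs=12, batch_size=32, validation_split=0.15)
```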
I obtained 92% accuracy on the validation set after 12 epochs. I hope you get even better results on your dataset. I look forward to any feedback or questions.
Original. Reposted with permission.
- Text Analysis 101: Document Classification
- Topic Modeling with BERT
- How to Train a BERT Model From Scratch