Multi-Class Text Classification with Doc2Vec & Logistic Regression
Doc2vec is an NLP tool for representing documents as a vector and is a generalizing of the word2vec method. In order to understand doc2vec, it is advisable to understand word2vec approach.
By Susan Li, Sr. Data Scientist
Photo credit: Pexels
In order to understand doc2vec, it is advisable to understand word2vec approach. However, the complete mathematical details is out of scope of this article. If you are new to word2vec and doc2vec, the following resources can help you to get start:
- Distributed Representations of Words and Phrases and their Compositionality
- Distributed Representations of Sentences and Documents
- A gentle introduction to Doc2Vec
- Gensim Doc2Vec Tutorial on the IMDB Sentiment Dataset
- Document classification with word embeddings tutorial
Using the same data set when we did Multi-Class Text Classification with Scikit-Learn, In this article, we’ll classify complaint narrative by product using doc2vec techniques in Gensim. Let’s get started!
The goal is to classify consumer finance complaints into 12 pre-defined classes. The data can be downloaded from data.gov.
After remove null values in narrative columns, we will need to re-index the data frame.
We have over 63 million words, it is a relatively large data set.
The classes are imbalanced, however, a naive classifier that predicts everything to be Debt collection will only achieve over 20% accuracy.
Let’s have a look a few examples of complaint narrative and its associated product.
Below we define a function to convert text to lower-case and strip punctuation/symbols from words and so on.
The following steps include train/test split of 70/30, remove stop-words and tokenize text using NLTK tokenizer. For our first try we tag every complaint narrative with its product.
This is what a training entry looks like — an example complaint narrative tagged by “Credit reporting”.
Set-up Doc2Vec Training & Evaluation Models
First, we instantiate a doc2vec model — Distributed Bag of Words (DBOW). In the word2vec architecture, the two algorithm names are “continuous bag of words” (CBOW) and “skip-gram” (SG); in the doc2vec architecture, the corresponding algorithms are “distributed memory” (DM) and “distributed bag of words” (DBOW).
Distributed Bag of Words (DBOW)
DBOW is the doc2vec model analogous to Skip-gram model in word2vec. The paragraph vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph.
We will vary the following parameters:
dm=0, distributed bag of words (PV-DBOW) is used; if
dm=1,‘distributed memory’ (PV-DM) is used.
- 300- dimensional feature vectors.
min_count=2, ignores all words with total frequency lower than this.
negative=5, specifies how many “noise words” should be drawn.
hs=0, and negative is non-zero, negative sampling will be used.
sample=0, the threshold for configuring which higher-frequency words are randomly down sampled.
workers=cores, use these many worker threads to train the model (=faster training with multicore machines).
Building a Vocabulary
Training a doc2vec model is rather straight-forward in Gensim, we initialize the model and train for 30 epochs.
Building the Final Vector Feature for the Classifier
Train the Logistic Regression Classifier.
Testing accuracy 0.6683609437751004
Testing F1 score: 0.651646431211616
Distributed Memory (DM)
Distributed Memory (DM) acts as a memory that remembers what is missing from the current context — or as the topic of the paragraph. While the word vectors represent the concept of a word, the document vector intends to represent the concept of a document. We again instantiate a Doc2Vec model with a vector size with 300 words and iterating over the training corpus 30 times.
Train the Logistic Regression Classifier
Testing accuracy 0.47498326639892907
Testing F1 score: 0.4445833078167434
According to Gensim doc2vec tutorial on the IMDB sentiment data set, combining a paragraph vector from Distributed Bag of Words (DBOW) and Distributed Memory (DM) improves performance. We will follow, pairing the models together for evaluation.
First, we delete temporary training data to free up RAM.
Concatenate two models.
Building feature vectors.
Train the Logistic Regression
Testing accuracy 0.6778572623828648
Testing F1 score: 0.664561533967402
The result improved by 1%.
For this article, I used training set to train doc2vec, however, in Gensim’s tutorial, the whole data set was used for training, I tried that approach, using the whole data set to train doc2vec classifier for our consumer complaint classification, I was able to achieve 70% accuracy. You can find that notebookhere, it is a little different approach.
Bio: Susan Li is changing the world, one article at a time. She is a Sr. Data Scientist, located in Toronto, Canada.
Original. Reposted with permission.
- Machine Learning for Text Classification Using SpaCy in Python
- Topic Modeling with LSA, PLSA, LDA & lda2Vec
- Multi-Class Text Classification with Scikit-Learn