Understanding NLP and Topic Modeling Part 1
In this post, we seek to understand why topic modeling is important and how it helps us as data scientists.
By Tony Yiu, Data Scientist at Solovis
Natural language processing (NLP) is one of the trendier areas of data science. Its end applications are many — chatbots, recommender systems, search, virtual assistants, etc.
So it would be beneficial for budding data scientists to at least understand the basics of NLP even if their career takes them in a completely different direction. And who knows, some topics extracted through NLP might just give your next model that extra analytical boost. Today, in this post, we seek to understand why topic modeling is important and how it helps us as data scientists.
Topic modeling, just as it sounds, is using an algorithm to discover the topic or set of topics that best describes a given text document. You can think of each topic as a word or a set of words.
The Objective of Topic Modeling
The first time I worked with NLP, I wondered to myself:
“Is NLP just another form of EDA (exploratory data analysis)?”
That’s because up until then, I had been mainly building models with a clear objective in mind — use X to forecast or explain Y. NLP was much less structured and clear. Even when I finally successfully ran my topic modeling algorithm, the topics that fell out produced more questions than answers. Here, take a look at some of the topics that came out of my NLP analysis of Reddit:
book read reading history series love author first people novel world finished
like feel something people look seem seems stuff actually right always question
story writing write character main novel chapter short plot first scene advice
think wait bit better people bad true might worth thinking put sell
need financial house information relevant situation invest question money making ask consider
car insurance loan vehicle damage accident hit month payment driver title pay
Some of the topics make sense, some of them do not. And what exactly should I be doing with these topics anyways?
Such is life as a data scientist — often the real work begins only after you finally get your data cleaned and your code debugged. Then, at long last, it’s time to find those annoyingly elusive insights. And that’s precisely the point of NLP and topic modeling. It might not be an end unto itself, but extracting topics via NLP gets us that much closer to generating something useful in much the same way that dimensionality reduction techniques help us on the numerical side of the data science world.
Topic modeling allows us to cut through the noise (deal with the high dimensionality of text data) and identify the signal (the main topics) of our text data.
And with this distilled signal, we can start the real work of generating insights. Let’s go through this step by step.
The Curse of Dimensionality
High dimensional data is regarded as a curse in many data science applications. If you want to understand why in more detail, here is a previous post I wrote on The Curse of Dimensionality. But for those short on time here’s the TLDR:
- If we have more features than observations, then we run the risk of massively overfitting our model, which would generally result in terrible out-of-sample performance.
- When we have too many features, observations become harder to cluster. Believe it or not, too many dimensions cause every observation in your dataset to appear equidistant from all the others. And because clustering uses a distance measure such as Euclidean distance to quantify the similarity between observations, this is a big problem. If the distances are all approximately equal, then all the observations appear equally alike (as well as equally different), and no meaningful clusters can be formed.
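To make the second point concrete, here is a minimal sketch, assuming points drawn uniformly at random from a unit hypercube (my assumption, not real text data). The helper name `relative_contrast` is made up for illustration. It measures how much farther the farthest neighbor is than the nearest one; as the number of dimensions grows, that gap should shrink toward zero.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over the distances from one point to all others."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    # Euclidean distance from the first point to every other point
    distances = [math.dist(query, p) for p in others]
    return (max(distances) - min(distances)) / min(distances)

# Hypothetical sizes: 50 random points at increasing dimensionality
for dims in (2, 10, 100, 1000):
    print(dims, round(relative_contrast(50, dims), 3))
```

A contrast near zero means the nearest and farthest neighbors are almost the same distance away, which is exactly the situation that leaves a clustering algorithm with nothing to work with.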
When are we most likely to run into high dimensional data (a.k.a. too many features)? Text data. To see why, imagine how we would encode the data for the following sentence so that an algorithm can do something with it:
“The man was wearing a jacket with a gold star.”
A natural way is what’s called the bag of words approach, which represents a given document as a list of distinct words and their frequencies. So the sentence above would become the following word counts: the (1), man (1), was (1), wearing (1), a (2), jacket (1), with (1), gold (1), star (1).
So in order to capture a given document, we need a feature for each unique word in it. And for a given document, the value for each feature is the number of times the word appears in the document (so for our earlier example, every word appears once besides “a”, which appears twice).
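For the curious, a bag of words takes only a few lines of Python. This is a minimal sketch using the standard library; the function name `bag_of_words` and the simple lowercase-and-split tokenizer are my own assumptions, and real NLP libraries tokenize more carefully.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into words, and count each distinct word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

sentence = "The man was wearing a jacket with a gold star."
bag = bag_of_words(sentence)

# One feature per distinct word; the value is how often it appears
for word, count in bag.items():
    print(word, count)
```

Each key in `bag` is a feature and each value is that feature's count for this one document.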
Now imagine that our document is not an isolated one but rather part of a much larger corpus. Let’s first get the lingo out of the way:
- Document is whatever you define a single observation (a.k.a. a single bag of words) to be. It can range from a single sentence to a whole article or even an entire book — how you define it will be determined by the objective of your analysis.
- Corpus is the totality of all your text data — or in other words, all the documents in your dataset.
If your corpus is large, then you will probably have at least tens of thousands of unique words in it (more if your corpus includes a lot of names). Just attempting to picture that bag of words makes my head hurt. And attempting to run algorithms on it would be both extremely slow and probably unhelpful: it’s highly likely that you will have more features (distinct words) than observations (documents).
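Even a toy corpus shows how quickly features outnumber observations. The sketch below is an illustration under my own assumptions: the three sentences are made up, and `document_term_matrix` is a hypothetical helper, not a library function. It builds one row per document and one column per distinct word in the corpus.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def document_term_matrix(corpus):
    """Return (vocabulary, matrix): one row per document, one column per distinct word."""
    bags = [Counter(tokenize(doc)) for doc in corpus]
    vocabulary = sorted(set().union(*bags))  # every distinct word in the corpus
    matrix = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, matrix

# Made-up corpus of three short documents
corpus = [
    "The man was wearing a jacket with a gold star.",
    "A woman in a red coat waved at the man.",
    "The gold star was pinned to the jacket.",
]
vocabulary, matrix = document_term_matrix(corpus)
print(len(corpus), "documents,", len(vocabulary), "features")
```

Most entries in the matrix are zero, because each document uses only a small slice of the full vocabulary. That sparsity only gets worse as the corpus grows.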
NLP to the Rescue
Now let’s use a practical example to see how NLP helps sift through the dimensionality to reveal signal. Imagine that we want to recommend a few books to a friend. How would we go about doing that?
One way would be to ask:
“Hey, name a few books that you read recently that you really liked.”
And then based on the reply, recommend a few of our favorite books that are most similar to the ones that he or she listed. We just described a simple recommender system.
In order to do what we just described algorithmically, we need to be able to figure out how to measure whether two books are similar or not. We could represent both books as bags of words and try comparing them using a distance measure like Euclidean distance, but is that actually helpful? The answer is no for several reasons:
The first reason is stop words (really common words like “the”, “a”, “it”, “and”, etc.) — these words occur very frequently in pretty much all documents and they would inject a lot of meaningless noise into our similarity score (knowing that both books contain many instances of the word “the” would not be helpful at all).
And even if we removed all the stop words, the Curse of Dimensionality still affects us. There are so many distinct words in a book and many of them have zero correlation to the actual topic of the book. Thus, it’s highly likely for our similarity measure to latch onto one of these noise words — this is basically the text version of spurious correlation. For example, we could have a book about fire fighters and a book about salmon fishing, but rank them as highly similar because our algorithm noticed that the words “pole” and “engine” occur frequently in both. This is an accidental and meaningless similarity and it would be problematic if we acted upon it.
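Both problems show up even in a tiny example. The sketch below is an illustration under stated assumptions: the two one-line "books" and the short stop word list are made up, and I use cosine similarity rather than Euclidean distance simply because it gives a score between 0 and 1 that is easy to read.

```python
import math
import re
from collections import Counter

# A tiny, hypothetical stop word list; real lists are much longer
STOP_WORDS = {"the", "a", "and", "of", "in", "to", "it", "on", "down"}

def bag_of_words(text, stop_words=frozenset()):
    """Count words in the text, skipping anything in stop_words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b)

firefighters = "the crew slid down the pole and the engine left the station"
fishing = "the angler set the pole and the engine pushed the boat to the salmon"

raw = cosine_similarity(bag_of_words(firefighters), bag_of_words(fishing))
bag_a = bag_of_words(firefighters, STOP_WORDS)
bag_b = bag_of_words(fishing, STOP_WORDS)
filtered = cosine_similarity(bag_a, bag_b)
shared = set(bag_a) & set(bag_b)

print("similarity with stop words:", round(raw, 2))
print("similarity without stop words:", round(filtered, 2))
print("shared content words:", sorted(shared))
```

The raw score is inflated mostly by "the". Removing stop words lowers it, but whatever similarity remains comes entirely from "pole" and "engine", which is the spurious match described above.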
This is where topic modeling comes in. Topic modeling is the practice of using a quantitative algorithm to tease out the key topics that a body of text is about. It bears a lot of similarities with something like PCA, which identifies the key quantitative trends (that explain the most variance) within your features. The outputs of PCA are a way of summarizing our features. For example, PCA allows us to go from something like 500 features to 10 summary features. These 10 summary features are basically topics.
In NLP, it works almost exactly the same way. We want to distill our total corpus of books and its 100,000 features (distinct words) into 7 topics (I decided on 7 topics arbitrarily). And once we know the topics along with what they consist of, we can transform each book in our corpus from a noisy bag of words to a clean portfolio of topic loadings.
Now we’re in business. Similarity scores calculated using each book’s topic loadings are a lot more useful than ones calculated using the raw bag of words because spurious similarities are now much less likely.
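Here is a minimal sketch of what comparing books in "topics space" could look like. Everything in it is hypothetical: the 7 topic labels, the three books, and the loadings are numbers I made up for illustration, not the output of any model, and cosine similarity is just one reasonable choice of score.

```python
import math

# Hypothetical topic labels; a real model would not hand us names like these
TOPICS = ["data science", "travel", "business", "history",
          "cooking", "fiction", "sports"]

# Made-up topic loadings, one value per topic for each book
loadings = {
    "book_a": [0.70, 0.00, 0.20, 0.05, 0.00, 0.05, 0.00],
    "book_b": [0.60, 0.05, 0.25, 0.05, 0.00, 0.05, 0.00],
    "book_c": [0.00, 0.75, 0.05, 0.15, 0.05, 0.00, 0.00],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

sim_ab = cosine_similarity(loadings["book_a"], loadings["book_b"])
sim_ac = cosine_similarity(loadings["book_a"], loadings["book_c"])
print("book_a vs book_b:", round(sim_ab, 2))
print("book_a vs book_c:", round(sim_ac, 2))
```

With only 7 numbers per book, the two books that load on the same topic score as close and the third one does not, and there are no stray words left for the score to latch onto.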
Even the descriptive statistics of a book are more meaningful in “topics space” than in “bag of words space” — we can now say a book loads heavily on the data science topic instead of puzzling over why the two most frequent words in our bag of words are “forest” and “random”.
A Thought Exercise
In our previous example, we decided to express our book as loadings on 7 topics. But we could have gone with any number of topics (picking the number of topics is more art than science and depends heavily on your data — thus knowing your data well is critical). 10 works too, so does 100. But think about what happens as we keep increasing the number of topics and each topic becomes increasingly granular — the algorithm begins to lose the ability to see the big picture.
For example, let’s say we have three books — Book 1 is a French travel guide, Book 2 is a Chinese travel guide, and Book 3 is an economic history of the urbanization of China. With our 7 topics NLP model, we would classify Books 1 and 2 as travel books (and score them as similar to each other) and Book 3 as a business book (and score it as not similar to the others).
With 5,000 topics, we might classify Book 1 as “Cycling Rural France”, Book 2 as “Traveling Urban China”, and Book 3 as “History Urban China”. Now it is much less clear how we would score them — the algorithm might just throw up its hands and rate all 3 books as equally similar/different. That’s not necessarily wrong (depending on the application) but it does show how high dimensional data (which 5,000 topics definitely is) can inject noise that distorts our analysis in unintended ways.
A general rule of thumb when topic modeling is to be only as specific as your end application requires you to be, never more.
I realize that so far I’ve been suitably vague on how we actually come up with our topics. That’s because I wanted to fully explore why NLP is important. Next time, I will cover (with Python code) two topic modeling algorithms — LDA (latent Dirichlet allocation) and NMF (non-negative matrix factorization).
Until then, thanks for reading and cheers!
More Data Science and Analytics Related Posts By Me:
- The Curse of Dimensionality
- Understanding PCA
- Business Strategy For Data Scientists
- Business Simulations With Python
- Understanding Bayes’ Theorem
- Understanding The Naive Bayes Classifier
- The Binomial Distribution
Bio: Tony Yiu is a Data Scientist at Solovis. He enjoys slicing through data and building models in order to better understand the problems and opportunities that businesses face.
Original. Reposted with permission.