How to Begin Your NLP Journey

In this blog post, learn how to process text using Python.

By Diego Lopez Yse, Data Scientist

Natural Language Processing (NLP) is one of the most exciting fields in Artificial Intelligence. It allows machines to process and understand human language in a variety of ways, and it’s triggering a revolution in the way we interact with systems and technology.

In a previous post I talked about NLP, its real-world applications, and some of its core concepts. Now I want to show you that NLP is as real as it gets, and anyone can start learning it. How? Let’s start with a simple text, and perform some Exploratory Data Analysis (EDA) around it using some NLP techniques. This way we can make sense out of data with simple and powerful tools before getting ourselves busy with any model or more complex tasks.

Define your text

Stephen Hawking once said:

“Artificial Intelligence (AI) is likely to be either the best or the worst thing to happen to humanity”

I couldn’t agree more with him, and time will tell what will actually happen. Nevertheless, this is a proper sentence to test some NLP techniques. To do that, let’s start by saving the phrase as a variable called “text”:

text = "Artificial Intelligence (AI) is likely to be either the best or the worst thing to happen to humanity."

Using the langdetect library, we can check its language and find the probability that it is written in that language:

from langdetect import DetectorFactory, detect_langs
DetectorFactory.seed = 0  # langdetect is non-deterministic; a fixed seed makes results repeatable
print(detect_langs(text))

With a certainty of more than 99.9%, we can state that this phrase is written in English. You should also consider using a spell checker to correct any spelling mistakes.

What about the number of characters?

len(text)

We have 102 characters, including blank spaces. And the number of distinct characters?

len(set(text))

Let’s take a look at them:

print(sorted(set(text)))

There’s something interesting here. We’re counting not only non-alphanumeric characters like “(” and “.”, but also blank spaces. What’s more, uppercase letters are treated as different characters from their lowercase counterparts.
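The effect of case and punctuation on the count can be checked directly. The sketch below is illustrative only and uses just the standard library; the names distinct_raw, letters and char_counts are introduced here and are not part of the original walkthrough.

```python
from collections import Counter

text = "Artificial Intelligence (AI) is likely to be either the best or the worst thing to happen to humanity."

# Distinct characters exactly as they appear (case-sensitive, with spaces and punctuation)
distinct_raw = set(text)

# Letters only, with case ignored
letters = [ch.lower() for ch in text if ch.isalpha()]
distinct_letters = set(letters)

# How often each letter occurs, most frequent first
char_counts = Counter(letters)

print(len(text), len(distinct_raw), len(distinct_letters))
print(char_counts.most_common(3))
```

Once case, spaces and punctuation are ignored, the number of distinct characters drops from 26 to 20.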

Tokenization

Tokenization is the process of segmenting running text into sentences and words. In essence, it’s the task of cutting a text into pieces called tokens. We use the NLTK library to perform this task:

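import nltk
from nltk.tokenize import word_tokenize
nltk.download("punkt")  # tokenizer models needed by word_tokenize; newer NLTK releases may ask for "punkt_tab" instead
tokenized_word = word_tokenize(text)
print(tokenized_word)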

We can see that tokenization produces a list of words:

type(tokenized_word)

This means we can index and slice it like any other list:

tokenized_word[2:9]

How many tokens do we have?

len(tokenized_word)

And unique tokens?

len(set(tokenized_word))

Now we can calculate a measure related to the lexical richness of the text:

len(set(tokenized_word)) / len(tokenized_word)

This shows that the number of distinct tokens is 85.7% of the total number of tokens (punctuation marks count as tokens here).
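To get a feel for what a tokenizer does, here is a rough sketch that splits the same sentence with a regular expression and then computes the ratio above. It is not how NLTK’s word_tokenize works internally, and it will differ from it on harder input such as contractions, abbreviations or URLs. The helper names simple_tokenize and lexical_richness are made up for this illustration.

```python
import re

text = "Artificial Intelligence (AI) is likely to be either the best or the worst thing to happen to humanity."

def simple_tokenize(s):
    # A run of word characters is one token; any other non-space character is its own token
    return re.findall(r"\w+|[^\w\s]", s)

def lexical_richness(tokens):
    # Share of distinct tokens among all tokens (often called the type-token ratio)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

tokens = simple_tokenize(text)
print(tokens)
print(len(tokens), len(set(tokens)), round(lexical_richness(tokens), 3))
```

On this sentence the sketch yields 21 tokens, 18 of them distinct, which gives the same 85.7% ratio.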

Lowercase & punctuation

Now let’s lowercase the tokens to standardize them and to prepare for stopword removal:

tk_low = [w.lower() for w in tokenized_word]
print(tk_low)

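Next, we remove non-alphanumeric tokens with a small helper:

def remove_punct(tokens):
    # assumed definition (the helper is not shown in the post): keep only tokens made of letters or digits
    return [t for t in tokens if t.isalnum()]
tk_low_np = remove_punct(tk_low)
print(tk_low_np)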

Let’s visualize the cumulative frequency distribution of words:

from nltk.probability import FreqDist
fdist = FreqDist(tk_low_np)
fdist.plot(title="Word frequency distribution", cumulative=True)

We can see that the words “to” and “the” appear most often, but they don’t really add information to the text. They are what’s known as stopwords.
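If you just want the numbers behind that plot, a plain counter is enough. The sketch below uses collections.Counter from the standard library rather than NLTK’s FreqDist, and it hard-codes the lowercased, punctuation-free tokens so that it runs on its own.

```python
from collections import Counter
from itertools import accumulate

tokens = ["artificial", "intelligence", "ai", "is", "likely", "to", "be",
          "either", "the", "best", "or", "the", "worst", "thing", "to",
          "happen", "to", "humanity"]

counts = Counter(tokens)
ranked = counts.most_common()  # (word, count) pairs, most frequent first

# Running total of occurrences, in the same order as `ranked`
cumulative = list(accumulate(count for _, count in ranked))

for (word, count), total in zip(ranked, cumulative):
    print(f"{word:<12} {count:>2} {total:>3}")
```

The last running total equals the number of tokens, which is the value the cumulative plot levels off at.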

Stopwords removal

Stopword removal means getting rid of common articles, conjunctions, pronouns and prepositions, such as “and”, “the” or “to” in English. These very frequent words add little or no value to most NLP objectives, so they are filtered out of the text before further processing.

First, we need to create a list of stopwords and filter them out of our list of tokens:

from nltk.corpus import stopwords
nltk.download("stopwords")  # the stopword lists must be downloaded once
stop_words = set(stopwords.words("english"))
print(stop_words)

We’ll use this list from the NLTK library, but bear in mind that you can create your own set of stopwords. Let’s look for the word “the” in the list:

print("the" in stop_words)

Now, let’s clean our text from these stopwords:

filtered_text = []
for w in tk_low_np:
    if w not in stop_words:
        filtered_text.append(w)
print(filtered_text)

We can see that the words “is”, “to”, “be”, “the” and “or” were removed from our text. To update the cumulative frequency distribution of words, build a new FreqDist from filtered_text and plot it as before.

Stopwords should be removed with care, since doing so can cause serious problems in other tasks like sentiment analysis. If a word’s context is affected (e.g. by removing the word “not”, which negates what follows), the meaning of the passage can change.
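Here is a small sketch of that risk, using a toy stopword set and a made-up sentence. Both are assumptions for illustration; NLTK’s English list is much longer, though it does include “not”, “no” and “nor”.

```python
# Toy stopword set for illustration only; this is not NLTK's list
stop_words = {"the", "is", "a", "to", "not", "no", "nor"}
tokens = "the movie is not good".split()

# Naive filtering drops the negation and flips the apparent sentiment
naive = [w for w in tokens if w not in stop_words]

# Taking negation words out of the stopword set preserves the meaning
negations = {"not", "no", "nor"}
safe_stop_words = stop_words - negations
careful = [w for w in tokens if w not in safe_stop_words]

print(naive)
print(careful)
```

The same set difference should work on the stop_words set built from NLTK above, since it is an ordinary Python set.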

Beyond this example, you may also need to handle contractions (such as “doesn’t”, which should be expanded to “does not”) and accents or diacritics (such as “cliché” or “naïve”, which are often normalized by removing the diacritics).
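Both steps can be sketched with the standard library. The contraction table below is a tiny hand-written example, not a complete resource, and the function names expand_contractions and strip_diacritics are hypothetical. Diacritics are removed through Unicode decomposition with the unicodedata module.

```python
import unicodedata

# Tiny hand-written table for illustration; real text needs a much larger mapping
CONTRACTIONS = {"doesn't": "does not", "can't": "cannot"}

def expand_contractions(tokens):
    # Replace each known contraction with its expanded form, then re-split into words
    expanded = " ".join(CONTRACTIONS.get(t.lower(), t) for t in tokens)
    return expanded.split()

def strip_diacritics(word):
    # NFKD splits "é" into "e" plus a combining accent; the combining marks are then dropped
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(expand_contractions(["it", "doesn't", "matter"]))
print([strip_diacritics(w) for w in ["cliché", "naïve"]])
```

Text copied from the web often uses a curly apostrophe (’) instead of a straight one, so it is worth normalizing the apostrophe before looking a token up in the table.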

Regular Expressions

Regular expressions (called REs, or RegExes) are a tiny, highly specialized programming language embedded inside Python and made available through the re module. By using them, you specify the rules for the set of possible strings that you want to match. You can ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”.

For example, let’s search for words ending with “st”:

import re
[w for w in filtered_text if re.search(r"st$", w)]

Or count the number of vowels in the first word (“artificial”):

len(re.findall(r"[aeiou]", filtered_text[0]))

You can even modify texts based on conditions. For example, replace the letters “ce” with the letter “t” in the second word (“intelligence”):

x = re.sub('ce', 't', filtered_text[1])
print(x)
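The two questions quoted earlier map roughly onto different functions in the re module. Here is a short sketch, with the word list hard-coded so that it runs on its own:

```python
import re

words = ["artificial", "intelligence", "best", "worst", "thing"]

# "Is there a match anywhere in this string?" -> re.search scans the whole string
anywhere = [w for w in words if re.search(r"in", w)]

# "Does this string match the pattern?" -> re.fullmatch must cover the entire string
whole = [w for w in words if re.fullmatch(r"[a-z]+st", w)]

# re.match only tries at the start of the string, so "st" is not found in "best" this way
at_start = re.match(r"st", "best")

# Compile a pattern once when it is reused many times
ends_with_st = re.compile(r"st$")
ending = [w for w in words if ends_with_st.search(w)]

print(anywhere, whole, at_start, ending)
```

Mixing up re.match and re.search is a common source of silent misses, so it helps to decide up front whether the pattern must start at the beginning of the string.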

You can find more examples of regular expressions following this link.

Conclusion

We’ve only scratched the surface of all the possible and more complex NLP techniques out there. And it’s not just structured texts that you may want to analyze, but all the data generated from conversations, declarations or even tweets, which are examples of unstructured data. Unstructured data doesn’t fit neatly into the traditional row and column structure of relational databases, and it represents the vast majority of the data available in the real world. It is messy and hard to manipulate.

NLP is seriously booming thanks to the huge improvements in the access to data and the increase in computational power, which allow us to achieve meaningful results in areas like healthcare, media, finance and human resources, among others.

My suggestion is: learn about NLP. Try different data sources and techniques. Experiment, fail and improve yourself. This discipline will impact every possible industry, and we are likely to reach a level of advancement in the coming years that will blow our minds.

Interested in these topics? Follow me on LinkedIn or Twitter.

Bio: Diego Lopez Yse is an experienced professional with a solid international background acquired in different industries (capital markets, biotechnology, software, consultancy, government, agriculture). Always a team member. Skilled in Business Management, Analytics, Finance, Risk, Project Management and Commercial Operations. MS in Data Science and Corporate Finance.

Original. Reposted with permission.
