An Introduction to NLP and 5 Tips for Raising Your Game

This article is a collection of things the author would like to have known when they started out in NLP. Perhaps it will be useful for you.

For those working around Data Science, Machine Learning and/or Artificial Intelligence, NLP is probably one of the most exciting fields to work in.


NLP stands for Natural Language Processing, and it’s about the interactions between computers and human languages: programming algorithms capable of processing and analyzing large amounts of natural language data.


The underlying objective may vary, but the overall goal is to draw conclusions about human behaviour: our intentions when writing something, what we were thinking or feeling when we wrote it, or the category of the item we were writing about. Other applications include chatbots, market segmentation of customers, finding duplicates and similarities between elements, virtual assistants (like Siri or Alexa) and much more.

Nonetheless, NLP as a subject did not appear that long ago. It was in 1950 that Alan Turing published an article called “Computing Machinery and Intelligence”, which proposed what is now called the ‘Turing test’. The paper introduced the question ‘Can machines think?’, and the test evaluates a machine’s ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Three participants are necessary for running the test: player C, the evaluator, is given the task of trying to determine which of the other two players, A or B, is a computer and which is a human.


The Japanese robot Pepper, made by Aldebaran Robotics


The evaluator would then judge natural language conversations between a human and a machine designed to generate human-like responses, knowing that one of the two partners in conversation is a machine. The conversation would be limited to a text-only channel and the results do not depend on the machine’s ability to give correct answers to questions, only how closely its answers resemble those a human would give. If at the end of the test the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test.

Since then, the field has evolved rapidly, going from hand-coded systems based on sets of rules to more sophisticated statistical NLP. In this context, some companies are doing pretty exciting things in the field. For example, if you’re an Android user you’re probably familiar with Swiftkey, a startup using text prediction designed to boost the accuracy, fluency and speed of users’ writing. Swiftkey learns from our writing, predicting favourite words, emojis and even expressions. Another startup, SignAll, converts sign language into text, helping individuals who are deaf communicate with those who don’t know sign language.

The fact is that nowadays the spread of open source tools such as Python, TensorFlow, Keras and others has made NLP accessible, and each day more and more businesses are using it. Some of them hire other companies specialized in the subject, while others hire Data Scientists and Data Analysts in order to build their own solutions.

If either of these is your case, whether you are the company or the data specialist, in the following lines I will share some of what I have learned while working with NLP. Lucky for you, all of these tips are based on mistakes! So hopefully you will be able to avoid them in advance, unlike me :)


1. Find the right type of vectorization for you

In NLP, usually after lots and lots and lots (and probably lots more) of data cleaning, the magic starts with something called vectorization. This tool, technique, or whatever you want to call it, takes a bunch of texts, usually called documents, and transforms them into vectors according to the words appearing within each document. Take the following example:


Example created by the author using images from


In the example above we are using a tool known as Count Vectorizer or Bag of Words. This kind of vectorization usually discards grammar, order, and structure in the text. It is a great option since it keeps track of all words appearing within the documents, and its simple way of processing them, just counting, is easily understandable and gives us a clear picture of the most important words overall. However, it presents three main problems:

  • Data sparsity: when counting all appearances throughout documents, we can easily end up with a matrix composed of vectors full of zeros since, of course, each document will only contain a small fraction of all the possible words. We’ll talk more about this later.
  • The future itself: a Count Vectorizer outputs a fixed-size matrix with all words (or those of certain frequencies) appearing in our current documents. This could be a problem if we receive further documents in the future and we don’t know the words we might find.
  • Tricky documents: what happens if we have a document in which a specific word appears so many times that it ends up looking as if it were the most common word throughout all the documents, instead of just a word appearing lots of times in a single document?
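As a rough illustration of the first two problems, here is a minimal sketch using Sklearn's CountVectorizer. The three-sentence corpus and the variable names are made up for this example; real vocabularies are far larger, which makes the share of zeros much higher.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # SciPy sparse matrix, one row per document

# Columns are ordered alphabetically by term
vocabulary = sorted(vectorizer.vocabulary_)
dense = counts.toarray()

# Data sparsity: share of cells in the matrix that are zero
sparsity = 1.0 - counts.nnz / (counts.shape[0] * counts.shape[1])

# The future itself: a word unseen during fit ("parrot") is silently dropped
new_counts = vectorizer.transform(["the parrot sat"])

print(vocabulary)
print(dense)
print(round(sparsity, 2))
print(new_counts.toarray())
```

Even with only three short documents, more than half of the cells are zero, and the new document loses its only distinctive word.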

To tackle the first and second problems we could use a Hashing Vectorizer, which converts a collection of text documents to a matrix of occurrences calculated with the hashing trick. Each word is mapped to a feature with the use of a hash function that converts it to a number. If we encounter that word again in the text, it will be converted to the same hash, allowing us to count word occurrences without retaining a dictionary in memory. The main drawback of this trick is that it’s not possible to compute the inverse transform, and thus we lose information on what words the important features correspond to.
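The hashing trick can be sketched with Sklearn's HashingVectorizer as follows. The value of `n_features` is an arbitrary choice for this example, and `alternate_sign=False` with `norm=None` are settings I picked so the output stays as plain counts and is easier to compare with a Count Vectorizer.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# n_features fixes the output width up front (2**10 is arbitrary here).
# alternate_sign=False and norm=None keep raw, non-negative counts.
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)

# Stateless: no fit step is needed and no vocabulary is kept in memory
hashed = hasher.transform(docs)

# Words never seen before still land in one of the existing columns
unseen = hasher.transform(["a brand new parrot document"])

print(hashed.shape, unseen.shape)
```

Both matrices have the same number of columns no matter which words show up, but there is no way to go from a column index back to the word that produced it.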

To solve the third problem mentioned above, we could use a term frequency-inverse document frequency (tf-idf) vectorizer. A tf-idf score tells us which words are most discriminating between documents. Words that occur a lot in one document but don’t occur in many documents contain a great deal of discriminating power. The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It enhances terms that are highly specific to a particular document while suppressing terms that are common to most documents.
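A small sketch of this idea with Sklearn's TfidfVectorizer, on a made-up corpus where "pool" is repeated inside a single document and "house" appears in every document. With default settings Sklearn uses a smoothed idf, ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
docs = [
    "pool pool pool pool house",   # "pool" is repeated, but only in this document
    "house with garden",
    "house near the station",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

# idf_ holds one value per term, in the same (alphabetical) order as the columns:
# rare terms get a high value, terms present in every document get the lowest one
idf = dict(zip(sorted(tfidf.vocabulary_), tfidf.idf_))

pool_col = tfidf.vocabulary_["pool"]
house_col = tfidf.vocabulary_["house"]

print(idf)
print(weights[0, pool_col], weights[0, house_col])
```

"house" ends up with the lowest idf because it carries no information for telling the documents apart, while "pool" keeps a high weight only in the one document where it appears.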

Sklearn has implementations of all three types of vectorization: CountVectorizer, HashingVectorizer and TfidfVectorizer.


2. Personalize stop words and be aware of the language in your data

Using stop words when doing any kind of vectorization is a key step for getting reliable results. By passing a list of stop words to our algorithm we’re telling it: ‘please ignore all these words if you find any, I don’t want to have them in my output matrix’. Sklearn does include a default list of stop words for us to use, just by passing the word ‘english’ to the ‘stop_words’ hyperparameter. However, there are several limitations:

  • It only includes basic words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text and which may be removed to avoid them being construed as a signal for prediction. However, if, for example, you were processing descriptions of houses scraped from rental agencies’ websites, you would probably want to remove all words that do not add to the description of the property itself: words such as ‘opportunity’, ‘offer’, ‘amazing’, ‘great’, ‘now’ and the like.
  • And what has been for me the greatest drawback, being a Spanish speaker and working with machine learning problems in that language: it’s only available in English.

So whether you want to enrich the default list of words in English to improve your output matrix, or you want to use a list in some other language, you can pass Sklearn’s algorithm a personalized list of stop words by using the hyperparameter ‘stop_words’. By the way, here’s a GitHub repository with an impressive number of lists in several languages.
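Both options could look roughly like the sketch below. The marketing words come from the rental example above; the short Spanish list is a tiny hand-made sample for illustration, not a complete stop word list, and the two listings are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Domain-specific words to ignore (pick your own for your data)
marketing_words = {"opportunity", "offer", "amazing", "great", "now"}

# Option 1: extend the built-in English list (stop_words expects a list)
english_plus = list(ENGLISH_STOP_WORDS.union(marketing_words))

# Option 2: supply a list for another language (tiny illustrative sample)
spanish_stop_words = ["de", "la", "el", "en", "con", "un", "una", "los", "las"]

listing = ["Amazing opportunity: great flat with balcony, offer ends now"]
english_vec = CountVectorizer(stop_words=english_plus).fit(listing)

anuncio = ["Piso luminoso con terraza en el centro de la ciudad"]
spanish_vec = CountVectorizer(stop_words=spanish_stop_words).fit(anuncio)

print(sorted(english_vec.vocabulary_))
print(sorted(spanish_vec.vocabulary_))
```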

Before jumping into the next point, just keep in mind that sometimes you won’t want to use any stop words at all. For example, if you’re dealing with numbers, even the default English list of stop words within Sklearn includes several number words, such as ‘one’, ‘two’ and ‘ten’. So it’s important to ask yourself whether or not you’re working on an NLP problem that needs stop words.
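A quick way to see the effect is to fit the same vectorizer with and without the default list. The sentence below is made up for this check.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["two bedrooms and one bathroom on floor ten"]

with_stop_words = CountVectorizer(stop_words="english").fit(docs)
without_stop_words = CountVectorizer().fit(docs)

# Number words such as "two" disappear when the default English list is used
print(sorted(with_stop_words.vocabulary_))
print(sorted(without_stop_words.vocabulary_))
```

If the number of bedrooms matters for your problem, the default list is silently throwing that information away.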


3. Use a stemmer for ‘grouping’ similar words

Text normalization is the process of converting slightly different versions of words with essentially equivalent meaning into the same features. In some cases, it might be sensible to consider all possible variants of a word, but whether you’re working in English or any other language, sometimes you’ll also want to pre-process your documents so that words sharing the same meaning are represented in the same way. For example, consultant, consulting, consult, consultative and consultants could all be expressed as just ‘consultant’. See the next table for more examples:


Source: — Generate and verify stemmed words


To do this, we could use stemming. Stemmers remove morphological affixes from words, leaving only the word stem. Luckily for us, the NLTK library for Python contains several robust stemmers. And if you want to incorporate a stemmer for your specific language, or any other, into your vectorizer algorithm, you can use the following piece of code:

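from nltk.stem.snowball import SpanishStemmer
from sklearn.feature_extraction.text import CountVectorizer

spanish_stemmer = SpanishStemmer()

class StemmedCountVectorizerSP(CountVectorizer):
    def build_analyzer(self):
        # Reuse the default analyzer (tokenizing, lowercasing, stop words), then stem each token
        analyzer = super().build_analyzer()
        return lambda doc: [spanish_stemmer.stem(w) for w in analyzer(doc)]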

You can easily adapt this to HashingVectorizer or TfidfVectorizer, just by changing the base class that the new class inherits from.
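For instance, swapping the base class for TfidfVectorizer might look like the sketch below. To keep it self-contained I use `toy_stem`, a hypothetical stand-in that only strips a handful of English suffixes; it is not a real stemmer, and in practice you would plug in an NLTK stemmer's `stem` method instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few English suffixes, nothing more."""
    for suffix in ("ants", "ant", "ing", "ative", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Same pattern as before: default analyzer first, then stem every token
        analyzer = super().build_analyzer()
        return lambda doc: [toy_stem(w) for w in analyzer(doc)]


docs = ["consultant consulting consult", "consultative consultants"]
vectorizer = StemmedTfidfVectorizer().fit(docs)

# All five variants collapse into a single feature
print(vectorizer.vocabulary_)
```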


4. Avoid using Pandas DataFrames

This piece of advice is short and sweet: if you’re working on an NLP project with more than 5–10 thousand rows of data, avoid using DataFrames. Storing the vectorized output of a large number of documents in a Pandas DataFrame gives you a massive matrix that is very slow to handle. On top of that, NLP projects often involve things like measuring distances, which tends to be very slow since elements need to be compared against each other. And even though I’m a heavy user of Pandas’ DataFrames myself, for this kind of work I would recommend using Numpy arrays or sparse matrices.
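As a sketch of what working directly on the sparse output can look like, the example below computes pairwise cosine similarities without ever building a DataFrame. The three documents are invented, and cosine similarity is just one possible distance-like measure, chosen here for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "bright flat with balcony",
    "bright flat with terrace",
    "garage for two cars",
]

# The vectorizer output is already a SciPy sparse matrix
X = TfidfVectorizer().fit_transform(docs)
print(sparse.issparse(X))

# cosine_similarity accepts sparse input; the result is an (n_docs, n_docs) array
similarities = cosine_similarity(X)
print(similarities.round(2))
```

The first two documents share most of their words and score high, while the third shares none with the first and scores zero.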

Also, keep in mind that you can always turn your sparse matrix into an array just by using the ‘.toarray()’ method, and vice versa, from array to sparse matrix, using:

from scipy import sparse

my_sparse_matrix = sparse.csr_matrix(my_array)
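To get a feel for why this matters, here is a minimal round trip on a mostly-zero matrix, with a rough comparison of the memory each representation needs. The matrix shape and values are arbitrary, chosen only to mimic a document-term matrix.

```python
import numpy as np
from scipy import sparse

# A mostly-zero matrix, similar in spirit to a document-term matrix
my_array = np.zeros((1000, 5000), dtype=np.int64)
my_array[0, 10] = 3
my_array[42, 4999] = 1

my_sparse_matrix = sparse.csr_matrix(my_array)   # dense -> sparse
back_to_dense = my_sparse_matrix.toarray()       # sparse -> dense

dense_bytes = my_array.nbytes
# CSR stores only the non-zero values plus two index arrays
sparse_bytes = (
    my_sparse_matrix.data.nbytes
    + my_sparse_matrix.indices.nbytes
    + my_sparse_matrix.indptr.nbytes
)

print(dense_bytes, sparse_bytes)
```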

By the way, if you’re dealing with time issues, remember you can time your code using the following:

import time

start = time.time()
# ... whatever you want to time goes here ...
end = time.time()
print(end - start)


5. Data sparsity: make your output matrix usable


As said before, one of the biggest problems when working with NLP is data sparsity: ending up with matrices of tens of thousands of columns full of zeros, which makes it impossible to apply certain techniques afterwards. Here are a few things I have used in the past to deal with this problem:

  • Use the ‘max_features’ hyperparameter of TfidfVectorizer or CountVectorizer. For example, you could print out the word frequencies across documents and then set a certain threshold for them. Imagine you have set a threshold of 50, and your corpus vocabulary consists of 100 words. After looking at the word frequencies, you see that 20 words occur fewer than 50 times. Thus, you set max_features=80 and you are good to go. If max_features is set to None, then the whole vocabulary is considered during the transformation. Otherwise, if you pass, say, 5 to max_features, the feature matrix is built out of the 5 most frequent words across the text documents.
  • Set ‘n_features’ in HashingVectorizer. This hyperparameter sets the number of features/columns in the output matrix. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners. The number is up to you and what you need.
  • Use dimensionality reduction. Techniques such as Principal Component Analysis, which turn an output matrix with tens of thousands of columns into a much smaller set of columns capturing most of the variance of the original matrix, could be a great idea. Just remember to analyze how much this dimensionality reduction affects your final results, to check whether it’s actually useful and also to select the number of dimensions to use.
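The three options above can be sketched together as follows. The corpus and all the numbers (5 terms, 64 hashed columns, 2 components) are arbitrary choices for illustration. Note also that for the last step I use TruncatedSVD as a stand-in for PCA, since it accepts sparse input directly; that substitution is my assumption, so check which technique suits your data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration only
docs = [
    "bright flat with balcony near the station",
    "bright flat with terrace near the park",
    "house with garden and garage",
    "garage for two cars near the station",
]

# 1) max_features: keep only the 5 most frequent terms across the corpus
top_terms = TfidfVectorizer(max_features=5).fit_transform(docs)

# 2) n_features: fix the number of hashed columns up front
hashed = HashingVectorizer(n_features=64).transform(docs)

# 3) Dimensionality reduction: project the full tf-idf matrix onto 2 components
full = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(full)

# Share of the original variance kept by the 2 components
explained = svd.explained_variance_ratio_.sum()

print(full.shape, top_terms.shape, hashed.shape, reduced.shape)
print(round(explained, 2))
```

Looking at the explained variance for different numbers of components is one simple way to decide how many dimensions to keep before checking the effect on your final results.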

I really hope these lessons help in your NLP project. More stories about NLP will come in the future, but if you enjoyed this one, don’t forget to check out some of my latest articles, like how to divide your data into train and test sets assuring representativeness, survivorship bias in Data Science, and using a cluster in the cloud for Data Science projects in 4 simple steps. All of them and more are available on my Medium profile.

And if you want to receive my latest articles directly on your email, just subscribe to my newsletter :)

Thanks for reading!


Bio: After 5+ years of experience in eCommerce and Marketing across multiple industries, Gonzalo Ferreiro Volpi pivoted into the world of Data Science and Machine Learning, and currently works at Ravelin Technology using a combination of machine learning and human insights to tackle fraud in eCommerce.

Original. Reposted with permission.