Introduction to Natural Language Processing (NLP)
Have you ever wondered how your personal assistant (e.g., Siri) is built? Do you want to build your own? Perfect! Let’s talk about Natural Language Processing.
So, What is Natural Language Processing (NLP)?
NLP is an interdisciplinary field concerned with the interactions between computers and natural human languages (e.g. English) — speech or text. NLP-powered software helps us in our daily lives in various ways, for example:
- Personal assistants: Siri, Cortana, and Google Assistant.
- Auto-complete: In search engines (e.g. Google).
- Spell checking: Almost everywhere, in your browser, your IDE (e.g. Visual Studio), desktop apps (e.g. Microsoft Word).
- Machine Translation: Google Translate.
Okay, now we get it: NLP plays a significant role in our daily computer interactions. Let’s take a look at some business-related use cases for NLP:
- Fast-food chains receive a vast number of orders and complaints daily; handling these manually would be tiresome and repetitive, as well as inefficient in terms of time, labour, and cost. Thanks to recent advancements in conversational AI, they can build virtual assistants that automate such processes and reduce human intervention.
- Brands launch new products and market them on social media platforms; they can measure campaigns’ success rates using metrics such as reach and number of interactions. Still, they can’t understand the consumers’ public sentiment automatically. This task can be automated using sentiment analysis, a text classification task where machine learning models are trained to quantify affective states and subjective information.
NLP is mainly divided into two fields: Linguistics and Computer Science.
The Linguistics side focuses on understanding the structure of language, including the following sub-fields [Bender, 2013]:
- Phonetics: The study of the sounds of human language.
- Phonology: The study of the sound systems in human languages.
- Morphology: The study of the formation and internal structure of words.
- Syntax: The study of the formation and internal structure of sentences.
- Semantics: The study of the meaning of sentences.
- Pragmatics: The study of the way sentences with their semantic meanings are used for particular communicative goals.
The Computer Science side is concerned with translating linguistic knowledge and domain expertise into computer programs with the help of sub-fields such as Artificial Intelligence.
Let’s Talk Science
Scientific advancements in NLP can be divided into rule-based systems, classical machine learning models, and most recently, deep learning models.
- Rule-based systems rely heavily on crafting domain-specific rules (e.g. regular expressions). They can automate simple tasks such as extracting structured data (e.g. dates, names) from unstructured data (e.g. webpages, emails). However, due to the complexity of human languages, rule-based systems aren’t robust, are hard to maintain, and can’t generalize across different domains.
- Classical machine learning approaches can solve more challenging problems (e.g. spam detection), using feature engineering (e.g. bag of words, part-of-speech tags) to build machine learning models (e.g. Naive Bayes). These models exploit systematic patterns in training data and can make predictions for unseen data.
- Deep learning models are currently the most popular in NLP research and applications. They generalize even better than classical machine learning approaches, and they don’t need hand-crafted features or feature engineering because they automatically work as feature extractors, enabling end-to-end model training. Deep learning models’ learning capabilities are more powerful than those of shallow/classical ML models, which paved the way to achieving the highest scores on various challenging NLP tasks (e.g. machine translation).
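As a minimal illustration of the rule-based approach described above, here is a sketch that uses a hand-crafted regular expression to pull dates out of unstructured text (the pattern and sample text are invented for this example):

```python
import re

# A hand-crafted, domain-specific rule: match dates like
# "12/05/2021" or "2021-06-30".
DATE_PATTERN = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")

def extract_dates(text):
    """Extract date-like strings from unstructured text using one rule."""
    return DATE_PATTERN.findall(text)

email = "Meeting moved to 12/05/2021; the deadline stays 2021-06-30."
print(extract_dates(email))  # ['12/05/2021', '2021-06-30']
```

Note how brittle this is: a date written as “May 12, 2021” silently slips through, which is exactly the maintenance and generalization problem with rule-based systems.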
How Do Computers Understand Text?
Computers understand numbers but can’t understand characters, words, or sentences, so an intermediate step is needed before building NLP models: text representation. I will focus on word-level representations, as they are easy to understand; other representation techniques, such as character- or subword-level representations, are also used.
In the classical NLP/ML era (before deep learning), text representation techniques were mainly built on a basic idea: one-hot encodings, where a sentence is represented by a matrix of shape (N x N), with N being the number of unique tokens in the sentence. For example, the sentence (the cat sat on the mat) is represented as a set of sparse vectors (mostly zeroes). This approach has two significant drawbacks:
- Huge memory requirements, because of the sparse representation matrix.
- Lack of semantic understanding: it can’t capture relationships between words (e.g. school and book).
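To make the one-hot idea concrete, here is a short sketch (plain Python, no NLP library) that builds the one-hot vectors for the example sentence:

```python
def one_hot_encode(sentence):
    """Represent each unique token as a sparse one-hot vector."""
    tokens = sentence.lower().split()
    vocab = sorted(set(tokens))                  # N unique tokens
    index = {word: i for i, word in enumerate(vocab)}
    # One N-dimensional vector per unique token: a single 1, the rest 0.
    return {word: [1 if i == index[word] else 0 for i in range(len(vocab))]
            for word in vocab}

vectors = one_hot_encode("the cat sat on the mat")
print(vectors["cat"])  # [1, 0, 0, 0, 0] — sparse: a single 1 among N entries
```

Every pair of distinct one-hot vectors is orthogonal, so “school” and “book” look exactly as unrelated as any other two words, which is the semantic drawback noted above.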
In 2013, researchers from Google introduced a new model for text representation, which was revolutionary in NLP, named word2vec [Mikolov et al., 2013]. This shallow neural network can represent words as dense vectors and capture the semantic relationship between related terms (e.g. Paris and France, Madrid and Spain). Further research has built on top of word2vec, such as GloVe [Pennington et al., 2014] and fastText [Bojanowski et al., 2016].
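word2vec’s skip-gram variant learns those dense vectors by training a network to predict each word’s neighbours. The (center, context) training pairs it consumes can be sketched in a few lines; the window size and sentence here are illustrative, and the actual neural training step (handled by libraries such as gensim) is omitted:

```python
def skipgram_pairs(sentence, window=2):
    """Generate (center, context) training pairs as in skip-gram word2vec."""
    tokens = sentence.lower().split()
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the center is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat")[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
```

Because words that appear in similar contexts generate similar training pairs, the learned vectors for related terms end up close together — which is how word2vec captures the Paris–France / Madrid–Spain kind of relationship.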
In late 2018, researchers from Google, again, came up with another model (BERT), which is considered the basis for state-of-the-art NLP research nowadays [Devlin et al., 2019], entirely based on the Transformer architecture [Vaswani et al., 2017].
Tasks & Research
Let’s look at some NLP tasks and categorize them based on the research progress for the English language (read: #BenderRule).
1. Mostly Solved:
- Text Classification (e.g. spam detection in Gmail).
- Part of Speech (POS) tagging: Given a sentence, determine the POS tag for each word (e.g. NOUN, VERB, ADV, ADJ).
- Named Entity Recognition (NER): Given a sentence, determine named entities (e.g. person names, locations, organizations).
2. Making Solid Progress:
- Sentiment Analysis: Given a sentence, determine its polarity (e.g. positive, negative, neutral) or emotions (e.g. happy, sad, surprised, angry).
- Co-reference Resolution: Given a sentence, determine which words (“mentions”) refer to the same objects (“entities”), for example: (Manning is a great NLP professor; he has worked in the field for over two decades).
- Word Sense Disambiguation (WSD): Many words have more than one meaning; we have to select the meaning that makes the most sense based on the context (e.g. in “I went to the bank to get some money”, bank means a financial institution, not the land beside a river).
- Machine Translation (e.g. Google Translate).
3. Still Challenging:
- Dialogue agents and chat-bots, especially open-domain ones.
- Question Answering.
- Abstractive Summarization.
- NLP for low-resource languages (e.g. African languages, see Masakhane and ∀ et al., 2020).
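To give a feel for what a sentiment analysis system outputs, here is a deliberately naive lexicon-based sketch; the tiny word lists are invented for illustration, and real systems use trained classifiers as described earlier:

```python
# Tiny illustrative lexicons — real systems learn these signals from data.
POSITIVE = {"great", "good", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def naive_sentiment(sentence):
    """Classify polarity by counting lexicon hits (toy baseline)."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(naive_sentiment("I love this product, it is great!"))  # positive
```

This baseline fails on negation (“not good”) and sarcasm, which is precisely why sentiment analysis is framed as a machine learning classification task rather than a rule-based one.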
NLP Live Demos
- SpaCy Named Entity Recognition
- SpaCy Semantic Similarity
- AllenNLP Sentiment Analysis
- AllenNLP Text to SQL
- HuggingFace Write with a Transformer
- HuggingFace Multi-Task model
- Stanford CoreNLP Sentiment Analysis
An Online, Comprehensive Study Plan
1. Programming & Mathematics:
- Learn to code in Python (Udacity course).
- Mathematical foundations (Linear Algebra, Probability, Statistics); the Khan Academy tracks will be sufficient. I discussed how to study math for machine learning in more detail in this article.
2. Machine/Deep Learning:
- Classical Machine Learning (Andrew Ng’s class on Coursera; solve the assignments using Python).
- Deep Learning (Andrew Ng’s specialization on Coursera).
3. Natural Language Processing:
- Classical NLP: although not very widely used nowadays, it covers some essential concepts and timeless techniques (Jurafsky and Manning’s class).
- Stanford CS224N: NLP with Deep Learning.
- CMU CS11–747: Neural Nets for NLP.
- CMU Low Resource NLP Bootcamp.
4. Linguistics:
- Emily Bender’s Linguistics books (Part I: Morphology and Syntax, Part II: Semantics and Pragmatics).
- Linguistics Crash Course on YouTube.
- Universiteit Leiden: Miracles of Human Language: An Introduction to Linguistics, an online class on Coursera.
5. Miscellaneous Topics:
- A collection of resources for Ethics in NLP.
- Energy consumption and environmental issues: Energy and Policy Considerations for Deep Learning in NLP, Strubell et al., 2019.
- On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, Bender, Gebru et al., 2021.
- Will large-scale pretrained models solve language? NLP’s Clever Hans Moment has Arrived.
More into Books?
- Dan Jurafsky and James H. Martin, Speech and Language Processing.
- Jacob Eisenstein’s Natural Language Processing.
- Yoav Goldberg’s tutorial, A Primer on Neural Network Models for Natural Language Processing.
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning.
Let’s Hack Some Code!
Now that we have covered what NLP is and the science behind it, let’s get to the practical part. Widely used open-source libraries such as spaCy, AllenNLP, Hugging Face Transformers, and Stanford CoreNLP (all featured in the demos above) are great starting points for your next project.
I hope this piece gives you a decent general understanding of such an exciting field. If you have any suggestions or questions, kindly leave them in the responses or reach out to me [ibrahimsharaf.github.io].
Bio: Ibrahim Sharaf ElDen (@_Sharraf) is a Research Engineer at Mawdoo3.com, operating somewhere at the intersection of SWE, ML, and NLP. He is also a writer at Towards Data Science. See his GitHub projects here.
Original. Reposted with permission.