Introduction to Natural Language Processing (NLP)
Have you ever wondered how your personal assistant (e.g., Siri) is built? Do you want to build your own? Perfect! Let’s talk about Natural Language Processing.
By Ibrahim Sharaf ElDen, Research Engineer at Mawdoo3.com
So, What is NLP?
NLP is an interdisciplinary field concerned with the interactions between computers and human natural languages (e.g., English), whether spoken or written. NLP-powered software helps us in our daily lives in various ways, for example:
- Personal assistants: Siri, Cortana, and Google Assistant.
- Auto-complete: In search engines (e.g., Google, Bing).
- Spell checking: Almost everywhere: in your browser, your IDE (e.g., Visual Studio), and desktop apps (e.g., Microsoft Word).
- Machine Translation: Google Translate.
Okay, now we get it: NLP plays a major role in our daily computer interactions. Let’s look at some more business-related NLP use cases:
- If you run a bank or a restaurant with a heavy load of customer orders and complaints, handling them manually is tiresome, repetitive, and inefficient in terms of time and labor. You can build a chat-bot for your business that automates this process and reduces the need for human interaction.
- Apple will soon launch the new iPhone 11 and will be interested in knowing what users think of it. The company can monitor social media channels (e.g., Twitter), extract iPhone 11 related tweets, reviews, and opinions, then use sentiment analysis models to predict whether users’ reviews are positive, negative, or neutral.
NLP combines two fields: Linguistics and Computer Science.
The Linguistics side is concerned with language: its formation, syntax, meaning, the different kinds of phrases (noun or verb), and so on.
The Computer Science side is concerned with applying linguistic knowledge, by transforming it into computer programs with the help of sub-fields such as Artificial Intelligence (Machine Learning & Deep Learning).
Let’s Talk Science!
Scientific advancements in NLP can be divided into three categories: rule-based systems, classical Machine Learning models, and Deep Learning models.
- Rule-based systems rely heavily on hand-crafted, domain-specific rules (e.g., regular expressions). They can solve simple problems such as extracting structured data (e.g., emails) from unstructured data (e.g., web pages), but due to the complexity of human natural languages, they fail to produce models that can really reason about language.
- Classical Machine Learning approaches can solve harder problems that rule-based systems can’t solve very well (e.g., spam detection). They rely on a more general approach to understanding language: hand-crafted features (e.g., sentence length, part-of-speech tags, occurrence of specific words) are fed to a statistical machine learning model (e.g., Naive Bayes), which learns patterns from the training set and is then able to reason about unseen data (inference).
- Deep Learning models are the hottest part of NLP research and applications right now. They generalize even better than classical machine learning approaches, and they don’t need hand-crafted features because they act as automatic feature extractors, which has helped a lot in building end-to-end models (with little human interaction). Feature engineering aside, the learning capabilities of deep learning algorithms are more powerful than those of shallow/classical ML ones, which paved the way to the highest scores on various hard NLP tasks (e.g., Machine Translation).
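To make the first two categories more concrete, here is a minimal sketch using only the Python standard library. The email pattern is deliberately simplified, the training sentences are made up for illustration, and the function names (extract_emails, train_naive_bayes, predict) are hypothetical helpers rather than part of any library. A real spam filter would need far more data and richer features.

```python
import math
import re
from collections import Counter, defaultdict

# --- Rule-based: a hand-written pattern for extracting email addresses ---
# Deliberately simplified; real address syntax is far more permissive.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return every substring of `text` that matches EMAIL_PATTERN."""
    return EMAIL_PATTERN.findall(text)


# --- Classical ML: multinomial Naive Bayes with add-one smoothing ---
def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(examples):
    """Count labels and per-label word occurrences from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for this example.
TRAINING_DATA = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch with the team tomorrow", "ham"),
    ("please review the project report", "ham"),
]

model = train_naive_bayes(TRAINING_DATA)
print(extract_emails("Contact sales@example.com or support@example.org today"))
print(predict(model, "free prize inside"))
print(predict(model, "project meeting tomorrow"))
```

The rule-based part only finds what its pattern anticipates, while the Naive Bayes part learns word statistics from labeled examples and applies them to sentences it has never seen.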
How Does a Computer Understand Text?
We know that computers only understand numbers, not characters, words, or sentences, so an intermediate step is needed before building NLP models: text representation. I will focus on word-level representations, as they are the most widely used and the most intuitive to start with (other options include bit, character, sub-word, and sentence-level representations).
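As a rough illustration of that intermediate step, the sketch below turns sentences into lists of integers at the word level. The tokenizer is a naive regular expression, and the helper names (build_vocabulary, encode) and the unknown_id convention are assumptions made for this example, not a standard API.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(sentences):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocabulary = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def encode(sentence, vocabulary, unknown_id=-1):
    """Map each token to its id; tokens outside the vocabulary get unknown_id."""
    return [vocabulary.get(token, unknown_id) for token in tokenize(sentence)]


corpus = ["The student reads a book", "The school has a library"]
vocabulary = build_vocabulary(corpus)
encoded = encode("The student has a book", vocabulary)
print(vocabulary)
print(encoded)
```

These integer ids carry no meaning by themselves; they are only an index into whichever representation is built on top of them.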
- In the traditional NLP era (before deep learning), text representation was built on a basic idea: one-hot encoding, where a sentence is represented as a matrix of shape (NxN), N being the number of unique tokens in the sentence. For example, in the picture above, each word is represented as a sparse vector that is all zeroes except for one cell (which holds a one, or the number of occurrences of the word in the sentence). This approach has two major drawbacks. The first is memory consumption, since the representation is hugely sparse. The second is its lack of meaning representation: it can’t capture similarities between words (e.g., school and book).
- In 2013, researchers from Google (led by Tomas Mikolov) introduced a new model for text representation called word2vec, which was revolutionary in NLP. It is a shallow neural network model that represents words as dense vectors and captures semantic relationships between related terms (e.g., Paris and France, Madrid and Spain). Further research, such as GloVe and fastText, has built on top of word2vec.
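The difference between the two representations can be sketched in a few lines of Python. The one-hot part follows the description above. The dense vectors, on the other hand, are hypothetical numbers invented for this example; they are not the output of word2vec or any trained model, and only show what a similarity-preserving representation makes possible.

```python
import math


def one_hot_vectors(tokens):
    """Give each unique token a vector of length N with a single 1 in it."""
    unique = sorted(set(tokens))
    size = len(unique)
    return {
        token: [1 if column == row else 0 for column in range(size)]
        for row, token in enumerate(unique)
    }


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


tokens = ["school", "book", "banana"]
one_hot = one_hot_vectors(tokens)

# Hypothetical 3-dimensional dense vectors, made up for illustration only.
dense = {
    "school": [0.9, 0.8, 0.1],
    "book": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

# Any two distinct one-hot vectors are orthogonal, so their similarity is zero.
print(cosine_similarity(one_hot["school"], one_hot["book"]))

# Dense vectors can place related words closer together than unrelated ones.
print(cosine_similarity(dense["school"], dense["book"]))
print(cosine_similarity(dense["school"], dense["banana"]))
```

With one-hot vectors every pair of different words looks equally unrelated, which is the second drawback mentioned above. Dense vectors leave room for "school" to be closer to "book" than to "banana".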
Tasks & Research
Let’s take a look at some NLP tasks and categorize them based on the research progress for each task.
1) Mostly solved:
- Spam Detection (e.g., Gmail).
- Part of Speech (POS) tagging: Given a sentence, determine POS tags for each word (e.g., NOUN, VERB, ADV, ADJ).
- Named Entity Recognition (NER): Given a sentence, determine named entities (e.g., person names, locations, organizations).
2) Making good progress:
- Sentiment Analysis: Given a sentence, determine its polarity (e.g., positive, negative, neutral) or emotions (e.g., happy, sad, surprised, angry).
- Co-reference Resolution: Given a sentence, determine which words (“mentions”) refer to the same objects (“entities”). For example, in “Manning is a great NLP professor, he worked in academia for over 25 years”, both “Manning” and “he” refer to the same person.
- Word Sense Disambiguation (WSD): Many words have more than one meaning, so we have to select the meaning that makes the most sense in context. For example, in “I went to the bank to get some money”, bank means a financial institution, not the land beside a river.
- Machine Translation (e.g., Google Translate).
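To give a feel for what Word Sense Disambiguation involves, here is a toy sketch in the spirit of the simplified Lesk algorithm: it picks the sense whose dictionary gloss shares the most words with the sentence. This technique is swapped in for illustration and is not one the article prescribes. The two glosses, the stop-word list, and the function names are all hand-written assumptions; a real system would use a lexical resource such as WordNet or a trained model.

```python
import re

# Small hand-picked stop-word list, assumed for this example.
STOP_WORDS = {"a", "an", "the", "to", "of", "or", "and", "i", "some", "where", "is", "in"}

# Hypothetical, hand-written glosses for two senses of "bank".
SENSES = {
    "financial institution": "an institution where people deposit, borrow, and withdraw money",
    "river side": "the sloping land beside a river or a lake",
}


def content_words(text):
    """Return the set of lowercase words in `text`, minus stop words."""
    return {word for word in re.findall(r"[a-z]+", text.lower()) if word not in STOP_WORDS}


def disambiguate(sentence, senses):
    """Pick the sense whose gloss overlaps most with the sentence's words."""
    context = content_words(sentence)
    overlaps = {
        sense: len(context & content_words(gloss))
        for sense, gloss in senses.items()
    }
    # On a tie, max() keeps the first sense in dictionary order.
    return max(overlaps, key=overlaps.get)


print(disambiguate("I went to the bank to get some money", SENSES))
print(disambiguate("We sat on the bank of the river and watched the lake", SENSES))
```

Word overlap is a blunt signal: a sentence that shares no words with either gloss falls back to the first sense, which is one reason this task is not listed as solved.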
3) Still a bit hard:
- Dialogue agents and chat-bots, especially open-domain ones.
- Question Answering.
- NLP for low-resource languages.
NLP Online Demos
- spaCy Named Entity Recognition: https://explosion.ai/demos/displacy-ent
- spaCy Semantic Similarity: https://explosion.ai/demos/sense2vec
- AllenNLP Sentiment Analysis: https://demo.allennlp.org/sentiment-analysis
- AllenNLP Text to SQL: https://demo.allennlp.org/atis-parser
- All AllenNLP demos: https://demo.allennlp.org
- HuggingFace Write with a Transformer: https://transformer.huggingface.co
- HuggingFace Multi-Task model: https://huggingface.co/hmtl
- Stanford CoreNLP Sentiment Analysis: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
- Mawdoo3 Arabic NLP demos (Disclaimer: I work there :D): https://ai.mawdoo3.com/apis
A Comprehensive Study Plan
- Learn to code in Python (Udacity course).
- Study the needed mathematical foundations (Linear Algebra, Probability, Statistics); the Khan Academy tracks will be sufficient.
- Traditional NLP: although it’s not very widely used nowadays, it covers some important concepts and techniques that you’ll need later (Jurafsky & Manning course).
- Classical Machine Learning (Andrew Ng class on Coursera, solve the assignments using Python).
- Deep Learning (Andrew Ng specialization on Coursera).
- NLP with Deep Learning (Stanford CS224n class).
- Fast.ai Code first intro to NLP.
- Stanford Natural Language Understanding.
- CMU Neural Nets for NLP.
- Oxford Deep Learning for NLP.
If you are into books
- Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft)
- Jacob Eisenstein. Natural Language Processing
- Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning
Let’s Hack Some Code!
Now that we have covered what NLP is, the science behind it, and how to study it, let’s get to the practical part. Here’s a list of the most widely used open-source libraries to use in your next project.
- spaCy (https://github.com/explosion/spaCy)
- Gensim (https://github.com/RaRe-Technologies/gensim)
- AllenNLP (https://github.com/allenai/allennlp)
- TextBlob (https://github.com/sloria/TextBlob)
- NLTK (https://github.com/nltk/nltk)
- CoreNLP (https://github.com/stanfordnlp/CoreNLP)
So that was an end-to-end introduction to Natural Language Processing. I hope it helps, and if you have any suggestions, please leave them in the responses. Cheers!
Bio: Ibrahim Sharaf ElDen (@_Sharraf) is a Research Engineer at Mawdoo3.com, and operates somewhere at the intersection of SWE, ML, and NLP. He is also a writer at Towards Data Science. See his GitHub projects here.
Original. Reposted with permission.