

Why Deep Learning is perfect for NLP (Natural Language Processing)


Deep learning brings multiple benefits in learning multiple levels of representation of natural language. Here we will cover the motivation of using deep learning and distributed representation for NLP, word embeddings and several methods to perform word embeddings, and applications.



Sponsored Post.
By Wei Di, Anurag Bhardwaj & Jianing Wei

This tutorial is an excerpt from "Deep Learning Essentials" by Wei Di, Anurag Bhardwaj, Jianing Wei and published by Packt. Use the code ORKDNA10 at checkout to get the recommended eBook for just $10 until May 31, 2018.

 

Motivation and distributed representation

 
As in many other cases, the representation of the data, that is, how the information is encoded and presented to machine learning algorithms, is often the most important and fundamental part of any learning or AI pipeline. The effectiveness and scalability of the representation largely determine the performance of the downstream machine learning model and application.

As mentioned in the previous section, traditional NLP often uses one-hot encoding to represent words in a fixed vocabulary and uses a BoW to represent documents. Such an approach treats each word, for example house, road, or tree, as an atomic symbol. One-hot encoding generates representations like [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0], where the length of the representation is the size of the vocabulary. With such a representation, one often ends up with huge sparse vectors; in a typical speech application, for instance, the vocabulary size can range from 20,000 to 500,000. This approach has an obvious problem: the relationship between any pair of words is ignored. For example, the dot product of motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] and hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] is 0. Moreover, the encodings are arbitrary; in one setting, cat may be represented as Id321 and dog as Id453, meaning the 453rd entry of the long sparse vector is 1. Such a representation provides no useful information to the system regarding the interactions or similarities that may exist between individual symbols.

This makes learning difficult, because the model cannot leverage much of what it has learned about cat when it is processing data regarding dog. Discrete IDs therefore separate the actual semantic meaning of a word from its representation. Although some statistical information can be calculated at the document level, information at the atomic level is extremely limited. This is where distributed vector representation, and deep learning in particular, comes to help.

Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction.

There are multiple benefits we get from using deep learning for NLP problems:

  • Features are learned directly from the data or the problem, which improves on the incompleteness and over-specification of hand-crafted features. Handcrafting features is often very time consuming and may need to be repeated for each task or domain-specific problem. Features engineered for one field often show little generalization ability toward other domains or areas. By contrast, deep learning learns information from the data and builds representations across multiple levels, in which the lower levels correspond to more general information that other areas can leverage directly or after fine-tuning.
  • Learning features that are not mutually exclusive can be exponentially more efficient than nearest-neighbor-like or clustering-like models. Atomic symbol representations do not capture any semantic interrelationship between words. With words treated independently, NLP systems can be incredibly fragile. A distributed representation that captures these similarities in a finite vector space gives the downstream NLP system the opportunity to perform more complex reasoning and knowledge derivation.
  • Learning can be done unsupervised. Given the current scale of data, there is a great need for unsupervised learning, since it is often not realistic to acquire labels in many practical cases.
  • Deep learning learns multiple levels of representation. This is one of the most important advantages of deep learning: the learned information is constructed level by level through composition, and the lower levels of representation can often be shared across tasks.
  • It naturally handles the recursivity of human language. Human sentences are composed of words and phrases with a certain structure. Deep learning, especially recurrent neural models, can capture this sequence information far more effectively.

 

Word embeddings

 
The fundamental idea of distributional similarity-based representations is that a word can be represented by means of its neighbors. As J. R. Firth put it (1957: 11):

You shall know a word by the company it keeps

This is perhaps one of the most successful ideas of modern statistical NLP. The definition of neighbors can vary to take into account either local or larger contexts, yielding a more syntactic or a more semantic representation.
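To make Firth's idea concrete, here is a toy sketch (our own illustration, not from the book) that counts the words appearing within a ±1 window of each word. Even with raw counts, cat and dog end up with identical neighbor statistics, hinting at how context can expose similarity:

```python
# Count context words within a +/-1 window for each word in a tiny corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 1
neighbors = defaultdict(Counter)
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            neighbors[word][corpus[j]] += 1

# "cat" and "dog" keep the same company ("the", "sat"), so their
# context statistics alone already suggest they are similar words.
print(neighbors["cat"])
print(neighbors["dog"])
```

Real distributional models start from statistics like these (or predictive objectives over the same windows) rather than hand-assigned IDs.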

Idea of word embeddings

First of all, a word is represented as a dense vector. A word embedding can be thought of as a mapping function from words to an n-dimensional space:

W : words → Rⁿ

in which W is a parameterized function mapping words in some language to high-dimensional vectors (for example, vectors with 200 to 500 dimensions). You may also consider W as a lookup table of size V × N, where V is the size of the vocabulary and N is the size of the dimension, with each row corresponding to one word. For example, we might find:

W("dog")=(0.1, -0.5, 0.8, ...)
W("mat")=(0.0, 0.6, -0.1, ...)

Here, W is often initialized with random vectors for each word; we then let the network learn and update W in order to perform some task.
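The lookup-table view can be sketched in a few lines of numpy (a minimal illustration of our own; the vocabulary and dimensions are made up, not the book's code):

```python
# W as a V x N lookup table, randomly initialized; embedding a word
# is just selecting its row.
import numpy as np

vocab = ["a", "dog", "barks", "at", "strangers", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

V, N = len(vocab), 4             # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))      # one row per word

def embed(word):
    """W(word): look up the word's row in the embedding matrix."""
    return W[word_to_id[word]]

print(embed("dog"))              # a dense N-dimensional vector
```

Training then consists of updating the rows of W (via backpropagation) so that the vectors become useful for the task at hand.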

For example, we can train the network to predict whether an n-gram (a sequence of n words) is valid. Say we have the sequence of words a dog barks at strangers, and we take this as an input with a positive label (meaning valid). We then replace one of the words in this sentence with a random word, transforming it into a cat barks at strangers, and label it as negative, on the assumption that the corrupted 5-gram is almost certainly nonsensical:

Image

As shown in the preceding figure, we train the model by feeding the n-gram through the lookup matrix W to get a vector representing each word. The vectors are then combined through the output neuron, and its result is compared with the target value. A perfect prediction would yield the following:

R(W("a"), W("dog"), W("barks"), W("at"), W("strangers")) = 1
R(W("a"), W("cat"), W("barks"), W("at"), W("strangers")) = 0

The differences/errors between the target value and the prediction are used to update W and R (the aggregation function, for example, sum).
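The whole training idea above can be sketched end to end in numpy. This is a simplified construction of our own (sum aggregation followed by a single logistic output unit, plain SGD), not the book's code, but it implements the same corrupted n-gram objective:

```python
# Train embeddings by scoring valid vs. corrupted 5-grams:
# R = sum of word vectors, followed by a logistic output neuron.
import numpy as np

vocab = ["a", "dog", "cat", "barks", "at", "strangers"]
idx = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(42)
N = 8
W = rng.normal(scale=0.1, size=(len(vocab), N))   # embedding lookup table
w_out = rng.normal(scale=0.1, size=N)             # output neuron weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(ngram):
    """R(W(w1), ..., W(wn)): sum the embeddings, then a logistic unit."""
    return sigmoid(w_out @ W[[idx[w] for w in ngram]].sum(axis=0))

examples = [("a dog barks at strangers".split(), 1.0),   # valid 5-gram
            ("a cat barks at strangers".split(), 0.0)]   # corrupted 5-gram

lr = 0.5
for _ in range(200):
    for ngram, label in examples:
        ids = [idx[w] for w in ngram]
        h = W[ids].sum(axis=0)                 # aggregation R = sum
        err = sigmoid(w_out @ h) - label       # logistic-loss gradient
        g_out, g_emb = err * h, err * w_out
        w_out -= lr * g_out
        for i in ids:                          # update each word's row of W
            W[i] -= lr * g_emb

print(f"valid:     {score(examples[0][0]):.2f}")  # should approach 1
print(f"corrupted: {score(examples[1][0]):.2f}")  # should approach 0
```

With only two examples the model memorizes rather than generalizes, but the mechanics, looking up rows of W, aggregating, scoring, and pushing the error back into the embeddings, are the same ones used at scale.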

The learned word embeddings have some interesting properties.

First, the location of the word representations in the high-dimensional space is determined by their meanings, such that words with close meanings are clustered together:

Image

Linear relationship between embeddings learned from the model

Second, and even more interestingly, word vectors exhibit linear relationships. The relationship between two words can be thought of as the direction and distance formed by the pair. For example, starting from the location of the word king and moving the same distance and direction as from man to woman, one arrives at the word queen, that is:

[king] - [man] + [woman] ≈ [queen]

Researchers found that, when trained on a large amount of data, the resulting vectors can reflect very subtle semantic relationships, such as that between a city and the country it belongs to; for example, France is to Paris as Germany is to Berlin.

Another example is finding a word that is similar to small in the same sense as biggest is similar to big. One can simply compute the vector X = vector(biggest) − vector(big) + vector(small). Many other kinds of semantic relationships can also be captured, such as opposites and comparatives. Some nice examples can be found in Mikolov et al.'s publication, Efficient Estimation of Word Representations in Vector Space (https://arxiv.org/pdf/1301.3781.pdf), as shown in the following figure:

Image

An example of the five types of semantic and nine types of syntactic questions in the Semantic-Syntactic Word Relationship test set, from Mikolov et al., Efficient Estimation of Word Representations in Vector Space.
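The analogy arithmetic is usually implemented as a nearest-neighbor search under cosine similarity. The following toy demonstration uses hand-crafted 2-dimensional vectors of our own invention (real embeddings are learned and have hundreds of dimensions), purely to show the mechanics:

```python
# king - man + woman ~= queen, resolved by cosine similarity
# over a tiny hand-crafted embedding table.
import numpy as np

emb = {  # 2 toy dimensions: roughly "royalty" and "gender"
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = emb["king"] - emb["man"] + emb["woman"]
# Exclude the query word itself, as is standard in analogy evaluation.
best = max((w for w in emb if w != "king"), key=lambda w: cosine(emb[w], target))
print(best)  # queen
```

Libraries such as gensim expose the same operation over trained embeddings (for example via its `most_similar` interface), but the computation is essentially this search.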

Advantages of distributed representation

There are many advantages to using distributed word vectors for NLP problems. With subtle semantic relationships being captured, there is great potential for improving many existing NLP applications, such as machine translation, information retrieval, and question answering systems. Some obvious advantages are:

  • Captures local co-occurrence statistics
  • Produces state-of-the-art performance on linear semantic relationships
  • Makes efficient use of statistics
  • Can train on (comparatively) little data as well as on gigantic datasets
  • Fast, since only non-zero counts matter
  • Good performance with small (100-300 dimensional) vectors, which is important for downstream tasks

Problems of distributed representation

Keep in mind that no approach can solve everything, and similarly, a distributed representation is not a silver bullet. To use it properly, we need to understand some of its known issues:

  • Similarity and relatedness are not the same: Despite the great evaluation results presented in some publications, there is no guarantee of success in practical applications. One reason is that the current standard evaluation often measures the degree of correlation with a set of word ratings created by humans. The model's representations may correlate well with human evaluation yet fail to boost performance on a specific task. This is perhaps caused by the fact that most evaluation datasets don't distinguish between word similarity and relatedness. For example, male and man are similar, whereas computer and keyboard are related but dissimilar.
  • Word ambiguity: This problem occurs when a word has multiple meanings. For example, the word bank means sloping land in addition to a financial institution. There is thus a limit to what representing each word as a single vector can capture without accounting for word ambiguity. Some approaches have been proposed to learn multiple representations for each word; for example, Trask et al. proposed a method that models multiple embeddings for each word based on supervised disambiguation (https://arxiv.org/abs/1511.06388). One can refer to such approaches when a task requires it.

Some popular pre-trained word embeddings are Word2Vec, GloVe, FastText, LexVec and Meta-Embeddings.

In the following sections, we will mainly talk about three popular ones: Word2Vec, GloVe, and FastText. In particular, we will dive deeper into Word2Vec: its core ideas, its two distinct models, the training process, and how to leverage open source pre-trained Word2Vec representations.


