Deep Learning for NLP: An Overview of Recent Trends
A new paper discusses some of the recent trends in deep learning based natural language processing (NLP) systems and applications. The focus is on the review and comparison of models and methods that have achieved state-of-the-art (SOTA) results on various NLP tasks and some of the current best practices for applying deep learning in NLP.
Recurrent Neural Network (RNN)
RNNs are specialized neural-based approaches that are effective at processing sequential information. An RNN recursively applies a computation to every instance of an input sequence conditioned on the previous computed results. These sequences are typically represented by a fixed-size vector of tokens which are fed sequentially (one by one) to a recurrent unit. The figure below illustrates a simple RNN framework below.
The main strength of an RNN is the capacity to memorize the results of previous computations and use that information in the current computation. This makes RNN models suitable to model context dependencies in inputs of arbitrary length so as to create a proper composition of the input. RNNs have been used to study various NLP tasks such as machine translation, image captioning, and language modeling, among others.
As it compares with a CNN model, an RNN model can be similarly effective or even better at specific natural language tasks but not necessarily superior. This is because they model very different aspects of the data, which only makes them effective depending on the semantics required by the task at hand.
The input expected by a RNN are typically one-hot encodings or word embeddings, but in some cases they are coupled with the abstract representations constructed by, say, a CNN model. Simple RNNs suffer from the vanishing gradient problem which makes it difficult to learn and tune the parameters in the earlier layers. Other variants, such as long short-term memory (LSTM) networks, residual networks (ResNets), and gated-recurrent networks (GRU) were later introduced to overcome this limitation.
RNN Variants: An LSTM consist of three gates (input, forget, and output gates), and calculate the hidden state through a combination of the three. GRUs are similar to LSTMs but consist of only two gates and are more efficient because they are less complex. A study shows that it is difficult to say which of the gated RNNs are more effective, and they are usually picked based on the computing power available. Various LSTM-based models have been proposed for sequence to sequence mapping (via encoder-decoder frameworks) that are suitable for machine translation, text summarization, modeling human conversations, question answering, image-based language generation, among other tasks.
Overall, RNNs are used for many NLP applications such as:
- Word-level classification (e.g., NER)
- Language modeling
- Sentence-level classification (e.g., sentiment polarity)
- Semantic matching (e.g., match a message to candidate response in dialogue systems)
- Natural language generation (e.g., machine translation, visual QA, and image captioning)
Essentially, the attention mechanism is a technique inspired from the need to allow the decoder part of the above-mentioned RNN-based framework to use the last hidden state along with information (i.e., context vector) calculated based on the input hidden state sequence. This is particularly beneficial for tasks that require some alignment to occur between the input and output text.
Attention mechanisms have been used successfully in machine translation, text summarization, image captioning, dialogue generation, and aspect-based sentiment analysis. Various different forms and types of attention mechanisms have been proposed and they are still an important research area for NLP researchers investigating various applications.
Recursive Neural Network
Similar to RNNs, recursive neural networks are natural mechanisms to model sequential data. This is so because language could be seen as a recursive structure where words and sub-phrases compose other higher-level phrases in a hierarchy. In such structure, a non-terminal node is represented by the representation of all its children nodes. The figure below illustrates a simple recursive neural network below.
In the basic recursive neural network form, a compositional function (i.e., network) combines constituents in a bottom-up approach to compute the representation of higher-level phrases (see figure above). In a variant, MV-RNN, words are represented by both a matrix and a vector, meaning that the parameters learned by the network represent the matrices of each constituent (word or phrase). Another variation, recursive neural tensor network (RNTN), enables more interaction between input vectors to avoid large parameters as is the case for MV-RNN. Recursive neural networks show flexibility and they have been coupled with LSTM units to deal with problems such as gradient vanishing.
Recursive neural networks are used for various applications such as:
- Leveraging phrase-level representations for sentiment analysis
- Semantic relationships classification (e.g., topic-message)
- Sentence relatedness
Reinforcement learning encompass machine learning methods that train agents to perform discrete actions followed by a reward. Several natural language generation (NLG) tasks, such as text summarization, are being investigated by employing reinforcement learning.
The applications of reinforcement learning on NLP problems are motivated by a few problems. When using RNN-based generators, ground-truth tokens are replaced by tokens generated by the model which quickly increases the error rates. Moreover, with such models, the word-level training objective differs from the test metric, such as n-gram overlap measure, BLEU, used in machine translation and dialogue systems. Due to this discrepancy, current NLG-type systems tend to generate incoherent, repetitive, and dull information.
To address the problems mentioned above, a reinforcement algorithm called REINFORCE was employed to address NLP tasks such as image captioning and machine translation. This reinforcement learning framework consists of an agent (RNN-based generative model) which interacts with the external environment (input words and context vectors seen at every time step). The agent picks an action based on a policy (parameters) which involves predicting the next word of a sequence at each time step. The agent then updates its internal state (hidden units of RNN). This continues until arriving at the end of the sequence where a reward is finally calculated. Reward functions vary by task; for instance, in a sentence generation task, a reward could be information flow.
Even though reinforcement learning methods show promising results, they require proper handling of the action and state space, which may limit the expressive power and learning capacity of the models. Keep in mind that standalone RNN-based models strive on their expressive power and their natural capability to model language.
Adversarial training has also been used to train language generators, where the objective is to fool a discriminator trained to distinguish generated sequences from real ones. Consider a dialogue system, with a policy gradient it is possible to frame the task under a reinforcement learning paradigm, where the discriminator acts like a human Turing tester. The discriminator is essentially trained to discriminate between human and machine-generated dialogues.
Unsupervised sentence representation learning involves mapping sentences to fixed-size vectors in an unsupervised manner. The distributed representations capture semantic and syntactic properties from language and are trained using an auxiliary task.
Similar to the algorithms used to learn word embeddings, a skip-thought model was proposed, where the task is to predict the next adjacent sentences based on a center sentence. This model is trained using the seq2seq framework where the decoder generate target sequences and the encoder is seen as a generic feature extractor — even word embeddings were learned in the process. The model essentially learns a distributed representation for the input sentences, analogous to how word embeddings were learned for every word in previous language modeling techniques.
Deep Generative Models
Deep generative models, such as variational autoenconders (VAEs) and generative adversarial networks (GANs), are also applied in NLP to discover rich structure in natural language through the process of generating realistic sentences from a latent code space.
It is well known that standard sentence autoencoders fail to generate realistic sentences due to the unconstrained latent space. VAEs impose a prior distribution on the hidden latent space, enabling a model to generate proper samples. VAEs consist of encoder and generator networks which encode an input into a latent space and then generate samples from the latent space. The training objective is to maximize a variational lower bound on the log likelihood of observed data under the generative model. The figure below illustrates an RNN-based VAE for sentence generation.
Generative models are useful for many NLP tasks and they are flexible in nature. For instance, an RNN-based VAE generative model was proposed to produce more diverse and well-formed sentences as compared to the standard autoencoders. Other models allowed structured variables (e.g., tense and sentiment) to be incorporated into the latent code, to generate plausible sentences.
GANs, composed of two competing networks — generator and discriminator — , have also been used to generate realistic text. For instance, an LSTM was used as the generator and a CNN was used as the discriminator which discriminated between real data and generated samples. The CNN represents a binary sentence classifier in this case. The model was able to generate realistic text after the adversarial training.
Besides the problem that gradients from the discriminator cannot properly back-propagate through discrete variables, deep generative models are also difficult to evaluate. Many solutions have been proposed over the recent years but these have not yet been standardized.
The hidden vectors accessed by the attention mechanism during the token generation phase represent the model’s “internal memory”. Neural networks can also be coupled with some form of memory to solve tasks such as visual QA, language modeling, POS tagging, and sentiment analysis. For example, to solve QA tasks, supporting facts or common sense knowledge are provided to the model as a form of memory. Dynamic memory networks, an improvement over previous memory-based models, employed neural networks models for input representation, attention, and answering mechanisms.
So far we now the capacity and effectiveness of neural-based models such as CNN and RNNs. We are also aware of the possibilities to apply reinforcement learning, unsupervised methods, and deep generative models to complex NLP tasks such as visual QA and machine translation. Attention mechanisms and memory-augmented networks are powerful at extending the capacity of neural-based NLP models. Combining all these powerful techniques provide compelling methods to understand the complexity of language. The next wave of language-based learning algorithms promises even greater abilities such as common sense and modeling complex human behaviors.
Reference: “Recent Trends in Deep Learning Based Natural Language Processing” — Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. IEEE Computational Intelligence Magazine, 2018.
Original. Reposted with permission.
- Detecting Sarcasm with Deep Convolutional Neural Networks
- Text Classification & Embeddings Visualization Using LSTMs, CNNs, and Pre-trained Word Vectors
- Emotion and Sentiment Analysis: A Practitioner’s Guide to NLP