Must Read NLP Papers from the Last 12 Months

The era of large language models is here.

Photo by Anil Sharma on Pexels


Since the groundbreaking release of BERT in October 2018, machine learning has achieved ever greater heights through clever optimization and augmented compute. BERT, which stands for Bidirectional Encoder Representations from Transformers, brought the transformer architecture, introduced a year earlier, into the mainstream of language modeling. The transformer has served as a significant unlock in machine learning capabilities.

Further advancements in the field of Natural Language Processing (NLP) have improved foreign language translation, enhanced no-code applications, increased the fluency of chatbots, and very quickly set new standards for an array of state-of-the-art benchmarks.

Alongside these remarkable accomplishments, the development of large language models (LLMs) has not been without controversy. In the 2021 "Stochastic Parrots" paper, a team of researchers including machine learning engineer and ethicist Timnit Gebru criticized these models for:

  • Levying a damning environmental cost
  • Excluding marginalized voices through inelegant curation of the training data set
  • Plagiarizing internet content and stealing from human writers

Gebru was summarily fired from her position on Google's Ethical Artificial Intelligence Team.


In this writeup


We explore four NLP papers published in the past year that represent the latest advancements. Understanding these developments will improve your capabilities as a Data Scientist and put you at the forefront of this dynamic research space.


1. Training Compute-Optimal Large Language Models


This paper examines the ideal model size and token count for a language model using the transformer architecture. It aims to answer the question of what constitutes the ideal number of parameters and size of dataset for a model trained under a predetermined compute budget.

The researchers found that prior LLMs appear to have been severely undertrained. The authors attribute this to a focus on scaling up model size while keeping the amount of training data roughly constant.

The authors concluded that for compute-optimal training, model size and the number of training tokens should be scaled equally. In other words,

for every doubling of model size, the number of training tokens should also be doubled.
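To make the rule concrete, here is a minimal Python sketch. It assumes the common approximation that training compute is roughly 6 × parameters × tokens, plus a fixed ratio of 20 training tokens per parameter. Neither number comes from this writeup, and the function name is hypothetical, so treat it as an illustration rather than the authors' method.

```python
import math

def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget between parameters (N) and training tokens (D).

    Assumptions (illustrative only):
      * training compute C is approximately 6 * N * D
      * D is kept at a fixed multiple of N (tokens_per_param)
    """
    # Solve C = 6 * N * (tokens_per_param * N) for N.
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

base_n, base_d = compute_optimal_allocation(1e21)
big_n, big_d = compute_optimal_allocation(4e21)  # four times the compute

# Under these assumptions both ratios come out at about 2:
# model size and token count double together.
print(big_n / base_n, big_d / base_d)
```

Because parameters and tokens grow in lockstep, each scales with the square root of the compute budget, which is why quadrupling compute doubles both.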

The research showed that a relatively small model, Chinchilla (70B parameters), trained on four times more data could consistently beat larger models (up to 530B parameters) on state-of-the-art benchmarks such as Massive Multitask Language Understanding (MMLU).

Because the model is smaller, it uses significantly less compute for inference and fine-tuning. This bodes well for downstream use.

TL;DR: this paper shows that earlier scaling laws put too much weight on model size. When trained on a sufficiently large number of tokens, smaller networks can be significantly better than larger ones.


2. Training Language Models to Follow Instructions with Human Feedback


Making LLMs bigger does not automatically improve their ability to follow user intent. As a result, LLMs may produce outputs that are untruthful or harmful.

This paper highlights a novel method for fine-tuning language models using human feedback to better align the output with user intent across a variety of tasks.

The researchers started from a collection of prompts, including prompts submitted through the OpenAI API, and gathered human-written demonstrations of the desired output. They used this data to fine-tune GPT-3 via supervised learning. Next, they collected a second dataset in which human labelers ranked model outputs. The researchers then used these rankings to further fine-tune the supervised model with reinforcement learning from human feedback, resulting in a model they called InstructGPT.
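The ranking step can be illustrated with a short Python sketch. The writeup does not give a loss function, so this assumes the pairwise logistic loss that is a common choice for training a reward model on human rankings. The function names and reward values are hypothetical.

```python
import math

def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """Return -log(sigmoid(r_preferred - r_rejected)) for one ranked pair."""
    diff = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def pairs_from_ranking(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return [
        (ranked_outputs[i], ranked_outputs[j])
        for i in range(len(ranked_outputs))
        for j in range(i + 1, len(ranked_outputs))
    ]

# Hypothetical reward scores for three outputs a labeler ranked a > b > c.
rewards = {"a": 1.2, "b": 0.3, "c": -0.8}
pairs = pairs_from_ranking(["a", "b", "c"])
mean_loss = sum(
    pairwise_ranking_loss(rewards[w], rewards[l]) for w, l in pairs
) / len(pairs)
print(pairs, mean_loss)
```

The loss shrinks as the reward model scores the human-preferred output higher than the rejected one. A reward model trained this way can then supply the signal for the reinforcement learning stage.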

Human evaluators preferred outputs from the 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3, even though InstructGPT has roughly 100 times fewer parameters.

On test data, the InstructGPT model is more likely to respond truthfully and less likely to generate harmful content. Though InstructGPT still occasionally makes basic errors, these findings demonstrate that fine-tuning with a human in the loop is a viable route for aligning language models with human intent.

TL;DR: this paper shows that reinforcement learning from human feedback is an extremely helpful, low-resource way to make existing models more useful.


3. A Generalist Agent


This paper explores improvements resulting in a model capable of playing Atari, captioning pictures, generating text, stacking physical blocks using a robot arm, and much more.

The model, Gato, is composed of a single neural network with unchanged weights across assorted tasks.

Gato resulted from scaled-up behavior cloning, framed as a sequence modeling problem. Encoding many modalities into a single token space was the most significant barrier the researchers faced. The study makes a number of advancements in the tokenization of standard vision and language datasets. In addition, the researchers sought novel solutions to the typical sequence model problem of determining context window length.
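As a rough illustration of what mapping several modalities into one token space can look like, the Python sketch below bins continuous values (for example, robot joint readings) and offsets them past a text vocabulary so both share one sequence. The vocabulary size, bin count, value range, and function names are all assumptions made for the example, not details taken from this writeup.

```python
TEXT_VOCAB_SIZE = 32000  # hypothetical size of the text vocabulary
NUM_BINS = 1024          # hypothetical number of bins for continuous values

def tokenize_continuous(values, low=-1.0, high=1.0):
    """Map continuous values to token ids placed after the text vocabulary."""
    tokens = []
    for value in values:
        clipped = min(max(value, low), high)
        bin_index = int((clipped - low) / (high - low) * NUM_BINS)
        bin_index = min(bin_index, NUM_BINS - 1)  # value == high goes in the last bin
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens

def build_sequence(text_token_ids, continuous_values):
    """Concatenate text tokens and binned continuous tokens into one sequence."""
    return list(text_token_ids) + tokenize_continuous(continuous_values)

# Two made-up text token ids followed by three made-up sensor readings.
sequence = build_sequence([5, 17], [-1.0, 0.0, 1.0])
print(sequence)
```

Since every modality ends up as integers in one shared id range, a single sequence model can be trained on all of them with the same next-token objective.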

TL;DR: this paper shows that multimodal models can perform very well and are likely the future of the modeling paradigm. In contrast to previous state-of-the-art models that were capable of performing only in a narrow area, Gato executes a generalist policy capable of a variety of tasks and multiple modalities.


4. Large Language Models are Zero-Shot Reasoners


LLMs are remarkable few-shot learners using narrow, task-specific examples. This research paper demonstrates that LLMs are also competent zero-shot reasoners, particularly when prompted with the phrase, "let’s think step by step."

Yes, you read that right.

Instructing an LLM to “think step by step” actually improves results enough to justify a paper.

Kojima et al. did not train a new model. Instead, they found that this prompt substantially improved the zero-shot performance of existing LLMs on reasoning benchmarks spanning arithmetic (e.g., MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (e.g., Last Letter, Coin Flip), and logical reasoning (e.g., Date Understanding, Tracking Shuffled Objects).

The adaptability of this single prompt, "think step by step," over a wide range of reasoning tasks suggests that the zero-shot skills were previously significantly underutilized. Remarkably high-level, multi-task capabilities may be retrieved simply by employing a linguistic framing of the problem that requests a higher cognitive load.
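In practice, the trick amounts to a small change in how the prompt is built. The Python sketch below assumes a two-stage flow: first elicit the reasoning, then ask for the final answer. The `generate` callable is a hypothetical stand-in for whatever model API you use, and the answer-extraction wording is an assumption rather than a quote from this writeup.

```python
REASONING_TRIGGER = "Let's think step by step."

def build_reasoning_prompt(question):
    """Stage 1: append the trigger phrase so the model writes out its reasoning."""
    return f"Q: {question}\nA: {REASONING_TRIGGER}"

def build_answer_prompt(question, reasoning):
    """Stage 2: feed the reasoning back and ask for a final answer."""
    return f"{build_reasoning_prompt(question)} {reasoning}\nTherefore, the answer is"

def zero_shot_cot(question, generate):
    """Run both stages with a caller-supplied text generation function."""
    reasoning = generate(build_reasoning_prompt(question))
    return generate(build_answer_prompt(question, reasoning))

def fake_generate(prompt):
    """Stand-in for a real model call; returns canned text for the demo."""
    if prompt.endswith("the answer is"):
        return " 8."
    return "Start with 3 apples, add 5 more, so 3 + 5 = 8."

question = "I have 3 apples and buy 5 more. How many apples do I have?"
print(build_reasoning_prompt(question))
print(zero_shot_cot(question, fake_generate))
```

Swapping `fake_generate` for a real model call is the only change needed to try this on an actual LLM. No examples or fine-tuning are involved, which is what makes the approach zero-shot.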

My mind is blown.

TL;DR: this paper shows that the quality of an LLM's answer is largely dependent on the wording of the prompt.




Machine learning has advanced significantly in the past four years. Only time will tell if this pace of development can be sustained.

These papers cover the latest advances in NLP and reveal considerable room for improving training processes through larger datasets and human-in-the-loop reinforcement learning.

Recent research also explores the creation of multimodal paradigms and enhanced zero-shot reasoning capabilities via simple alterations to the model’s input prompts.

Nicole Janeway Bills is the Community Organizer at Data Strategy Professionals. She offers a proven track record of training data practitioners to quickly and effectively ace the CDMP Exams. In her work as a Data Strategy consultant, Nicole has helped set up data collection, data storage, and data analytics functions. She applies best practices to solve clients’ most pressing challenges. Furthermore, she has worked as a Data Scientist and Project Manager for federal and commercial consulting teams. Her business experience includes natural language processing, cloud computing, statistical testing, pricing analysis, ETL processes, and web and application development.

Original. Reposted with permission.