Top Open Source Large Language Models

In this article, we will discuss the importance of large language models and suggest some of the top open source models and the NLP tasks they can be used for.

Sponsored Post

By Maziyar Panahi, Senior Data Scientist and Spark NLP Lead at John Snow Labs

What is a Large Language Model?

A Language Model is, at its heart, a probability distribution over sequences of tokens (words). Language Models are the core of modern Natural Language Processing (NLP) and can be applied to a variety of NLP tasks such as speech-to-text, sentiment analysis, text summarization, spell checking, and token classification. In most NLP tasks, a Language Model determines the probability of the next token by analyzing the given text. A Language Model can take the form of Unigrams, N-grams, Exponential models, or Neural networks.
In 2019, Language Modelling saw a big boost in popularity thanks to the development of transformers like BERT, GPT-2, and XLM. These transformer-based models can be adapted from a general-purpose language model to a specific downstream task, a process known as fine-tuning. Fine-tuning requires far less data than training a language model from scratch, which is one of the reasons transformer-based models stand out compared to previous approaches to Language Modelling.
Another reason Language Models are so remarkable is that a single model can serve multiple downstream NLP tasks such as question answering, token and text classification, document summarization, text generation, translation, and many more. Transformer-based language models can perform these NLP tasks at much higher levels thanks to their larger parameter counts and larger training datasets.
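To make "a probability distribution over sequences of tokens" concrete, here is a toy bigram (2-gram) model estimated from a tiny corpus. This is an illustrative sketch only (the function name and corpus are made up for the example), not how modern neural language models are built, but the underlying question — what is the probability of the next token given what came before? — is the same:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count token-pair frequencies and normalize them into P(next | prev)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }

corpus = [
    "the model predicts the next token",
    "the model learns from data",
]
model = train_bigram_model(corpus)
print(model["the"])  # distribution over tokens that follow "the"
```

A neural language model replaces these raw counts with a learned function of the entire preceding context, but it still outputs a distribution over the next token.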

This post includes some of the top open-source language models that speakers from the NLP Summit 2022 and our team at John Snow Labs find particularly useful. Their advanced architectures allow us to achieve state-of-the-art benchmarks on the following downstream NLP tasks: token and text classification, question answering, translation, summarization, and text generation.


1.  GPT-Neo, GPT-J, and GPT-NeoX

GPT-Neo, GPT-J, and GPT-NeoX are very powerful AI models that can be used for few-shot learning problems. Few-shot learning is like training/fine-tuning any deep learning model, except that it requires only a limited number of samples.

The GPT-Neo, GPT-J, and GPT-NeoX models were trained and released by EleutherAI as open-source counterparts to GPT-3, which was released by OpenAI and remains private to this day. GPT-J and GPT-Neo are similar to GPT-2, and both were trained on the Pile dataset. The Pile is an 825 GiB open-source language-modelling dataset made up of 22 smaller datasets combined. The importance of the Pile lies in the diversity of its data sources, which improves general cross-domain knowledge as well as downstream NLP tasks.

GPT-NeoX is an improvement on the previously released open-source GPT models, primarily built with Megatron-LM and DeepSpeed. Whereas GPT-Neo was constructed on Mesh TensorFlow and designed for TPUs, GPT-NeoX was designed to run on GPUs because of its complexity and size. GPT-NeoX-20B has 20 billion parameters and was trained on the Pile, which made it the largest dense autoregressive model publicly available at its release. Thanks to few-shot learning, GPT-NeoX-20B can help develop proofs of concept for measuring the feasibility of a project.
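In practice, few-shot learning with a generative model like GPT-J or GPT-NeoX often starts by packing a handful of labeled examples plus a new query into a single prompt and letting the model continue the pattern. The sketch below only builds such a prompt string; the function name and prompt format are illustrative, not a library API. The resulting string would then be fed to the model, e.g. through a text-generation pipeline:

```python
def build_few_shot_prompt(task_description, examples, query):
    """Format a handful of labeled examples into one prompt string, so a
    generative model can infer the task pattern in-context."""
    lines = [task_description, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")  # blank line between examples
    # End with the unanswered query; the model's continuation is the answer.
    lines.append(f"Text: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each text as positive or negative.",
    [("I loved this movie", "positive"), ("The service was awful", "negative")],
    "What a fantastic result",
)
print(prompt)
```

The same pattern works for question answering, translation, or any task the model can infer from a few demonstrations.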


2.  XLNet

Researchers at Carnegie Mellon University and Google developed XLNet to perform NLP tasks such as reading comprehension, text classification, sentiment analysis, and others. It follows a generalized autoregressive pre-training method: its formulation enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, which overcomes the limitations of BERT.
Moreover, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pre-training. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks, including question answering, natural language inference, sentiment analysis, and document ranking.
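The "permutations of the factorization order" idea can be illustrated in a few lines of plain Python. This is a toy enumeration, not XLNet's actual implementation (which realizes permutations through attention masks, not by reordering tokens): each factorization order determines which tokens a position may condition on, and XLNet maximizes the expected likelihood across such orders, so every token eventually learns from context on both sides.

```python
from itertools import permutations

def factorization_contexts(tokens):
    """For each permutation (factorization order), list which tokens each
    position is conditioned on when predicted autoregressively."""
    result = {}
    for order in permutations(range(len(tokens))):
        ctx = []
        for i, pos in enumerate(order):
            seen = sorted(order[:i])  # positions already "revealed"
            ctx.append((tokens[pos], [tokens[s] for s in seen]))
        result[order] = ctx
    return result

contexts = factorization_contexts(["New", "York", "is"])
# Under order (1, 0, 2), "York" is predicted first with no context,
# and then "New" is predicted conditioned on "York":
print(contexts[(1, 0, 2)])
```

Averaged over all orders, each token is predicted with contexts drawn from both its left and its right, which is how the objective captures bidirectionality while staying autoregressive.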


3.  RoBERTa

Researchers at Facebook AI and the University of Washington analyzed how Google’s Bidirectional Encoder Representations from Transformers (BERT) was trained and made several changes to the training process that enhanced its performance: they used a larger training dataset, chose larger mini-batches, removed the Next Sentence Prediction (NSP) objective, and trained the model for far more iterations than BERT. The result is an optimized model called RoBERTa (Robustly Optimized BERT Approach) that matched the XLNet model’s scores on the GLUE (General Language Understanding Evaluation) benchmark.
Transfer learning in NLP has proven highly effective for text-classification tasks. The RoBERTa models achieve competitive accuracy across a wide range of downstream tasks, which has made them a go-to choice for both token and text classification at many companies.
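One of RoBERTa's training changes was dynamic masking: instead of fixing which tokens are masked once during preprocessing (as in the original BERT), the masking pattern is re-sampled every time a sequence is fed to the model. A minimal plain-Python sketch of that idea (the function name and 15% rate follow the standard masked-LM setup; real implementations work on subword IDs, not whole words):

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def dynamic_mask(tokens, rng):
    """Re-sample which ~15% of tokens are masked on every call, so each
    epoch sees a different masking pattern rather than a static one."""
    n_mask = max(1, round(len(tokens) * MASK_PROB))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)
epoch1 = dynamic_mask(tokens, rng)  # masking pattern for one pass
epoch2 = dynamic_mask(tokens, rng)  # likely a different pattern next pass
```

Because the model rarely sees the same masking of the same sentence twice, it effectively trains on many more distinct prediction problems from the same corpus.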


4.  DeBERTa

Researchers at Microsoft Research proposed DeBERTa (Decoding-enhanced BERT with disentangled attention) to improve on both the BERT and RoBERTa models using two techniques. First, it disentangles the attention mechanism: each word is represented by two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices over their contents and relative positions. Second, an enhanced mask decoder replaces the output softmax layer to predict the masked tokens during model pre-training. These two techniques significantly improve the efficiency of model pre-training and the performance of downstream tasks.
At the time of publication, DeBERTa was the first model to surpass the human baseline on the SuperGLUE benchmark. To this day, DeBERTa models are used for a variety of NLP tasks such as question answering, summarization, and token and text classification.
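The disentangled-attention scoring can be sketched with plain dot products. This is a toy illustration of the three-term decomposition only; in the real model, content and relative-position embeddings are learned and projected through separate query/key matrices, and the vector values below are made up for the example:

```python
def dot(a, b):
    """Plain dot product over two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def disentangled_score(content_i, content_j, rel_pos_ij, rel_pos_ji):
    """Attention score as a sum of three terms: content-to-content,
    content-to-relative-position, and position-to-content, instead of a
    single dot product over merged content+position vectors."""
    c2c = dot(content_i, content_j)     # what the words mean
    c2p = dot(content_i, rel_pos_ij)    # word i attending to j's relative position
    p2c = dot(rel_pos_ji, content_j)    # i's relative position attending to word j
    return c2c + c2p + p2c

# Toy 2-d vectors, purely illustrative:
score = disentangled_score([1.0, 0.5], [0.2, 0.4], [0.1, 0.0], [0.0, 0.3])
```

Keeping content and position in separate vectors lets the model weigh "what the neighboring word is" and "how far away it is" independently, which is the intuition behind the disentangled matrices mentioned above.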



5.  XLM-RoBERTa

In today’s business world, most successful companies have reached the point where they must offer their services in languages other than English. Developed by researchers at the Facebook AI team, XLM-RoBERTa is a transformer-based language model capable of processing text in 100 different languages.

In the past, supporting a new language meant multiplying the effort for each one and dealing with its intricate details. While multilingual models like XLM-RoBERTa often don't deliver the best per-task performance, they allow businesses to generate value for non-English-speaking users much faster.


6.  DistilBERT

While XLNet, RoBERTa, and DeBERTa aim to improve BERT’s accuracy, DistilBERT has a different goal: inference speed. It targets the large size and slow inference of BERT_{BASE} and BERT_{LARGE}, which have 110M and 340M parameters respectively, while keeping as much of their power as possible. DistilBERT reduces the size of BERT_{BASE} by 40% and improves its speed by 60% while retaining 97% of its capabilities.
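DistilBERT achieves this through knowledge distillation: the small student model is trained to match the large teacher's softened output distribution. Below is a minimal sketch of just the soft-target loss (DistilBERT's full objective also combines a masked-language-modelling loss and a cosine embedding loss, omitted here; the logit values are made up for the example):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature that softens the distribution when > 1."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the
    student's — the 'soft target' component of knowledge distillation."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

# A student whose logits agree with the teacher's incurs a lower loss
# than one whose logits disagree:
aligned = distillation_loss([4.0, 1.0, 0.5], [4.0, 1.0, 0.5])
misaligned = distillation_loss([4.0, 1.0, 0.5], [0.5, 1.0, 4.0])
```

The temperature softens the teacher's distribution so the student also learns from the relative probabilities of the "wrong" classes, not just the top prediction.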

Join the world’s largest applied NLP community at the NLP Summit 2022 from October 4-6, 2022 to learn more about the top large language models and how they are used to solve business problems. The virtual event features three days of immersive, industry-focused content in over 50 technical sessions, including the following talks related to large language models:

  • Deploying BLOOM: A 176B Parameter Multi-Lingual Large Language Model – hear more about the world’s largest open-source large language model, presented by the Hugging Face team.
  • Demystifying Large Language Models: How Transformers can be Applied in Practice – by Stella Biderman, Lead Scientist at Booz Allen Hamilton, about the new open-source large language models built by EleutherAI such as GPT-NeoX-20B
  • Sparse Expert Models: Past and Future – about recent work at Google Brain and OpenAI to build computationally efficient language models

Click here to see the full program and register for free!