Nine Times Bigger than GPT-3: Inside Google’s TRILLION Parameter Switch Transformer Model

Google’s Switch Transformer model could be the next breakthrough in large-scale language modeling.





OpenAI’s GPT-3 is arguably one of the most famous deep learning models created in the last few years. One of the most impressive things about GPT-3 is its size. In a sense, GPT-3 is nothing but GPT-2 with a lot more parameters. With 175 billion parameters, GPT-3 was about ten times larger than any previous dense language model.

Knowing that, how would you feel about a model that is roughly nine times larger than GPT-3? This is precisely what a team from Google Research achieved with their novel Switch Transformer architecture. The new model features an unfathomable 1.6 trillion parameters, which makes it about nine times larger than GPT-3 by raw parameter count (1.6 trillion / 175 billion ≈ 9.1).

1.6 trillion parameters is certainly impressive, but that’s not the most impressive contribution of the Switch Transformer architecture. With this new model, Google is essentially unveiling a method that maximizes the parameter count of a transformer model in a simple and computationally efficient way. Transformer models like GPT-3 are not only huge but also computationally expensive, which limits their adoption in mainstream scenarios.
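To make the "more parameters without more compute" idea concrete, here is a rough back-of-the-envelope sketch. It assumes a standard two-layer feed-forward block and a plain linear router; the function names and the sizes (512, 2048, 64 experts) are made up for illustration and are not the configuration of any published model.

```python
def ffn_params(d_model, d_ff):
    """Weights and biases of one two-layer feed-forward block."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def switch_ffn_params(d_model, d_ff, num_experts):
    """Total parameters stored: one FFN per expert plus a linear router."""
    router = d_model * num_experts
    return num_experts * ffn_params(d_model, d_ff) + router


def active_params_per_token(d_model, d_ff, num_experts):
    """Parameters a single token touches when routed to exactly one expert."""
    router = d_model * num_experts
    return ffn_params(d_model, d_ff) + router


# Illustrative sizes only (assumption), chosen to keep the numbers readable.
d_model, d_ff, num_experts = 512, 2048, 64

dense = ffn_params(d_model, d_ff)
total = switch_ffn_params(d_model, d_ff, num_experts)
active = active_params_per_token(d_model, d_ff, num_experts)

print(f"dense FFN parameters:          {dense:,}")
print(f"sparse layer, total stored:    {total:,}")
print(f"sparse layer, used per token:  {active:,}")
print(f"stored / used ratio:           {total / active:.1f}x")
```

The stored parameter count grows with the number of experts, while the work done for any one token stays close to that of a single dense block. That gap is what lets a sparse model reach a very large parameter count at a modest compute cost per token.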

The key modification of the Switch Transformer architecture is the introduction of a Mixture of Experts (MoE) routing layer that facilitates learning sparse models instead of one super large dense model. This is not as confusing as it reads, so let me try to explain. Typical transformer architectures are composed of the famous attention layer followed by a dense feed-forward network. Among other things, that dense layer is responsible for much of the cost of training transformer models. Google’s Switch Transformer proposes replacing that layer with what they call a Switch FFN layer. A router in that layer looks at each input token and decides which one of several smaller feed-forward networks (the experts) should process it. The Switch FFN layer offers three main benefits:

  1. The router computation is small, as each token is routed to only a single expert.
  2. The capacity of each expert network can remain manageable.
  3. The router implementation is simple.
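The routing idea described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Google’s implementation: the helper names (`make_expert`, `run_expert`, `switch_ffn`) are hypothetical, the weights are random, and the sketch leaves out the pieces a real system needs, such as per-expert capacity limits, load balancing across experts, and training.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_expert(d_model, d_ff, rng):
    """A tiny two-layer feed-forward network with random weights (no biases)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    return w_in, w_out


def run_expert(expert, token):
    """Apply one expert: linear, ReLU, linear."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(w_in, token)]
    return matvec(w_out, hidden)


def switch_ffn(tokens, router_weights, experts):
    """Send each token to its single highest-probability expert (top-1 routing)."""
    outputs, choices = [], []
    for token in tokens:
        # The router is one small linear layer followed by a softmax.
        probs = softmax(matvec(router_weights, token))
        best = max(range(len(experts)), key=lambda i: probs[i])
        # Only the chosen expert runs; the others are skipped for this token.
        expert_out = run_expert(experts[best], token)
        # Scale the output by the router probability of the chosen expert.
        outputs.append([probs[best] * y for y in expert_out])
        choices.append(best)
    return outputs, choices


# Toy usage: 5 random tokens, 3 experts (all sizes are arbitrary).
rng = random.Random(0)
d_model, d_ff, num_experts = 4, 8, 3
experts = [make_expert(d_model, d_ff, rng) for _ in range(num_experts)]
router_weights = [[rng.gauss(0, 1) for _ in range(d_model)]
                  for _ in range(num_experts)]
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(5)]

outputs, choices = switch_ffn(tokens, router_weights, experts)
print("expert chosen for each token:", choices)
```

Notice that the router itself is just one matrix-vector product and an argmax, and that each token pays for exactly one expert no matter how many experts exist.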


With the new optimizations, Google was able to scale a Switch Transformer model to an astonishing 1.6 trillion parameters! Training speed improved by up to seven times compared to previous architectures.

Miraculously, the Switch Transformer release has managed to remain under the radar. Somehow, it reminds me of the original BERT paper that triggered the whole transformer movement. However, if the hype behind GPT-3 is any indication of what’s to come, keep an eye out for new milestones using the Switch Transformer.

Original. Reposted with permission.