BERT is changing the NLP landscape

BERT is changing the NLP landscape and making chatbots much smarter by enabling computers to better understand speech and respond intelligently in real-time.

By Phillip Green, Informatics4AI.

Last year, I was worried that conversational AI would never shed its dunce cap. Today, I am happy to report that major strides are being made and NLP is on the cusp of huge change.

NLP’s Evolution from Dumb to Smart

First, to understand why things are changing so fast, we need a quick review of NLP’s history. Before the 1980s, most NLP systems were rules-based and grounded in the work of Noam Chomsky, who believed that the rules of grammar (transformational-generative grammar) could be used to understand semantic relations and thus lead machines to an understanding of speech. However, in the late ’80s, machine learning algorithms became increasingly popular, and the shift from rules to statistical models began.

The next big NLP leap began in 2013 with the introduction of word embeddings: first Word2vec, followed by GloVe and FastText. Word embeddings attempt to encapsulate the “meaning” of a word in a vector after reading massive amounts of text and analyzing how each word appears in various contexts across a dataset. The idea is that words with similar meanings will have similar vectors.

The biggest drawback of these first-generation word embeddings was that each word had only one vector when it can, in fact, have multiple meanings (for example, Mercury is a planet, a metal, a car, or a Roman god). These drawbacks result from the fact that early word embedding models train with a small neural network (shallow training) for efficiency reasons. However, with Google’s release of BERT, we are indeed at an inflection point.
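
To make the “similar meaning, similar vector” idea concrete, here is a minimal sketch in Python. The three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions and are learned from text), and the helper names cosine_similarity and most_similar are my own, not part of any library. Note that "mercury" gets exactly one vector, which is the limitation described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, made up for this example only.
# A static embedding table stores exactly one vector per word.
static_embeddings = {
    "mercury": [0.6, 0.5, 0.6],   # one vector must cover planet, metal, god...
    "venus":   [0.9, 0.1, 0.2],
    "jupiter": [0.8, 0.2, 0.3],
    "lead":    [0.1, 0.9, 0.2],
}


def most_similar(word, embeddings):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = embeddings[word]
    scores = {
        other: cosine_similarity(query, vec)
        for other, vec in embeddings.items()
        if other != word
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_similar("venus", static_embeddings):
        print(f"venus vs {other}: {score:.3f}")
```

With these toy numbers, "venus" ranks closer to "jupiter" than to "lead", while "mercury" sits somewhere in between because its single vector has to average over all of its senses.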

What Makes BERT so Amazing?

Three things:

  1. BERT is a contextual model, which means that word embeddings are generated based on the context of the word’s use in a sentence, and thus a single word can have multiple embeddings. For example, BERT would produce different embeddings for Mercury in the following two sentences: “Mercury is visible in the night sky” and “Mercury is often confused with Hermes, the fleet-footed messenger of Greek gods.”
  2. BERT enables transfer learning. This is referred to as “NLP’s ImageNet Moment.” Google has pre-trained BERT on large text corpora, including Wikipedia, and this pre-trained model can now be adapted to more specific tasks, such as a customer support bot for your company. Remember that pre-training is the expensive part, and you can now skip it. So, your starting point is a smart model (already trained on general human language), not just an algorithm in need of training.
  3. BERT can be fine-tuned cheaply and quickly on a small set of domain-specific data and will yield more accurate results than training on the same domain-specific data from scratch.
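
The first point, one word with several context-dependent embeddings, can be illustrated with a toy sketch. This is not how BERT works internally (BERT uses a deep Transformer with attention); it is a deliberately simple stand-in that blends a word’s static vector with the average of its neighbors’ vectors. The two-dimensional vectors, the vocabulary, and the function name contextual_embedding are all assumptions made up for this example.

```python
# Hypothetical 2-d vectors: axis 0 ~ "astronomy", axis 1 ~ "mythology".
static = {
    "mercury": [0.5, 0.5],   # ambiguous on its own
    "night":   [0.9, 0.1],
    "sky":     [1.0, 0.0],
    "hermes":  [0.0, 1.0],
    "gods":    [0.1, 0.9],
}


def contextual_embedding(word, sentence, static, mix=0.5):
    """Toy contextual vector: blend the word's static vector with the
    mean vector of the other known words in the sentence."""
    target = static[word]
    context = [static[w] for w in sentence if w != word and w in static]
    if not context:
        return list(target)  # no known neighbors: fall back to static vector
    dims = len(target)
    mean = [sum(vec[i] for vec in context) / len(context) for i in range(dims)]
    return [(1 - mix) * t + mix * m for t, m in zip(target, mean)]


sentence_sky = "mercury is visible in the night sky".split()
sentence_myth = "mercury is often confused with hermes messenger of greek gods".split()

mercury_sky = contextual_embedding("mercury", sentence_sky, static)
mercury_myth = contextual_embedding("mercury", sentence_myth, static)

if __name__ == "__main__":
    print("mercury (night sky):", mercury_sky)
    print("mercury (mythology):", mercury_myth)
```

The same word ends up with two different vectors, one leaning toward the astronomy axis and one toward the mythology axis. A real contextual model learns this behavior from data rather than relying on a hand-written averaging rule.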

Holy smokes!

BERT – Coming to an Application Near You

What’s even more exciting is that, while many changes in AI and machine learning happen behind the scenes, much of this next-generation NLP is already being used in consumer products that you and I use every day.

If you use Gmail, you know exactly what I am talking about.

  • Suggested replies to emails – BERT.
  • Suggestions for the next word in the sentence – BERT.

I love these capabilities in Gmail, and BERT is now being utilized in many conversational AI applications. So, your chatbot should be getting smarter.

Data is Still King

Note that two critical elements enabled Google to build BERT. The first is Google’s trove of data and its ability to continuously refine BERT. Let’s go back to the Gmail example of auto-suggesting the next word. Every time you accept a suggestion and use that word, you are training the model. Every time you keep typing and use a different word from the suggestion, you are training the model. If using Gmail makes Google the smartest BERT practitioner on the planet, how can the little guy/gal ever catch up?

Moore’s Law is Alive and Well

The second key element enabling advances such as BERT is the continuing increase in the speed and capability of computers, especially NVIDIA’s GPUs and Google’s TPUs. Remember that early word embedding models had to be highly efficient due to the state and cost of computing. BERT is far less efficient, but computing power has more than caught up. In fact, NVIDIA has just announced that it supports BERT and now claims that its AI platform has the fastest BERT training capabilities available. In addition, NVIDIA claims to achieve very fast predictions (responses), which are needed in real-time chat applications. It has also created the Inception Program to help conversational AI startups.

In Closing

BERT, and models like it, are game-changers in NLP. Computers can better understand speech and respond intelligently in real-time. If NLP’s dunce cap hasn’t been fully swept away, it will be soon.

What are your thoughts on the state of NLP and BERT?


Bio: Phillip Green is the founder and CEO of Informatics4AI. He has decades of experience in knowledge management, search, and artificial intelligence. Informatics4AI helps customers improve the performance of search systems and AI models built on text datasets.  He can be reached at