The Evolution of Speech Recognition Metrics

As Speech Recognition has become more accurate than ever, scenarios like dictation and meeting transcription are gaining popularity. Metrics need to evolve with the times and guide the research focus.

By Piyush Behre, Principal Applied Scientist at Microsoft on October 20, 2022 in Natural Language Processing

The Evolution of Speech Recognition Metrics

Image by pch.vector on Freepik

With the widespread adoption of deep learning, progress in Automatic Speech Recognition (ASR) accuracy has accelerated in the last few years. Researchers have been quick to claim human parity on specific measures. But is Speech Recognition really a solved problem today? If not, what we focus on measuring today is going to impact the direction ASR takes next.

Human Parity was the End Goal, but now we move the Goalpost!

ASR systems have for a long time focused on word error rate (WER) to measure accuracy. This metric makes sense. WER measures how many words the system gets wrong for every hundred words it recognizes. Because system performance can vary significantly depending on the scenario, audio quality, accent, etc., it is hard to put a single number as the end goal. (Of course, 0% would be nice). So, instead, we have typically used WER to compare two systems. And as the system started to get a lot better, we started comparing its WER with that of humans. Human parity was originally thought of as a distant goal, then deep learning accelerated things and we ended up “achieving” that a lot sooner.

If we are Comparing Machines Against Humans, Who are we Comparing Humans Against?

Human judges are first asked to transcribe audio. To generate transcription references, different judges do multiple passes of listening to the audio and editing to get more accurate transcriptions. Transcription is considered clean when multiple judges reach a consensus. Human baseline WER is measured by comparing a new judge’s transcription to that of the reference transcription. One judge can still be incorrect at times, when compared to a reference, in fact, typically getting one out of twenty words incorrect. This is equivalent to a WER of 5%.

To claim human parity, we typically compare the system against one judge. An interesting aspect of this is that ASR systems do try to do multiple passes themselves. Then is it fair to compare against one-person performance? It is indeed fair because we assume that the person also had enough time to replay the audio as needed when transcribing, effectively doing multiple passes.

Once ASR reached human parity (En-US) for certain Speech Recognition tasks, it quickly became apparent that for formally written form, getting one word wrong out of every twenty words is still a terrible experience. One way dictation products have tried tackling this issue is by trying to figure out which recognized words are low confidence and surfacing viable alternatives for them. We can observe this ‘correction’ experience in many commercial products today, including Microsoft Office Dictation and Google Docs (Fig. 1 and Fig. 2). However, there was another glaring issue with WER.

Figure 1: Microsoft Word Dictation showing alternates for dictated text

Figure 2: Google Docs showing alternates for dictated text

To keep things simple, traditionally WER wasn’t calculated on the final written form text. One could argue that the primary task of the ASR system was just to recognize the words correctly and not format entities like date, time, currency, email, etc. correctly. So instead of properly formatted text, the spoken form version was used for WER calculation. It canceled out any specific formatting differences, punctuation, capitalization, etc., and purely focused on the spoken words. This is an acceptable assumption if the use case is voice search, where task completion matters much more than text formatting. Searching with a voice could be made to work with just the ‘spoken form’. However, with a different use case like voice assistant, things started changing. Now the spoken form “wake me up at eight thirty-seven am” was much harder to handle compared to the written form “wake me up at 8:37 am”. The written form here is easier for the assistant to parse and turn into action.

Voice search and voice assistants are examples of what we call one-shot dictation use cases. As the ASR system became more reliable with one-shot dictation use cases, attention turned to dictation and conversation scenarios. These are both long-form speech recognition tasks. It was easy to ignore punctuation for voice search or voice assistant, but for any long-form dictation or conversation, not being able to punctuate is a dealbreaker. Since automatic punctuation models weren’t as good yet, one route that dictation took was to support ‘explicit punctuation’. You could explicitly say ‘period’ or ‘question mark’ and the system would do the right thing. It gave control to the users and ‘unblocked’ them to use dictation for writing emails or documents. Other aspects like capitalization or disfluency handling also started gaining importance for dictation scenarios. If we continued to rely on WER as the primary metric, we would be falsely painting the picture that our system has solved it. Our metric needed to evolve with the evolving use cases for Speech Recognition.

An obvious successor of Word Error Rate (WER) is Token Error Rate (TER). For TER, we try to consider all aspects of the written form like capitalization, punctuation, dysfluency, etc., and try to calculate a single metric just like WER (see Table 1). On the same sets where ASR had reached human parity on WER when re-evaluated with TER, ASR lost the human parity battle again. Our goalpost has moved, but this time it does feel like a real goalpost because the outcome is going to be much closer to the written form that is widely used.

Recognition

Reference

Metric

Spoken form

wake me pat eight thirty five a m

wake me up at eight twenty five a m

WER = 3/9 = 33.3%

Written form

Wake me pat 8:35 AM.

Wake me up at 8:25 AM.

TER = 3/7 = 42.9%

(Punctuation is counted separately)

Table 1: How does spoken form vs written form impact metrics computation?

The Problem with these Aggregate Metrics

TER is a good overall metric to look at, but it hides within all the crucial details. There are now many categories contributing to this one number, and it isn’t about just getting the words right. So, this metric by itself is less actionable. To make it more actionable, we need to figure out which category is contributing the most to it and decide to focus on improving that. There is a class imbalance issue as well. Less than 2% of all tokens would contain any numbers and number-related formats like time, currency, date, etc. Even if we get this category completely wrong it would still have a limited impact on TER depending on the TER baseline. But getting this 2% of the cases even wrong half the time would be a terrible experience for the users. For this reason, the TER metric alone cannot guide our research investments. I’d argue that we need to figure out important categories for our users and measure and improve more targeted metrics like category-F1.

How to Figure Out What’s Important for our Users?

Aha! This is the most important question for ASR. WER, TER, or category-F1 are all metrics for scientists to validate their progress, but still can be quite far off from what’s important for the users. So, what is important? To answer this question, we’ll have to go back to why users need an ASR system in the first place. This, of course, depends on the scenario. Let’s take the case of dictation first. The sole purpose of dictation is to be able to replace typing. Don’t we already have a metric for that? Words per minute (WPM) is already a well-established metric to measure typing efficiency. I’d argue this is the perfect metric for Dictation. If dictation users can achieve much higher WPM by dictating than typing, the ASR system has done its job. Of course, WPM here takes well into account how users need to go back and correct things that can slow them down. There would be some errors that are a must-fix for users, and then there are some errors that are acceptable. This naturally assigns a higher weight to what matters and may even disagree with TER which gives equal weight to all errors. Beautiful!

Can we use the same Logic for Conversation or Meeting Transcription?

The meeting transcription case is quite different from dictation, in its objective. Dictation is a human2machine use-case and meeting transcription is a human2human use-case. WPM stops being the right metric. However, if there is a purpose to generating transcriptions, the right metric lies around that purpose. For instance, the goal of a broadcast meeting might be to generate human-readable transcriptions at the end, so how many edits a human annotator have to make on top of the machine-generated transcript becomes a metric. This is like TER, except it doesn’t matter if some sections are misrecognized, what matters is that the result is coherent and flows well.

Another purpose could be to extract actionable insights from transcriptions or generate summaries. These are harder to measure to connect back to recognition accuracy. However, human interactions can still be measured, and a task completion or engagement-type metric is more suitable.

The metrics we discussed so far can be categorized as ‘online’ metrics and ‘offline’ metrics, as summarized in Table 2. In an ideal world, they move together. I’d argue that while offline metrics could be a good indicator of potential improvements, ‘online’ metrics are the real measure of success. When improved ASR models are ready, it is important to first measure the right offline metrics. Shipping them is only half the job done. The real test is passed when these models improve the ‘online’ metrics for the customers.

	Offline Metrics	Online Metrics
Voice Search	Spoken-form Word Error Rate (WER)	Successful Click on relevant search result
Voice Assistant	Token Error Rate (TER), Formatting F1 for time, date, phone numbers, etc.	Task Completion rate
Voice Typing (Dictation)	Token Error Rate (TER), Punctuation-F1, Capitalization-F1,	Words per minute (WPM), Edit rate, User retention/engagement
Meeting (Conversation)	Token Error Rate (TER), Punctuation-F1, Capitalization-F1, Disfluency-F1	Transcription edit rate, User engagement with Transcription UI

Table 2: Offline and Online metrics for Speech Recognition

References

"Achieving Human Parity in Conversational Speech Recognition - arXiv." 17 Oct. 2016, https://arxiv.org/abs/1610.05256.
Word Error Rate https://en.wikipedia.org/wiki/Word_error_rate
F1 score https://en.wikipedia.org/wiki/F-score
Words per minute (WPM) https://en.wikipedia.org/wiki/Words_per_minute

Piyush Behre is a Principal Applied Scientist at Microsoft working on Speech Recognition/Natural Language Processing. He received his Bachelor of Technology degree in Computer Science and Engineering from the Indian Institute of Technology, Roorkee.