Around the World in 60 Days: Getting Deep Speech to Work in Mandarin
Baidu continues to make impressive gains with deep learning. Their latest achievement centers on Mandarin speech recognition, which you can read about here from the researchers involved in the project.
3.1. The resulting system is more accurate than humans on this data
Despite the lack of hand-tuned features or language-specific components, our best Mandarin Chinese speech system transcribes short, voice-query-like utterances better than a typical Mandarin Chinese speaker. To benchmark against humans, we ran a test with 100 randomly selected utterances and had a committee of five humans label all of them. The human committee had an error rate of 4.0%, compared to 3.7% for the speech system. We also compared a single human transcriber to the speech system on 250 randomly selected utterances. In this case the speech system performs much better: an error rate of 5.7% for the speech model compared to 9.7% for the human transcriber.
The fact that the network is better than a human on this task is important because it suggests directions for future research. We may be nearing the limits of improvement on this particular data, but humans will still almost certainly outperform Deep Speech on a broader set of speech data with different noise characteristics. This is especially true if humans are allowed to exploit contextual information. This suggests that work on the robustness of speech recognition systems to changes in background noise and the integration of more contextual information will be important in the future.
3.2. What worked in one language worked in the other
While working on Deep Speech 2, we explored architectures with up to 11 layers, including many bidirectional recurrent layers and convolutional layers, as well as a variety of optimization and systems improvements. All of these techniques are discussed in detail in our paper.
An important pattern developed during our exploration: both the architecture and system improvements generalized across languages. Improvements in one language nearly always resulted in improvements in the other. Examples of this trend can be seen in Tables 1 and 2, taken from our paper. This means that even though we explored a variety of architectures, system improvements, and optimization tricks while working in Mandarin, these were improvements to speech recognition in general rather than improvements specific to Mandarin. Given the large differences between English and Mandarin, this suggests that these improvements would hold for other languages as well.
Table 1
| Language | Architecture | Dev no LM | Dev LM |
|---|---|---|---|
| English (WER) | 5-layer, 1 RNN | 27.79 | 14.39 |
| English (WER) | 9-layer, 7 RNN | 14.93 | 9.52 |
| Mandarin (CER) | 5-layer, 1 RNN | 9.80 | 7.13 |
| Mandarin (CER) | 9-layer, 7 RNN | 7.55 | 5.81 |

Table 2
| Language | Architecture | CPU CTC Time | GPU CTC Time | Speedup |
|---|---|---|---|---|
| English | 5-layer, 3 RNN | 5888.12 | 203.56 | 28.9 |
| Mandarin | 5-layer, 3 RNN | 1688.01 | 135.05 | 12.5 |
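
The English rows in the first table report word error rate (WER) and the Mandarin rows report character error rate (CER). As a rough illustration of how such figures are commonly computed, the sketch below uses the standard edit-distance definition: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The function names and example strings are hypothetical, and the exact scoring and text normalization used in the paper may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER: edit distance over whitespace-separated words (assumed tokenization)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """CER: edit distance over characters, ignoring spaces (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Hypothetical examples: one dropped word, one substituted character.
english_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
mandarin_cer = character_error_rate("今天天气很好", "今天天汽很好")
```

Scoring Mandarin at the character level avoids having to segment the text into words first, which is one reason the two languages are reported with different metrics.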
3.3. More data and bigger networks outperform feature engineering, but they also make it easier to change domains
It is a well-worn adage in the deep learning community at this point that a lot of data and a machine learning technique that can exploit that data tend to work better than almost any amount of careful feature engineering. We find the same thing here, with deeper models working increasingly well. However, the lack of feature engineering in end-to-end deep learning has other advantages. A big advantage of Deep Speech is that we need little domain-specific knowledge to get the system to work in a new language.
3.4. The limiting factor is data
Time spent on a machine-learning problem roughly falls into three categories: getting data, developing algorithms, and training models. One of the reasons deep learning has been so valuable is that it has converted researcher time spent on hand-engineering features into computer time spent on training networks. The end-to-end learning approach for speech recognition further reduces researcher time. GPUs have added so much value because they have reduced training time, and the systems work our team has done to speed up neural network training has reduced it further. We can now train a model on 10,000 hours of speech in around 100 hours on a single 8-GPU node. That much data seems to be sufficient to push the state of the art in other languages. There are currently about 13 languages with more than one hundred million speakers. Therefore we could produce a near state-of-the-art speech recognition system for every language with more than one hundred million speakers in about 60 days on a single node.
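
The 60-day figure follows from simple arithmetic on the numbers quoted above. The sketch below works it through, assuming one training run per language, run back to back on the same node, with no time for tuning or retraining; the constant and function names are illustrative.

```python
HOURS_PER_MODEL = 100  # from the text: ~100 hours to train on 10,000 hours of speech
NUM_LANGUAGES = 13     # from the text: languages with more than 100 million speakers
HOURS_PER_DAY = 24


def sequential_training_days(num_languages, hours_per_model):
    """Days needed to train one model per language, one after another on one node."""
    return num_languages * hours_per_model / HOURS_PER_DAY


days = sequential_training_days(NUM_LANGUAGES, HOURS_PER_MODEL)
print(f"{days:.1f} days")  # 54.2 days
```

Thirteen runs at roughly 100 hours each come to about 1,300 hours, or a little over 54 days, which the text rounds up to about 60.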
Collecting such data sets could be very difficult and prohibitively expensive. However, our results suggest the existence of a universal architecture for speech recognition for all languages. If this is true, technologies like transfer learning will become an even more important research direction to recognize all the world’s languages.
D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.
L. R. Bahl, F. Jelinek, and R. L. Mercer. A maximum likelihood approach to continuous speech recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (2):179–190, 1983.
S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Acoustics, Speech and Signal Processing, IEEE Transactions on, 28(4):357–366, 1980.
A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376. ACM, 2006.
A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. Intelligent Systems, IEEE, 24(2):8–12, 2009.
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
F. H. Jackson and M. A. Kaplan. Lessons learned from fifty years of theory and practice in government language teaching. Georgetown University Round Table on Languages and Linguistics 1999, page 71, 2001.
L. Lamel and G. Adda. On designing pronunciation lexicons for large vocabulary continuous speech recognition. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, volume 1, pages 6–9. IEEE, 1996.
L. Lamel, J.-L. Gauvain, V. B. Le, I. Oparin, and S. Meng. Improved models for Mandarin speech-to-text transcription. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 4660–4663. IEEE, 2011.
J. Norman. Chinese. Cambridge University Press, 1988.
J. Shan, G. Wu, Z. Hu, X. Tang, M. Jansche, and P. J. Moreno. Search by voice in Mandarin Chinese. In INTERSPEECH, pages 354–357, 2010.
Dr. Ryan Prenger is a senior research scientist at Baidu’s Silicon Valley Artificial Intelligence Laboratory (SVAIL). He received his Ph.D. degree in Physics from the University of California, Berkeley in 2008 after working primarily on neural network based machine learning with data from the visual system. He moved to SVAIL in 2014 where he helped build the original “Deep Speech” speech recognition engine and started the initial work on Deep Speech in Mandarin.
Dr. Tony Han is a senior research scientist and a senior manager at Baidu's AI lab in Silicon Valley. He currently leads the Mandarin speech recognition team. He was one of the original contributors to the deep-learning-based Mandarin speech recognition engine (DeepSpeech2). He is currently on leave from his academic position as an Associate Professor of Electrical & Computer Engineering at the University of Missouri (MU).
Original. Reposted with permission.