Talking Machines – Deep Learning in Speech Recognition

A summary of a Talking Machines episode on deep neural networks in speech recognition, featuring George Dahl, one of Geoffrey Hinton’s students, who defended his Ph.D. last month.

Talking Machines is a podcast series about machine learning hosted by Katherine Gorman, a journalist, and Ryan Adams, an Assistant Professor of Computer Science at Harvard. Here is my previous post on the Talking Machines interviews with Geoffrey Hinton, Yoshua Bengio, and Yann LeCun.


Episode 9, “Starting Simple and Machine Learning in Meds”, is a conversation with George Dahl, who uses deep neural networks in speech recognition.

Start of the adventure

“In 2009, there was a workshop on deep learning for speech at NIPS, and we sent our work to the workshop.”

It turned out that the entire workshop revolved around this paper, “Deep Belief Networks for phone recognition”, with speakers commenting on it and people arguing about whether it would work. Although most people were not really aware of its value, Li Deng from Microsoft was excited about George’s work, since he was familiar with TIMIT and knew what the result meant.


After George was invited to Microsoft for an internship, he and his colleagues had the first exciting results of their approach on a large vocabulary test. They basically changed the game of speech recognition. “The speech community generally reports relative error reduction because they have been refining the same basic approach for years, and sometimes it is very hard to get much improvement if you don’t throw away the highly refined system. Because a lot of people in the industry are not doing as much open-ended longer term research, they stuck with the Gaussian Mixture Model, Hidden Markov Model, N-gram language models.”
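
To make the difference between relative and absolute error reduction concrete, here is a minimal Python sketch. The function name and the word error rates are made up for illustration; they are not figures from the episode.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error that the new system removes."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical word error rates in percent, chosen only for illustration.
baseline_wer = 20.0
new_wer = 16.0

absolute = baseline_wer - new_wer
relative = relative_error_reduction(baseline_wer, new_wer)
print(f"absolute reduction: {absolute:.1f} points")
print(f"relative reduction: {relative:.0%}")
```

With these made-up numbers, a drop of 4 absolute points is a 20% relative reduction. The same 4 points on a stronger baseline would be a much larger relative gain, which is why a heavily refined system makes big relative improvements hard to come by.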


The basic recipe they used was to train a deep neural net. It replaced the Gaussian Mixture Model and essentially learned the acoustics associated with the elementary speech sounds, the phones. The idea of a neural net replacing the GMM was developed in the 90s. “We try to make that much deeper and still very wide.”
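
As a rough sketch of this recipe, the snippet below runs a deep, wide feed-forward net that maps a window of acoustic frames to probabilities over HMM states, the job the GMM used to do. Everything specific here is an assumption for illustration and not taken from the episode: the layer sizes, the number of states, the sigmoid units, the random (untrained) weights, and the final step of dividing by state priors, which is the usual way such a net is plugged into an HMM decoder.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    # Subtract the row max for numerical stability.
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def init_layers(sizes, rng):
    """Random weights for a fully connected net; the sizes are hypothetical."""
    return [
        (rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def state_posteriors(frames, layers):
    """Forward pass: sigmoid hidden layers, then a softmax over HMM states."""
    activations = frames
    for weights, bias in layers[:-1]:
        activations = sigmoid(activations @ weights + bias)
    weights, bias = layers[-1]
    return softmax(activations @ weights + bias)


rng = np.random.default_rng(0)
# Assumed shapes: 11 stacked frames of 39 features in,
# four wide hidden layers, 183 HMM states out.
sizes = [11 * 39, 1024, 1024, 1024, 1024, 183]
layers = init_layers(sizes, rng)

frames = rng.normal(size=(5, sizes[0]))        # 5 dummy input windows
posteriors = state_posteriors(frames, layers)  # shape (5, 183)

# Assumed hybrid step: divide posteriors by state priors to get
# scaled likelihoods that an HMM decoder can consume.
state_priors = np.full(sizes[-1], 1.0 / sizes[-1])  # uniform placeholder priors
scaled_log_likelihoods = np.log(posteriors) - np.log(state_priors)
print(posteriors.shape, scaled_log_likelihoods.shape)
```

The point of the sketch is only the shape of the idea: the HMM stays in place, and the net supplies the per-frame acoustic scores.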

The first layer of their model is a Gaussian-Bernoulli Restricted Boltzmann Machine, which is quite trainable. “Pre-training was helpful for our results and is still helpful on some speech text. You can do other things, known as supervised pre-training or just use large amounts of data. Our dataset was not that much.”
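
For readers who have not seen a Gaussian-Bernoulli RBM, here is a minimal sketch of one pre-training update. It assumes real-valued inputs normalized to unit variance, binary hidden units, and one step of contrastive divergence (CD-1), which is the usual way RBMs are trained but is not described in the episode. The function name, sizes, learning rate, and random data are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_update(v0, W, vis_bias, hid_bias, rng, lr=0.001):
    """One CD-1 step for a Gaussian-Bernoulli RBM with unit-variance visibles."""
    # Positive phase: hidden probabilities given the data, then a binary sample.
    h0_prob = sigmoid(v0 @ W + hid_bias)
    h0_sample = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)

    # Negative phase: Gaussian visibles are reconstructed by their mean,
    # which is linear in the sampled hidden units.
    v1 = h0_sample @ W.T + vis_bias
    h1_prob = sigmoid(v1 @ W + hid_bias)

    # Approximate gradient: data statistics minus reconstruction statistics.
    n = v0.shape[0]
    W = W + lr * (v0.T @ h0_prob - v1.T @ h1_prob) / n
    vis_bias = vis_bias + lr * (v0 - v1).mean(axis=0)
    hid_bias = hid_bias + lr * (h0_prob - h1_prob).mean(axis=0)
    recon_error = float(np.mean((v0 - v1) ** 2))
    return W, vis_bias, hid_bias, recon_error


rng = np.random.default_rng(0)
n_visible, n_hidden = 39, 64                   # hypothetical sizes
data = rng.normal(size=(128, n_visible))       # stand-in for acoustic features
data = (data - data.mean(axis=0)) / data.std(axis=0)

W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
vis_bias = np.zeros(n_visible)
hid_bias = np.zeros(n_hidden)

for step in range(10):
    W, vis_bias, hid_bias, err = cd1_update(data, W, vis_bias, hid_bias, rng)
print(f"reconstruction error after 10 steps: {err:.3f}")
```

The Gaussian visible units are what let the first layer take real-valued acoustic features directly, while the layers stacked on top can use ordinary binary units.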

Present & Future

Are people still publishing results on Gaussian Mixture Model?

In answering this question, George explained that, because of the way speech recognition works, it is almost impossible to drop steps from the pipeline. In fact, the first step is still to train a Gaussian Mixture Model, which is used to create the training data for the neural net. “The people, who are working in core recognition technology, I think, are all using neural nets now. Some of them are using neural language models as well. That’s sort of parallel development. The next step will be recurrent neural network to replace HMM, which eventually will spread to everyone.”


George is always searching for great datasets. So when Merck released “an unusually interesting dataset and a difficult problem”, George led a team in the Kaggle challenge and eventually won. The dataset is quite small compared to the datasets that deep neural networks are usually applied to. With only a few thousand training cases, a model with a hundred million tunable parameters is not going to work. “Maybe it is time to test the limits of our methods in the other direction. How small can we go? How scarce can the data be while we can still use the powerful model?”, George said.
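
To get a feel for the mismatch between model size and data size, the following sketch counts the tunable parameters of a fully connected net and compares them with the number of training cases. The layer sizes and the number of cases are hypothetical, picked only to land near the "hundred million parameters, a few thousand cases" regime described above.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected net with the given layer sizes."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


# Hypothetical architecture: 4000 inputs, four hidden layers of 5000 units, 1 output.
layer_sizes = [4000, 5000, 5000, 5000, 5000, 1]
n_training_cases = 5000  # hypothetical, "a few thousand"

n_params = count_parameters(layer_sizes)
print(f"parameters: {n_params:,}")
print(f"parameters per training case: {n_params / n_training_cases:,.0f}")
```

With tens of thousands of parameters per training case, a net like this could memorize the training set, so shrinking the model or regularizing it heavily becomes the real question.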

One thing the two sets of Talking Machines interviews I wrote about have in common is that all these researchers mentioned “in order to do something new, you have to challenge the way things are being done.” It apparently won’t be easy. Just as George said, “you need all the researchers to see the result, and the improvement needs to be big enough to get the community’s attention.”