Practical Speech Recognition with Python: The Basics

Do you fear implementing speech recognition in your Python apps? Read this tutorial for a simple, practical approach to speech recognition using open source Python libraries.

Have you ever wanted to try out a speech recognition project but found it all just too intimidating?

What about creating something a few steps beyond and a bit more complex, like a full audio chatbot or a voice assistant?

Putting together skeleton code for this type of project is actually quite straightforward, thanks to a few open source libraries we can lean on. With that in mind, let's have a look at how to start creating a basic toy speech recognition app with Python. Once we get the basics down, we can discuss ways to make it much more useful.


Our toy Python app will be pretty useless, to be honest. But it will introduce us to a few concepts which will be useful for building more complex things afterwards. If we build this toy properly, modifying it to do anything more should be relatively painless. At least, to an extent.

Here's exactly what our app will do when we're done: it will listen to what we say and parrot it back to us. That's it! The two useful takeaways will be building speech recognition and audio playback into our app.

First, let's import the few libraries that we need:

import speech_recognition as sr

from pydub import AudioSegment
from pydub.playback import play
from gtts import gTTS as tts


Here's the reasoning:

  • speech_recognition - "Library for performing speech recognition, with support for several engines and APIs, online and offline"
  • pydub - "Manipulate audio with a simple and easy high level interface"
  • gTTS - "Python library and CLI tool to interface with Google Translate's text-to-speech API"

The next thing to do — and likely most importantly for a speech recognition app — is to recognize speech. To do so, we'll need to first capture incoming audio from the microphone, and then perform the speech recognition. This is all handled via the speech_recognition library.

Here's a function to capture speech.

def capture():
    """Capture audio and return recognized text, or 0 on failure"""

    rec = sr.Recognizer()

    with sr.Microphone() as source:
        print("I'M LISTENING...")
        audio = rec.listen(source, phrase_time_limit=5)

    try:
        text = rec.recognize_google(audio, language='en-US')
        return text
    except sr.UnknownValueError:
        speak('Sorry, I could not understand what you said.')
        return 0


That's it. Speech captured and recognized. Still think this is intimidating?

Note that once this app starts running, it will listen in five-second intervals and process each interval one at a time. Practical? No, not really. But once we build something more complex, we can tweak this to, say, listen for an activation keyword and then capture our full utterance, regardless of length. However, this is a simple enough way to start.
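To preview what that activation-keyword tweak might look like, here's a minimal sketch that's independent of the audio plumbing. Everything here is an assumption of the sketch, not part of our app yet: the wake word itself is made up, and `recognize` stands in for any zero-argument callable that returns the next recognized phrase (in our app, a thin wrapper around capture()).

```python
WAKE_WORD = 'parrot'  # hypothetical activation keyword


def wait_for_wake_word(recognize):
    """Consume recognized phrases until one contains the wake word.

    recognize: a zero-argument callable returning the next recognized
    phrase as a string, or a falsy value when nothing was understood.
    """
    while True:
        phrase = recognize()
        # Skip failed recognitions; return the phrase that woke us up
        if phrase and WAKE_WORD in phrase.lower():
            return phrase
```

Once the wake word is heard, the app could then switch to an untimed listen (dropping phrase_time_limit) to capture everything the user says until they pause.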

So, what will we do after we capture speech? We'll process it. What exactly does this mean?

What kind of app you are building will largely determine what "process it" means. This time around, our processing is more or less a placeholder, standing in for more interesting things to come. So for now, our toy app will process captured speech by parroting it back to us (and printing it to the console, for good measure).

Here's a simple function for our processing.

def process_text(name, said):
    """Process what is said"""

    speak(name + ', you said: "' + said + '".')
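Eventually, this processing step is where real behavior would live. As a sketch of one direction it could go, captured text might be routed to different replies rather than always echoed; the trigger phrases and replies below are hypothetical, and the function returns its reply as a string so it could simply be handed to speak().

```python
def route_text(name, said):
    """Map captured text to a reply instead of always echoing it."""
    said = said.lower()
    # Hypothetical trigger phrases; naive substring matching is fine for a toy
    if 'hello' in said:
        return 'Hello to you too, ' + name + '!'
    if 'my name' in said:
        return 'Your name is ' + name + ', of course.'
    # Fall back to the parrot behavior
    return name + ', you said: "' + said + '".'
```

In the main loop further down, speak(route_text(name, captured_text)) would then take the place of the plain echo.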


We also want our app to speak, so let's write a function which uses the Google text-to-speech engine to accomplish this.

def speak(text):
    """Say something"""

    # Write output to console
    print(text)

    # Save audio file
    speech = tts(text=text, lang='en')
    speech_file = 'input.mp3'
    speech.save(speech_file)

    # Play audio file
    sound = AudioSegment.from_mp3(speech_file)
    play(sound)

First we print out what was passed to the function to the console; then Google text-to-speech is used to create an audio file from the text; the audio file is saved to disk; and then the file is re-opened and played using the pydub library.

That's the "difficult" stuff taken care of. Now we just need a few lines to drive the process.

if __name__ == "__main__":

    # First get name
    speak('What is your name?')
    name = capture()
    speak('Hello, ' + name + '.')

    # Then just keep listening & responding
    while True:
        speak('What do you have to say?')
        captured_text = capture()

        # Start over if nothing was recognized
        if captured_text == 0:
            continue

        captured_text = captured_text.lower()

        # Stop when the user says 'quit'
        if 'quit' in captured_text:
            speak('OK, bye, ' + name + '.')
            break

        # Process captured text
        process_text(name, captured_text)


Both the speak() and capture() functions are used to get the user's name when prompted, and then greet them. Then a while loop is entered which cycles between capturing speech input and performing some very elementary checks to ensure that something was captured and the user did not say 'quit' to exit. The captured text is passed to the process_text() function, which echoes what was said. This is then repeated ad infinitum.

I'll say it again: there isn't anything of much complexity going on here.

Save all of the above code to a single file.

Now let's check out a conversation with our minimalist speech recognition app. Run it with the following line and see the results below (while imagining I'm talking and having my words repeated back to me, of course).

  $ python <filename>.py

What is your name?
Hello, Matthew.
What do you have to say?
Matthew, you said: "where are you from".
What do you have to say?
Matthew, you said: "i'd like some pizza".
What do you have to say?
Matthew, you said: "what is the meaning of life".
What do you have to say?
OK, bye, Matthew.

Process finished with exit code 0

Pretty cool. Of course, it could be way cooler if it actually did something. So let's turn our attention to that next.

For next time, let's ease into something more complex like integrating spaCy into our code and trying some simple NLP tasks, such as spoken sentence classification, sentiment analysis, and named entity recognition.
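As a taste of where that could go, the shape of a sentiment step can be previewed with a trivial keyword-count placeholder. To be clear, this is not real NLP: the word lists are made up, and spaCy (or a proper trained model) would replace this entirely next time.

```python
# Hypothetical word lists; a real model would replace these
POSITIVE = {'great', 'good', 'love', 'nice', 'cool'}
NEGATIVE = {'bad', 'hate', 'terrible', 'awful', 'boring'}


def toy_sentiment(text):
    """Label captured text positive/negative/neutral by keyword counts."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'
```

The captured text from our loop could be fed straight into a function like this, with speak() reporting the label back to the user.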

We can then look at something more practically useful such as making a personal voice assistant, which will require some additional tweaks to our interface. But one thing at a time...