Building an Automatic Speech Recognition System with PyTorch & Hugging Face

Check out this step-by-step guide to building a speech-to-text system with PyTorch & Hugging Face.




Automatic speech recognition (ASR) is a crucial technology in many applications, from voice assistants to transcription services. In this tutorial, we aim to build an ASR pipeline capable of transcribing speech into text using pre-trained models from Hugging Face. We will use a lightweight dataset for efficiency and employ Wav2Vec2, a powerful self-supervised model for speech recognition.

Our system will:

  1. Load and preprocess a speech dataset
  2. Fine-tune a pre-trained Wav2Vec2 model
  3. Evaluate the model’s performance using word error rate (WER)
  4. Deploy the model for real-time speech-to-text inference

To keep our model lightweight and efficient, we will use a small speech dataset rather than large datasets like Common Voice.

 

Step 1: Installing Dependencies

 
Before we start, we need to install the necessary libraries. These libraries will allow us to load datasets, process audio files, and fine-tune our model.

pip install torch torchaudio transformers datasets soundfile jiwer

 

Here is what each library is for:

  1. transformers: Provides pre-trained Wav2Vec2 models for speech recognition
  2. datasets: Loads and processes speech datasets
  3. torchaudio: Handles audio processing and manipulation
  4. soundfile: Reads and writes .wav files
  5. jiwer: Computes the WER for evaluating ASR performance
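
To confirm the installation, you can print the versions in use (the exact versions will vary with your environment):

import torch, torchaudio, transformers, datasets

# Print library versions so runs are easier to reproduce
print(torch.__version__, torchaudio.__version__, transformers.__version__, datasets.__version__)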

 

Step 2: Loading a Lightweight Speech Dataset

 
Instead of using large datasets like Common Voice, we use SUPERB KS, the small keyword spotting (KS) task from the SUPERB benchmark, which is ideal for quick experimentation. The dataset consists of short spoken commands like “yes,” “no,” and “stop.”

from datasets import load_dataset

dataset = load_dataset("superb", "ks", split="train[:1%]")  # Load only 1% of the data for quick testing
print(dataset)

 

This loads a tiny subset of the dataset to reduce computational cost while still allowing us to fine-tune the model. Warning: the dataset still requires storage space, so be mindful of disk usage when working with larger splits.
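
Note that SUPERB KS is a keyword spotting dataset: each example carries an integer class label rather than a free-text transcript. You can inspect the mapping from label ids to keywords like this:

# Print the class names behind the integer labels (e.g. "yes", "no", "stop", ...)
print(dataset.features["label"].names)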

 

Step 3: Preprocessing the Audio Data

 
To train our ASR model, we need to ensure that the audio data is in the correct format. The Wav2Vec2 model requires:

  1. A 16 kHz sample rate
  2. Raw, unpadded waveforms, with padding handled dynamically at batch time

We define a function to process the audio and extract relevant features.

import torchaudio

# Keyword labels as text, e.g. "yes", "no", "stop"
label_names = dataset.features["label"].names

def preprocess_audio(batch):
    speech_array, sampling_rate = torchaudio.load(batch["audio"]["path"])
    # Resample to the 16 kHz rate Wav2Vec2 expects (SUPERB KS is already 16 kHz)
    if sampling_rate != 16000:
        speech_array = torchaudio.functional.resample(speech_array, sampling_rate, 16000)
        sampling_rate = 16000
    batch["speech"] = speech_array.squeeze().numpy()
    batch["sampling_rate"] = sampling_rate
    # Map the integer class label to its keyword, uppercased to match the model's vocabulary
    batch["target_text"] = label_names[batch["label"]].upper()
    return batch

dataset = dataset.map(preprocess_audio)

 

This ensures all audio files are loaded correctly and formatted properly for further processing.
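
As a quick sanity check, you can confirm that a processed example has the expected sampling rate and target text:

sample = dataset[0]
print(sample["sampling_rate"])  # should be 16000
print(sample["target_text"])    # one of the keywords, e.g. "YES"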

 

Step 4: Loading a Pre-trained Wav2Vec2 Model

 
We use a pre-trained Wav2Vec2 model from Hugging Face’s model hub. This model has already been trained on a large dataset and can be fine-tuned for our specific task.

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

 

Here we load both the processor, which converts raw audio into model-friendly features, and the model itself: a Wav2Vec2 checkpoint pre-trained on 960 hours of LibriSpeech audio.
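
When fine-tuning on very little data, it is common to freeze the convolutional feature encoder and update only the transformer layers. In recent transformers versions this is a one-liner (older versions call it freeze_feature_extractor):

# Freeze the CNN feature encoder so only the transformer layers are fine-tuned
model.freeze_feature_encoder()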

 

Step 5: Preparing Data for the Model

 
We must convert the audio into the numerical input values the model expects, and encode the target text into the token ids used by the CTC loss.

def preprocess_for_model(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    batch["input_values"] = inputs.input_values[0]
    # Encode the target text into token ids for the CTC loss
    batch["labels"] = processor.tokenizer(batch["target_text"]).input_ids
    return batch

dataset = dataset.map(preprocess_for_model, remove_columns=["speech", "sampling_rate", "audio"])

 

This step ensures that our dataset is compatible with the Wav2Vec2 model.
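
One piece is still missing for training: the Trainer's default collator cannot pad input_values and labels to different lengths. Below is a minimal CTC data collator, adapted from the standard Hugging Face fine-tuning recipe (the class name is our own):

import torch
from dataclasses import dataclass

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor

    def __call__(self, features):
        # Pad audio inputs and text labels separately, since they have different lengths
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")
        # Replace padding token ids with -100 so the CTC loss ignores them
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].ne(1), -100)
        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor)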

 

Step 6: Defining Training Arguments

 
Before training, we need to set up our training configuration. This includes batch size, learning rate, and optimization steps. Since we do not hold out a validation split, we leave periodic evaluation disabled.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size of 4 x 2 = 8
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=1e-4,
    warmup_steps=500,
    max_steps=4000,
    save_total_limit=2,
    fp16=True,  # mixed precision; requires a CUDA GPU, set to False on CPU
    push_to_hub=False,
)

 

Step 7: Training the Model

 
Using Hugging Face’s Trainer, we fine-tune our Wav2Vec2 model.

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,  # pads inputs and labels separately (defined in Step 5)
    tokenizer=processor,
)

trainer.train()
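
Once training finishes, it is worth saving both the fine-tuned weights and the processor so they can be reloaded later (the output path here is arbitrary):

# Save the fine-tuned model and its processor for later inference
trainer.save_model("./wav2vec2-finetuned")
processor.save_pretrained("./wav2vec2-finetuned")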

 

Step 8: Evaluating the Model

 
To measure how well our model transcribes speech, we compute the WER.

import torch
from jiwer import wer

model.eval()  # disable dropout for evaluation

def transcribe(batch):
    # input_values were already extracted in Step 5; just add a batch dimension
    input_values = torch.tensor(batch["input_values"]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    batch["predicted_text"] = processor.batch_decode(predicted_ids)[0]
    return batch

results = dataset.map(transcribe)
wer_score = wer(results["target_text"], results["predicted_text"])
print(f"Word Error Rate: {wer_score:.2f}")

 

WER is the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to turn the prediction into the reference, divided by the number of words in the reference (N): WER = (S + D + I) / N. A lower WER score indicates better performance.
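
A toy example of how jiwer scores a prediction (the strings are made up for illustration):

from jiwer import wer

# One substitution ("go" for "stop") out of three reference words: WER = 1/3 ≈ 0.33
print(wer("yes no stop", "yes no go"))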

 

Step 9: Running Inference on New Audio

 
Finally, we can use our trained model to transcribe real-world speech.

import torch
import torchaudio

# Load a real-world recording and resample it to 16 kHz if necessary
speech_array, sampling_rate = torchaudio.load("example.wav")
if sampling_rate != 16000:
    speech_array = torchaudio.functional.resample(speech_array, sampling_rate, 16000)

inputs = processor(speech_array.squeeze().numpy(), sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
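
For quick deployment, the transformers pipeline API wraps preprocessing, inference, and decoding into a single call; a minimal sketch, assuming the model was saved to ./wav2vec2-finetuned as in Step 7 (decoding audio files through the pipeline requires ffmpeg):

from transformers import pipeline

# One-line speech-to-text over the saved checkpoint
asr = pipeline("automatic-speech-recognition", model="./wav2vec2-finetuned")
print(asr("example.wav"))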

 

Conclusion

 
And that's it. You’ve successfully built an ASR system using PyTorch & Hugging Face with a lightweight dataset.
 
 

Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and is currently working in the data science field applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering applications of the ongoing explosion in the field.

