How to Train a Speech Recognition Model with Wav2Vec 2.0 and Hugging Face Transformers
Let's learn how to train a speech recognition model with Wav2Vec 2.0 and Hugging Face Transformers.

Preparation
This tutorial requires the following packages, so install them with the code below:
pip install transformers datasets soundfile
Additionally, install the PyTorch package, selecting the build that suits your environment.
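For example, a plain pip install looks like the line below; for a specific CUDA or CPU-only build, follow the selector on the official PyTorch installation page instead.
pip install torch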
With the packages installed, let's get into the next part.
Train Speech Recognition Model with Wav2Vec 2.0
Speech recognition is a machine learning task that translates spoken audio into text. In other words, a speech recognition model can transcribe speech into a document. It’s a capability that is increasingly popular in business, where demand for transcription keeps growing.
Wav2Vec 2.0 is a pre-trained speech model from Meta that can be fine-tuned for downstream tasks. It’s a popular model for audio data and the one we will use in this tutorial.
To begin, we will use the open-source Common Voice dataset, which contains speech audio files along with their transcriptions (and other demographic data). Note that the dataset is gated on the Hugging Face Hub, so you may need to log in and accept its terms before downloading it.
from datasets import load_dataset

# Load 1% of the English Common Voice 11 training split to keep the run small
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train[:1%]", trust_remote_code=True)
Next, make sure the audio is resampled to the 16 kHz sampling rate that Wav2Vec 2.0 expects.
from datasets import Audio

# Resample every audio file to 16 kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def preprocess(batch):
    # Keep the raw waveform array and record its length
    batch["input_values"] = batch["audio"]["array"]
    batch["input_length"] = len(batch["input_values"])
    return batch

dataset = dataset.map(preprocess, remove_columns=["audio"])
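As a quick sanity check, we can peek at one example to confirm the waveform array and its transcription are in place:
sample = dataset[0]
print(sample["input_length"], "audio samples at 16 kHz")
print(sample["sentence"])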
Then, we download the Wav2Vec 2.0 model and its processor using Transformers.
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
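Before fine-tuning, we can optionally run the pre-trained checkpoint on one sample to see a baseline transcription. This is a minimal sketch that assumes the preprocessing above has already been applied.
import torch

sample = dataset[0]["input_values"]
inputs = processor(sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
# Greedy CTC decoding: take the most likely token at each timestep
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])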
With the model and processor ready, we preprocess the dataset and remove all the columns we don’t need.
def prepare_dataset(batch):
    # Turn the raw waveform into normalized model inputs
    batch["input_values"] = processor(batch["input_values"], sampling_rate=16_000).input_values[0]
    # Tokenize the transcription into label IDs
    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

dataset = dataset.map(prepare_dataset, remove_columns=["client_id", "path", "sentence", "up_votes", "down_votes", "age", "gender", "accent", "locale", "segment", "input_length"])
Don’t forget to split the dataset into training and test sets so we have data for evaluation. We call train_test_split once and reuse the result so the two splits stay consistent.
split = dataset.train_test_split(test_size=0.1)
train_dataset = split["train"]
eval_dataset = split["test"]
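A quick print confirms the sizes of the resulting splits:
print(len(train_dataset), len(eval_dataset))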
Additionally, we create a custom data collator class to pad the inputs and labels during training.
import torch
from torch.nn.utils.rnn import pad_sequence

class CustomDataCollatorCTCWithPadding:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, features):
        input_features = [torch.tensor(feature["input_values"]) for feature in features]
        label_features = [torch.tensor(feature["labels"]) for feature in features]
        # Pad the waveforms with the feature extractor's padding value
        input_features_padded = pad_sequence(input_features, batch_first=True, padding_value=self.processor.feature_extractor.padding_value)
        # Pad the labels with -100 so padded positions are ignored by the CTC loss
        labels_padded = pad_sequence(label_features, batch_first=True, padding_value=-100)
        # Mark every non-padding position with 1 in the attention mask
        attention_masks = torch.zeros_like(input_features_padded).masked_fill(input_features_padded != self.processor.feature_extractor.padding_value, 1)
        return {
            "input_values": input_features_padded,
            "labels": labels_padded,
            "attention_mask": attention_masks,
        }

data_collator = CustomDataCollatorCTCWithPadding(processor=processor)
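As an optional check, running the collator on two samples shows the padded batch shapes before training starts:
batch = data_collator([train_dataset[0], train_dataset[1]])
print(batch["input_values"].shape)    # (2, longest audio length in the batch)
print(batch["labels"].shape)          # (2, longest label length in the batch)
print(batch["attention_mask"].shape)  # same shape as input_values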
After that, we prepare the training configuration. To speed up training, we use half-precision (fp16), which requires a GPU.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2",
    group_by_length=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=1,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    warmup_steps=500,
    save_total_limit=2,
)
Lastly, we train the model.
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)

trainer.train()
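Once training finishes, you will likely want to save the fine-tuned weights together with the processor so they can be reloaded later. The directory name below is just an example.
# Save the model weights and configuration
trainer.save_model("./wav2vec2-finetuned")
# Save the feature extractor and tokenizer alongside the model
processor.save_pretrained("./wav2vec2-finetuned")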
If you want to evaluate your model, you can use the following code. Since we didn’t pass a compute_metrics function to the Trainer, this reports the evaluation loss.
metrics = trainer.evaluate()
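To get a proper speech recognition metric such as word error rate (WER), you can decode predictions for the evaluation set yourself. Below is a minimal sketch that assumes the evaluate and jiwer packages are installed (pip install evaluate jiwer); they are extra dependencies, not part of the requirements listed earlier.
import torch
import evaluate

wer_metric = evaluate.load("wer")
model.eval()

predictions, references = [], []
for example in eval_dataset:
    input_values = torch.tensor(example["input_values"]).unsqueeze(0).to(model.device)
    with torch.no_grad():
        logits = model(input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    predictions.append(processor.batch_decode(pred_ids)[0])
    # group_tokens=False keeps repeated characters intact in the reference text
    references.append(processor.decode(example["labels"], group_tokens=False))

print("WER:", wer_metric.compute(predictions=predictions, references=references))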
That’s all about training Wav2Vec 2.0 for speech recognition. To improve your skills, experiment with different hyperparameters, longer training runs, and larger slices of the dataset.
Additional Resources
- Speech to Text with Wav2Vec 2.0
- 7 AI Portfolio Projects to Boost the Resume
- The Evolution of Speech Recognition Metrics
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social and written media. Cornellius writes on a variety of AI and machine learning topics.