How to Train a Speech Recognition Model with Wav2Vec 2.0 and Hugging Face Transformers

Let's learn how to train a speech recognition model with Wav2Vec 2.0.



Image by Editor | Midjourney

 

Let’s learn how to train a speech recognition model with Wav2Vec 2.0 and Hugging Face Transformers.

 

Preparation

 
This tutorial requires the following packages, so install them with the following command:

pip install transformers datasets soundfile

 

Additionally, you should install the PyTorch package, selecting the build that suits your environment (CPU-only or a specific CUDA version).
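
For example, a CPU-only installation can be as simple as the command below; the exact command depends on your platform and CUDA version, so check the official PyTorch installation page for the build that matches your setup.

pip install torch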

With the packages installed, we can move on to the next part.
 

Train Speech Recognition Model with Wav2Vec 2.0

 
Speech recognition is a machine learning task that converts spoken audio into text. In other words, a speech recognition model can transcribe speech into a document. These models are increasingly popular in business, where demand for automatic transcription keeps growing.

Wav2Vec 2.0 is a pre-trained speech model from Meta that can be fine-tuned for downstream tasks. It’s a popular model for audio data, and it’s the one we will use in this tutorial.

To begin, we will use the open-source Common Voice dataset, which contains speech audio files along with their transcriptions (and other demographic data).

from datasets import load_dataset

dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train[:1%]", trust_remote_code=True)

 

Next, make sure the audio follows the format Wav2Vec 2.0 expects by resampling it to 16 kHz.

from datasets import Audio

dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def preprocess(batch):
    batch["input_values"] = batch["audio"]["array"]
    batch["input_length"] = len(batch["input_values"])
    return batch

dataset = dataset.map(preprocess, remove_columns=["audio"])
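
If you want to verify the preprocessing, an optional sanity check like the one below (not required for training) prints the length of one resampled audio array together with its transcription.

sample = dataset[0]
print(sample["input_length"], sample["sentence"])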

 

Then, we download the Wav2Vec 2.0 model and its processor using Transformers.

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

 

With the model and processor ready, we preprocess the audio and transcription text, removing all the original columns we don’t need so that only the model inputs and labels remain.

def prepare_dataset(batch):
    batch["input_values"] = processor(batch["input_values"], sampling_rate=16_000).input_values[0]
    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

 

Don’t forget to split the dataset into training and test sets afterwards so we have data for evaluation.

split = dataset.train_test_split(test_size=0.1)
train_dataset = split["train"]
eval_dataset = split["test"]

 

Additionally, we create a custom data collator class to pad the inputs and labels during training.

import torch
from torch.nn.utils.rnn import pad_sequence

class CustomDataCollatorCTCWithPadding:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, features):
        input_features = [torch.tensor(feature["input_values"]) for feature in features]
        label_features = [torch.tensor(feature["labels"]) for feature in features]

        # Pad audio inputs with the feature extractor's padding value (0.0)
        input_features_padded = pad_sequence(input_features, batch_first=True, padding_value=self.processor.feature_extractor.padding_value)
        # Pad labels with -100 so padded positions are ignored by the CTC loss
        labels_padded = pad_sequence(label_features, batch_first=True, padding_value=-100)
        # Attention mask: 1 for real audio samples, 0 for padded positions
        attention_masks = torch.zeros_like(input_features_padded).masked_fill(input_features_padded != self.processor.feature_extractor.padding_value, 1)

        return {
            "input_values": input_features_padded,
            "labels": labels_padded,
            "attention_mask": attention_masks
        }

data_collator = CustomDataCollatorCTCWithPadding(processor=processor)
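
To see what the collator produces, you can optionally try it on a couple of examples; the printed shapes confirm that the inputs and labels are padded to the longest item in the batch.

batch = data_collator([train_dataset[0], train_dataset[1]])
print(batch["input_values"].shape, batch["labels"].shape, batch["attention_mask"].shape)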

 

After that, we prepare the training configuration before training the speech recognition model. To speed up training, we use half-precision (fp16), which requires a GPU.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2",
    group_by_length=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=1,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    warmup_steps=500,
    save_total_limit=2,
)

 

Lastly, we train the model.

from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)

trainer.train()

 

If you want to evaluate your model, you can use the following code.

metrics = trainer.evaluate()
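
Note that without a compute_metrics function, trainer.evaluate() reports only the evaluation loss. If you also want word error rate (WER), the standard metric for speech recognition, the sketch below is one way to compute it on a single evaluation example; it assumes the evaluate and jiwer packages are installed (pip install evaluate jiwer), which are not part of this tutorial's setup.

import torch
from evaluate import load

wer_metric = load("wer")

model.eval()
sample = eval_dataset[0]
input_values = torch.tensor(sample["input_values"]).unsqueeze(0).to(model.device)

# Greedy CTC decoding of the model output
with torch.no_grad():
    logits = model(input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
prediction = processor.batch_decode(pred_ids)[0]

# Decode the reference labels without collapsing repeated characters
reference = processor.decode(sample["labels"], group_tokens=False)

print("WER:", wer_metric.compute(predictions=[prediction], references=[reference]))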

 

That’s all it takes to fine-tune Wav2Vec 2.0 for speech recognition. To build on this, try experimenting with larger data splits, more training epochs, and different hyperparameters.

 


Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.

