How to Implement Cross-Lingual Transfer Learning with mBERT in Hugging Face Transformers

Let's learn how to use mBERT from Hugging Face Transformers for cross-lingual transfer learning.

 

Preparation

 
This tutorial requires the Transformers and Datasets packages, which you can install with the command below.

pip install transformers datasets

 

You will also need PyTorch. Install the build that matches your environment, whether CPU-only or CUDA-enabled.
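For example, the default installation command is below; for GPU support, use the install selector on the PyTorch website to get the command that matches your CUDA version.

pip install torch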

With the packages installed, let's get into the main part of the tutorial.

 

Cross-Lingual Transfer Learning with mBERT

 
You may already know BERT, one of the first language models built for understanding human language, which has been used in many language-related tasks. mBERT (multilingual BERT) is a BERT variant pretrained on text from 104 different languages. This multilingual pretraining means mBERT can be fine-tuned on one language and still perform the same task in another.

In this tutorial, we will explore mBERT's cross-lingual capabilities by fine-tuning it on English data and then applying it to a classification task in French.

First, we will download the English dataset and preprocess it.

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict
import torch

# Load the English subset of the XNLI dataset
dataset = load_dataset('xnli', 'en')
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')


def tokenize_function(examples):
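    # Fields are plain strings in single-language XNLI configs; join defensively if lists appear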
    premise = [ex if isinstance(ex, str) else " ".join(ex) for ex in examples['premise']]
    hypothesis = [ex if isinstance(ex, str) else " ".join(ex) for ex in examples['hypothesis']]
   
    return tokenizer(premise, hypothesis, padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
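
As an optional sanity check, you can decode one tokenized example back to text to confirm that the premise and hypothesis were paired into a single input:

# Optional: inspect the start of the first training example
sample = tokenized_datasets['train'][0]
print(tokenizer.decode(sample['input_ids'][:40]))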

 

For the sake of a quick training process, we will use only a small subset of the dataset.

import random

random.seed(42)

train_indices = random.sample(range(len(tokenized_datasets['train'])), 1000)
val_indices = random.sample(range(len(tokenized_datasets['validation'])), 500)

train_dataset = tokenized_datasets['train'].select(train_indices)
val_dataset = tokenized_datasets['validation'].select(val_indices)
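
Equivalently, the datasets library can shuffle and slice in one step; this sketch produces a random subset as well, though the exact rows will differ from the random.sample indices above:

# Alternative subsetting using the datasets API directly
train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(1000))
val_dataset = tokenized_datasets['validation'].shuffle(seed=42).select(range(500))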

 

Then, we will load the pretrained mBERT model with a classification head.

model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=3)
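
XNLI is a three-way natural language inference task, which is why we set num_labels=3. You can confirm the label names from the dataset features:

# XNLI labels: 0 = entailment, 1 = neutral, 2 = contradiction
print(dataset['train'].features['label'].names)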

 

Once the model is ready, we will fine-tune mBERT on the English dataset.

training_args = TrainingArguments(
    output_dir='./results',
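    # note: this argument is renamed to eval_strategy in newer transformers releases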
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
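    # fp16 mixed precision assumes a CUDA GPU; remove this line when training on CPU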
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
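
Note that without a compute_metrics function, the Trainer reports only the evaluation loss. If you also want accuracy, a minimal sketch looks like this; pass it to the Trainer above via the compute_metrics argument:

import numpy as np

# Convert raw logits and gold labels into an accuracy score
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}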

 

With the fine-tuned model ready, we will evaluate it on the French validation set instead of the English one.

french_dataset = load_dataset('xnli', 'fr')

tokenized_french_dataset = french_dataset.map(tokenize_function, batched=True)
tokenized_french_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

french_val_dataset = tokenized_french_dataset['validation']

results = trainer.evaluate(french_val_dataset)
print(results)

 

Output>>
{'eval_loss': 1.0408061742782593, 'eval_runtime': 9.4173, 'eval_samples_per_second': 264.406, 'eval_steps_per_second': 16.565, 'epoch': 3.0}

 

The result seems promising: the model generalizes reasonably well to a language it was never fine-tuned on.
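
To see the transfer in action on a single pair, you can run the fine-tuned model on a French premise and hypothesis directly; the sentences below are just illustrative examples:

import torch

# Illustrative French NLI pair (hypothetical example sentences)
premise = "Le chat dort sur le canapé."
hypothesis = "Un animal se repose."

inputs = tokenizer(premise, hypothesis, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits

# XNLI label order: 0 = entailment, 1 = neutral, 2 = contradiction
labels = ['entailment', 'neutral', 'contradiction']
print(labels[logits.argmax(dim=-1).item()])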

Master the mBERT model to handle tasks involving multiple languages.

 


Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.