How to Incorporate Tabular Data with HuggingFace Transformers

In real-world scenarios, data often includes both text and tabular features. Building on recent advances in transformers, handling both kinds of data effectively can improve the performance of your models.

By Ken Gu, Applied Research Scientist Intern at Georgian.

Transformer-based models are a game-changer when it comes to using unstructured text data. As of September 2020, the top-performing models on the General Language Understanding Evaluation (GLUE) benchmark are all BERT-style transformer models. At Georgian, we often work with tabular feature information that accompanies unstructured text data. We found that by using the tabular data in our models, we could further improve performance, so we set out to build a toolkit that makes it easier for others to do the same.

The 9 tasks that are part of the GLUE benchmark.


Building on Top of Transformers


The main benefits of transformers are that they can learn long-range dependencies between tokens in a text and can be trained in parallel (as opposed to recurrent sequence-to-sequence models, which process tokens one at a time), meaning they can be pre-trained on large amounts of data.

Given these advantages, BERT is now a staple model in many real-world applications. Likewise, with libraries such as HuggingFace Transformers, it’s easy to build high-performance transformer models on common NLP problems.

Transformer models using unstructured text data are well understood. However, in the real world, text data is often supported by rich structured data or other unstructured data like audio or visual information. Each of these can provide signals that the others alone would not. We call these different ways of experiencing data (audio, visual, or text) modalities.

Think about e-commerce reviews as an example. In addition to the review text itself, we also have information about the seller, buyer, and product available as numerical and categorical features.

We set out to explore how we could use text and tabular data together to provide stronger signals in our projects. We started by exploring the field known as multimodal learning, which focuses on how to process different modalities in machine learning.


Multimodal Literature Review


The current models for multimodal learning mainly focus on learning from the sensory modalities such as audio, visual, and text.

Within multimodal learning, there are several branches of research. The MultiComp Lab at Carnegie Mellon University provides an excellent taxonomy. Our problem falls under what is known as Multimodal Fusion — joining information from two or more modalities to make a prediction.

As text data is our primary modality, our review focuses on the literature that treats text as the main modality and introduces models that leverage the transformer architecture.

Trivial Solution to Structured Data

Before we dive into the literature, it’s worth mentioning a simple solution: treat the structured data as regular text and append it to the standard text inputs. Taking the e-commerce reviews example, the input can be structured as follows: Review. Buyer Info. Seller Info. Numbers/Labels. Etc. One caveat with this approach, however, is that it is limited by the maximum token length that a transformer can handle.
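As a rough illustration of this append-as-text approach, the sketch below serializes a review and a few tabular fields into one string. The helper names, the " [SEP] " separator, and the whitespace-based truncation are assumptions made for the example; a real pipeline would rely on the tokenizer's own separator token and truncation.

```python
def row_to_text(review, fields, sep=" [SEP] "):
    """Append tabular fields to the review text as 'name: value' snippets."""
    parts = [review.strip()]
    for name, value in fields.items():
        parts.append(f"{name}: {value}")
    return sep.join(parts)


def truncate_tokens(text, max_tokens):
    """Crude whitespace truncation standing in for a tokenizer's length limit."""
    return " ".join(text.split()[:max_tokens])


# Made-up values for illustration only.
example = row_to_text(
    "Love this dress, fits perfectly.",
    {"Department Name": "Dresses", "Rating": 5, "Positive Feedback Count": 12},
)
print(example)
print(truncate_tokens(example, max_tokens=8))
```

The truncation step shows the caveat in practice: once the combined string exceeds the length limit, whatever comes last (here, the tabular fields) is the first thing to be cut.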


Transformer on Images and Text


In the last couple of years, transformer extensions for image and text have advanced considerably. Supervised Multimodal Bitransformers for Classifying Images and Text by Kiela et al. (2019) uses pre-trained ResNet and pre-trained BERT features on unimodal images and text, respectively, and feeds these into a bidirectional transformer. The key innovation is adapting the image features as additional tokens to the transformer model.

An illustration of the multimodal transformer. This model takes the output of ResNet on subregions of the image as input image tokens.
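The idea of treating image features as extra tokens can be sketched in a few lines. This is a minimal, framework-free illustration rather than the paper's implementation: the projection weights are hypothetical stand-ins for learned parameters, the vectors are toy-sized, and the order in which text and image tokens are concatenated is an arbitrary choice here.

```python
def project(vector, weights):
    """Linear map: weights has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def build_multimodal_sequence(text_embeddings, image_features, weights):
    """Project each image-region feature into the text embedding space and
    append it to the text tokens so one transformer can attend over both."""
    image_tokens = [project(feat, weights) for feat in image_features]
    return text_embeddings + image_tokens


# Toy sizes: 2 text tokens of dimension 2, 2 image regions of dimension 3.
text_embeddings = [[0.1, 0.2], [0.3, 0.4]]
image_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # hypothetical learned 3 -> 2 projection

sequence = build_multimodal_sequence(text_embeddings, image_features, weights)
print(len(sequence))  # 4 tokens: 2 text + 2 image
```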

Additionally, there are models, such as ViLBERT (Lu et al. 2019) and VL-BERT (Su et al. 2020), that define pretraining tasks for images and text. Both models pre-train on the Conceptual Captions dataset, which contains roughly 3.3 million image-caption pairs (web images with captions from alt text). In both cases, for any given image, a pre-trained object detection model like Faster R-CNN obtains vector representations for regions of the image, which serve as input token embeddings to the transformer model.

The VL-BERT model diagram. It takes image regions output by Faster R-CNN as input image tokens.

As an example, ViLBERT pre-trains on the following training objectives:

  1. Masked multimodal modeling: Mask input image and word tokens. For the image, the model tries to predict a vector capturing image features for the corresponding image region, while for text, it predicts the masked text based on the textual and visual clues.
  2. Multimodal alignment: Predict whether the image and text actually come from the same image-caption pair.

The two pre-training tasks for ViLBERT.

An example of masked multimodal modeling. Given the image and text, if we mask out "dog", then the model should be able to use the unmasked visual information to correctly predict that the masked word is "dog".
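To make the text side of the masking objective concrete, here is a small sketch of how a masked training example might be built. The function name and the fixed mask position are assumptions for illustration; in actual pre-training the positions are sampled at random, and image regions are masked in a similar way.

```python
MASK = "[MASK]"


def mask_tokens(tokens, positions):
    """Replace the tokens at `positions` with [MASK] and return the masked
    sequence plus the labels the model must recover at those positions."""
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


caption = ["a", "dog", "runs", "on", "the", "beach"]
masked_caption, labels = mask_tokens(caption, positions=[1])
print(masked_caption)
print(labels)
```

The model sees the masked caption together with the image tokens and is trained to predict the entries in labels.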

All these models use the bidirectional transformer model that is the backbone of BERT. The differences are the pre-training tasks the models are trained on and slight additions to the transformer. In the case of ViLBERT, the authors also introduce a co-attention transformer layer (shown below) to define the attention mechanism between the modalities explicitly.

The standard transformer block vs. the co-attention transformer block. The co-attention block injects attention-weighted vectors of another modality (linguistic, for example) into the hidden representations of the current modality (visual).
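The core of co-attention can be sketched as scaled dot-product attention in which the queries come from one modality and the keys and values come from the other. The version below is a simplification under stated assumptions: it omits the learned query, key, and value projections, multiple heads, and residual connections, and it uses toy vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values come from the other."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs


visual = [[1.0, 0.0], [0.0, 1.0]]                  # two image-region vectors
linguistic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three word vectors

# The visual stream attends over language: each output row is a weighted
# average of the linguistic vectors.
visual_given_text = cross_attention(visual, linguistic, linguistic)
print(visual_given_text)
```

Swapping the arguments (linguistic queries, visual keys and values) gives the other direction of the co-attention block.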

Finally, there’s also LXMERT (Tan and Bansal 2019), another pre-trained transformer model that, as of Transformers version 3.1.0, is implemented as part of the library. The input to LXMERT is the same as for ViLBERT and VL-BERT. However, LXMERT pre-trains on aggregated datasets, which also include visual question answering datasets. In total, LXMERT pre-trains on 9.18 million image-text pairs.


Transformers on Aligning Audio, Visual, and Text


Beyond transformers for combining image and text, there are multimodal models for audio, video, and text modalities in which there is a natural ground truth temporal alignment. Papers for this approach include MulT, Multimodal Transformer for Unaligned Multimodal Language Sequences (Tsai et al. 2019), and the Multimodal Adaptation Gate (MAG) from Integrating Multimodal Information in Large Pretrained Transformers (Rahman et al. 2020).

MulT is similar to ViLBERT in that co-attention is used between pairs of modalities. MAG, meanwhile, injects other modality information at certain transformer layers via a gating mechanism.
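A gating mechanism of this kind can be sketched as follows. This is a simplified illustration in the spirit of MAG, not the paper's exact formulation: the weights are hypothetical, bias terms are dropped, and a fixed alpha replaces the norm-based scaling used in the paper.

```python
def gated_shift(text_vec, other_vec, gate_weights, proj_weights, alpha=0.5):
    """Shift a text representation by gated information from another modality.

    gate = relu(W_g [text; other]) decides, per dimension, how much of the
    projected other-modality vector is added to the text vector."""
    joint = text_vec + other_vec  # list concatenation
    gate = [max(0.0, sum(w * x for w, x in zip(row, joint))) for row in gate_weights]
    projected = [sum(w * x for w, x in zip(row, other_vec)) for row in proj_weights]
    return [t + alpha * g * p for t, g, p in zip(text_vec, gate, projected)]


text_vec = [1.0, 2.0]
other_vec = [0.5]                                   # e.g. an audio or visual feature
gate_weights = [[0.0, 0.0, 2.0], [0.0, 0.0, -1.0]]  # hypothetical learned gate weights
proj_weights = [[4.0], [4.0]]                       # hypothetical learned projection

shifted = gated_shift(text_vec, other_vec, gate_weights, proj_weights)
print(shifted)
```

In this toy case the gate is open for the first dimension and closed for the second, so only the first dimension of the text vector is shifted.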


Transformers with Text and Knowledge Graph Embeddings


Some works have also identified knowledge graphs as a vital piece of information in addition to text data. Enriching BERT with Knowledge Graph Embeddings for Document Classification (Ostendorff et al. 2019) uses features from the author entities in the Wikidata knowledge graph in addition to metadata features for book category classification. In this case, the model is a simple concatenation of these features and BERT output text features of the book title and description before some final classification layers.

The simple model architecture to incorporate knowledge graph embeddings and tabular metadata.
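The concatenation approach is the simplest of the fusion strategies surveyed here. The sketch below assumes a single linear layer as the classifier for brevity, with hypothetical weights and toy feature vectors, whereas the paper uses further classification layers on top of the concatenated features.

```python
def concat_and_classify(text_features, metadata_features, weights, bias):
    """Concatenate text and metadata features, then apply one linear layer
    that produces a score per class."""
    combined = text_features + metadata_features
    return [
        sum(w * x for w, x in zip(row, combined)) + b
        for row, b in zip(weights, bias)
    ]


text_features = [0.5, -1.0]   # stand-in for the BERT output of title and description
metadata_features = [2.0]     # stand-in for metadata or knowledge graph features
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # hypothetical learned weights, 2 classes
bias = [0.0, 0.5]

scores = concat_and_classify(text_features, metadata_features, weights, bias)
print(scores)
```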

On the other hand, ERNIE (Zhang et al. 2019) matches the tokens in the input text with entities in the knowledge graph. Using this matching, the model fuses the token and entity embeddings to produce entity-aware text embeddings and text-aware entity embeddings.


Key Takeaway


The main takeaway for adapting transformers to multimodal data is to ensure that there is an attention or weighting mechanism between the different modalities. These mechanisms can occur at different points of the transformer architecture: as encoded input embeddings, injected in the middle layers, or combined after the transformer encodes the text data.


Multimodal Transformers Toolkit


Using what we’ve learned from the literature review and the comprehensive HuggingFace library of state-of-the-art transformers, we’ve developed a toolkit. The multimodal-transformers package extends any HuggingFace transformer for tabular data. To see the code, documentation, and working examples, check out the project repo.

At a high level, the outputs of a transformer model on text data and tabular features containing categorical and numerical data are combined in a combining module. Since there is no alignment between the text and the tabular features in our data, we choose to combine the two after the transformer’s output. The combining module implements several methods for integrating the modalities, including attention and gating methods inspired by the literature survey. More details of these methods are available here.

High-level diagram of multimodal-transformers. The adaptation of transformers to incorporate tabular data is all contained in the combining module.




Let’s work through an example where we classify clothing review recommendations. We’ll use a simplified version of the example included in the Colab notebook. We will use the Women’s E-Commerce Clothing Reviews dataset from Kaggle, which contains about 23,000 customer reviews.

A sample of the clothing review dataset.

In this dataset, we have text data in the Title and Review Text columns. We also have categorical features from the Clothing ID, Division Name, Department Name, and Class Name columns and numerical features from the Rating and Positive Feedback Count.
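The column grouping described above can be written down explicitly. The sketch below uses the column names from the dataset; the split_row helper and the sample record are hypothetical, with made-up values rather than an actual row from the data.

```python
TEXT_COLS = ["Title", "Review Text"]
CATEGORICAL_COLS = ["Clothing ID", "Division Name", "Department Name", "Class Name"]
NUMERICAL_COLS = ["Rating", "Positive Feedback Count"]


def split_row(row):
    """Split one record into its text, categorical, and numerical parts."""
    text = " ".join(str(row[c]) for c in TEXT_COLS)
    categorical = [row[c] for c in CATEGORICAL_COLS]
    numerical = [float(row[c]) for c in NUMERICAL_COLS]
    return text, categorical, numerical


# Made-up record for illustration only.
row = {
    "Title": "Great fit",
    "Review Text": "Runs true to size.",
    "Clothing ID": 1080,
    "Division Name": "General",
    "Department Name": "Dresses",
    "Class Name": "Dresses",
    "Rating": 5,
    "Positive Feedback Count": 3,
}

text, categorical, numerical = split_row(row)
print(text)
print(categorical)
print(numerical)
```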


Loading the Dataset


We first load our data into a TorchTabularTextDataset, which works with PyTorch’s data loaders. Each item includes the text inputs for HuggingFace Transformers along with the categorical and numerical feature columns we specify. For this, we also need to load our HuggingFace tokenizer.
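Before tabular features can be batched alongside tokenized text, categorical columns need a numeric encoding. The sketch below shows one common choice, one-hot encoding, in plain Python. It is an illustration of the idea only, not the toolkit's preprocessing code; the helper names and the sample rows are assumptions.

```python
def fit_categories(rows, column):
    """Collect the sorted distinct values of a categorical column."""
    return sorted({row[column] for row in rows})


def one_hot(value, categories):
    """Encode a value as a 0/1 vector over the known categories."""
    return [1.0 if value == c else 0.0 for c in categories]


# Made-up rows for illustration only.
rows = [
    {"Department Name": "Dresses", "Rating": 5},
    {"Department Name": "Tops", "Rating": 3},
    {"Department Name": "Dresses", "Rating": 4},
]

categories = fit_categories(rows, "Department Name")

# One tabular feature vector per row: one-hot category followed by the rating.
features = [
    one_hot(r["Department Name"], categories) + [float(r["Rating"])]
    for r in rows
]
print(categories)
print(features)
```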


Loading Transformer with Tabular Model


Now we load our transformer with a tabular model. First, we specify our tabular configurations in a TabularConfig object. This config is then set as the tabular_config member variable of a HuggingFace transformer config object. Here, we also specify how we want to combine the tabular features with the text features. In this example, we will use a weighted sum method.
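To give a feel for what a weighted sum combination does, here is a framework-free sketch: tabular features are projected to the text feature dimension, scaled per dimension by a learnable weight, and added to the text features. This is an illustration under stated assumptions, not the toolkit's implementation; the function name, the weights, and the toy vectors are hypothetical.

```python
def weighted_sum_combine(text_feats, tabular_feats, proj, feature_weights):
    """Project tabular features to the text dimension, scale each dimension
    by a learnable weight, and add the result to the text features."""
    projected = [sum(w * x for w, x in zip(row, tabular_feats)) for row in proj]
    return [t + fw * p for t, fw, p in zip(text_feats, feature_weights, projected)]


text_feats = [1.0, 2.0]                    # stand-in for the transformer's text output
tabular_feats = [1.0, 0.0, 3.0]            # e.g. one-hot category plus a numerical value
proj = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical learned 3 -> 2 projection
feature_weights = [0.5, 2.0]               # hypothetical learned per-dimension weights

combined = weighted_sum_combine(text_feats, tabular_feats, proj, feature_weights)
print(combined)
```

The combined vector then feeds the final classification layers in place of the text-only representation.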

Once we have the tabular_config set, we can load the model using the same API as HuggingFace. See the documentation for the list of currently supported transformer models that include the tabular combination module.




For training, we can use HuggingFace’s Trainer class. We also need to specify the training arguments, and in this case, we will use the defaults.

Let’s take a look at our models in training!

The TensorBoard logs from the above experiment. You can also check out this TensorBoard here.




Using this toolkit, we also ran experiments on the Women’s E-Commerce Clothing Reviews dataset for recommendation prediction and the Melbourne Airbnb Open Data dataset for price prediction. The former is a classification task, while the latter is a regression task. Our results are in the table below. The text_only combine method is a baseline that uses only the transformer and is essentially the same as a HuggingFace ForSequenceClassification model.

We can see that incorporating tabular features improves performance over the text_only method. The performance gains depend on how strong the training signals from the tabular data are. For example, in the review recommendation case, the text_only model is already a strong baseline.


Next Steps


We’ve already used the toolkit successfully in our projects. Feel free to try it out on your next machine learning project!

Check out the documentation and the included main script for how to do evaluation and inference. If you want support for your favorite transformer, feel free to add transformer support here.




Readers should check out The Illustrated Transformer and The Illustrated BERT for a well-summarized overview of transformers and BERT.

Below, you’ll find a quick taxonomy of papers we reviewed.

Transformer on Image and Text

  • Supervised Multimodal Bitransformers for Classifying Images and Text (Kiela et al. 2019)
  • ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (Lu et al. 2019)
  • VL-BERT: Pretraining of Generic Visual-Linguistic Representations (Su et al. ICLR 2020)
  • LXMERT: Learning Cross-Modality Encoder Representations from Transformers (Tan et al. EMNLP 2019)

Transformers on Aligning Audio, Visual, and Text

  • Multimodal Transformer for Unaligned Multimodal Language Sequences (Tsai et al. ACL 2019)
  • Integrating Multimodal Information in Large Pretrained Transformers (Rahman et al. ACL 2020)

Transformers with Knowledge Graph Embeddings

  • Enriching BERT with Knowledge Graph Embeddings for Document Classification (Ostendorff et al. 2019)
  • ERNIE: Enhanced Language Representation with Informative Entities (Zhang et al. 2019)


Original. Reposted with permission.


Bio: Ken Gu is an Applied Research Intern at Georgian, where he is working on various applied machine learning initiatives. He received his BS in Computer Science with a concentration in Mathematics from the University of California, Los Angeles. At UCLA, Ken worked on research projects in Graph Deep Learning with a focus on biomedical interaction networks.