3.8 Proceedings Paper

An Empirical Study of Training End-to-End Vision-and-Language Transformers

Related references

Note: Only a subset of the references is listed.
Article (Computer Science, Artificial Intelligence)

Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Emanuele Bugliarello et al.

Summary: The study surveys vision-and-language BERT models, categorizing them as single-stream or dual-stream encoders and showing that both families can be unified under a single theoretical framework. Controlled experiments indicate that performance differences between models stem mainly from training data and hyperparameters, and that the embedding layer plays a crucial role in these models.

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (2021)

Article (Computer Science, Artificial Intelligence)

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

Lisa Anne Hendricks et al.

Summary: The study finds that for multimodal transformers, the amount of noise in the pretraining dataset and the similarity of its language to that of downstream tasks are strong indicators of model performance, and that models with multimodal attention can outperform deeper models with modality-specific attention. It also shows that contrastive losses that succeed in the self-supervised learning literature do not necessarily yield comparable gains when applied to multimodal transformers.

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (2021)

Article (Computer Science, Artificial Intelligence)

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Ranjay Krishna et al.

INTERNATIONAL JOURNAL OF COMPUTER VISION (2017)

Proceedings Paper (Computer Science, Artificial Intelligence)

VQA: Visual Question Answering

Stanislaw Antol et al.

2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2015)