3.8 Proceedings Paper

An Empirical Study of Training End-to-End Vision-and-Language Transformers

Related references

Note: Only a subset of the references is listed.
Article (Computer Science, Artificial Intelligence)

Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Emanuele Bugliarello et al.

Summary: The study surveys vision-and-language BERT models, categorizing them as single-stream or dual-stream encoders and showing that both families can be unified under a single theoretical framework. Controlled experiments indicate that performance differences between models stem mainly from training data and hyperparameters, and that the embedding layer plays a crucial role in these models.

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (2021)

Article (Computer Science, Artificial Intelligence)

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

Lisa Anne Hendricks et al.

Summary: The study finds that for multimodal transformers, the amount of noise in the pretraining dataset and the similarity of its language to that of downstream tasks are strong indicators of model performance, and that models with multimodal attention can outperform deeper models with modality-specific attention. It also shows that contrastive losses that succeed in the self-supervised learning literature do not necessarily yield comparable gains when applied to multimodal transformers.

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (2021)

Article (Computer Science, Artificial Intelligence)

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Ranjay Krishna et al.

INTERNATIONAL JOURNAL OF COMPUTER VISION (2017)

Proceedings Paper (Computer Science, Artificial Intelligence)

VQA: Visual Question Answering

Stanislaw Antol et al.

2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2015)