Related references
Note: Only part of the references are listed.
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
Emanuele Bugliarello et al.
TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (2021)
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Lisa Anne Hendricks et al.
TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (2021)
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Ranjay Krishna et al.
INTERNATIONAL JOURNAL OF COMPUTER VISION (2017)
VQA: Visual Question Answering
Stanislaw Antol et al.
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2015)