Journal
ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL
Pages 207-211
Publisher
ASSOC COMPUTING MACHINERY
DOI: 10.1145/3323873.3325044
Keywords
visual question answering; scene understanding; language understanding
Given a photograph, the task of Visual Question Answering (VQA) requires joint image and language understanding to answer a question. The task is challenging in two respects: effectively extracting the visual representation of the image, and efficiently embedding the textual question. To address these challenges, we propose a VQA model that uses stacked self-attention for visual understanding and a BERT-based question embedding model. In particular, the proposed stacked self-attention mechanism enables the model to attend not only to individual objects but also to the relations between objects. Furthermore, the BERT model is trained in an end-to-end manner to better embed the question sentences. Our model is validated on the well-known VQA v2.0 dataset and achieves state-of-the-art results.
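The core idea in the abstract can be sketched in a few lines: stacked self-attention layers over object-region features, fused with a question embedding to score candidate answers. This is a minimal illustrative sketch, not the paper's actual architecture; the region count (36), feature dimension, number of layers, the random stand-in for the BERT question vector, the multiplicative fusion, and the 10-answer classifier are all assumptions made for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats, Wq, Wk, Wv):
    """Scaled dot-product self-attention over region features (N, d)."""
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (N, N) relation weights
    return scores @ V  # each region aggregates information from all others

rng = np.random.default_rng(0)
N, d = 36, 64  # assumed: 36 detected regions, 64-dim features
feats = rng.normal(size=(N, d))

# Stacking: the second layer operates on relation-aware features,
# so attention can move beyond single objects to object relations.
x = feats
for _ in range(2):
    Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    x = self_attention(x, Wq, Wk, Wv)

# Stand-in for a BERT question embedding (in the paper, BERT is
# fine-tuned end-to-end; here a random vector keeps the sketch self-contained).
q = rng.normal(size=(d,))
fused = x.mean(axis=0) * q  # simple multiplicative fusion (illustrative)
logits = fused @ rng.normal(scale=0.1, size=(d, 10))  # 10 candidate answers
print(logits.shape)  # (10,)
```

The stacking step is the point: a single attention pass reweights individual object features, while a second pass attends over already relation-aware features, which is how the model captures interactions between objects rather than objects in isolation.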