Journal
MULTIMEDIA TOOLS AND APPLICATIONS
Volume 78, Issue 3, Pages 3843-3858
Publisher
SPRINGER
DOI: 10.1007/s11042-018-6389-3
Keywords
Visual question answering; Word attention; Image attention; Word-to-region
Funding
- National Natural Science Foundation of China [61572108, 61632007]
- 111 Project [B17008]
Visual attention, which concentrates on the image regions relevant to a reference question, brings remarkable performance improvements in Visual Question Answering (VQA). Most VQA attention models employ the entire question representation to query relevant image regions. Nonetheless, only certain salient words of the question play an effective role in the attention operation. In this paper, we propose a novel Word-to-Region Attention Network (WRAN), which can 1) simultaneously locate pertinent object regions, rather than a uniform grid of equal-sized image regions, and identify the corresponding words of the reference question; and 2) enforce consistency between image object regions and the core semantics of the question. We evaluate the proposed model on the VQA v1.0 and VQA v2.0 datasets. Experimental results demonstrate the superiority of the proposed model compared to the state of the art.
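The general word-to-region idea the abstract describes can be sketched as follows. This is not the authors' WRAN formulation (the paper's architecture, features, and scoring functions are not given here); it is a minimal NumPy illustration, with hypothetical inputs, of scoring every (word, region) pair, weighting words by how strongly they match any region, and pooling region attention accordingly.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_to_region_attention(word_feats, region_feats):
    """Toy word-to-region attention (illustrative, not the paper's model).

    word_feats:   (T, d) features for T question words.
    region_feats: (K, d) features for K detected object regions.
    Both are assumed to live in a shared d-dimensional space.
    """
    # Affinity between every (word, region) pair.
    affinity = word_feats @ region_feats.T            # (T, K)
    # Word salience: a word matters if it matches some region strongly.
    word_attn = softmax(affinity.max(axis=1))         # (T,)
    # Per-word attention over regions.
    region_attn = softmax(affinity, axis=1)           # (T, K)
    # Pool region attention by word salience, then attend to regions.
    pooled_region_attn = word_attn @ region_attn      # (K,)
    attended = pooled_region_attn @ region_feats      # (d,)
    return attended, word_attn, pooled_region_attn
```

Because `pooled_region_attn` is a convex combination of rows that each sum to one, it is itself a distribution over regions, so the attended feature is a weighted average of region features guided by the salient words.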