Article

Word-to-region attention network for visual question answering

Journal

MULTIMEDIA TOOLS AND APPLICATIONS
Volume 78, Issue 3, Pages 3843-3858

Publisher

SPRINGER
DOI: 10.1007/s11042-018-6389-3

Keywords

Visual question answering; Word attention; Image attention; Word-to-region

Funding

  1. National Natural Science Foundation of China [61572108, 61632007]
  2. 111 Project [B17008]

Abstract

Visual attention, which concentrates on the image regions relevant to a reference question, brings remarkable performance improvements in Visual Question Answering (VQA). Most VQA attention models employ the entire question representation to query relevant image regions. Nonetheless, only certain salient words of the question play an effective role in the attention operation. In this paper, we propose a novel Word-to-Region Attention Network (WRAN), which can 1) simultaneously locate pertinent object regions, rather than a uniform grid of equal-sized image regions, and identify the corresponding words of the reference question; and 2) enforce consistency between image object regions and the core semantics of the question. We evaluate the proposed model on the VQA v1.0 and VQA v2.0 datasets. Experimental results demonstrate the superiority of the proposed model over state-of-the-art methods.
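The abstract describes the mechanism only at a high level; the paper's exact formulation is not reproduced here. As a rough illustration, below is a minimal PyTorch sketch of one plausible word-to-region attention layer in the spirit described: each question word scores detector-proposed object regions, and a separate word-level attention downweights non-salient words before fusing. All names (WordToRegionAttention, hidden_dim, etc.) are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordToRegionAttention(nn.Module):
    """Hypothetical word-to-region attention sketch (not the authors' code).

    Each question word attends over detected object-region features, and a
    word-level attention keeps only salient words in the fused summary.
    """

    def __init__(self, word_dim, region_dim, hidden_dim):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, hidden_dim)
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.word_score = nn.Linear(hidden_dim, 1)

    def forward(self, words, regions):
        # words:   (B, T, word_dim)   question word embeddings
        # regions: (B, K, region_dim) object-region features (e.g. from a detector)
        w = self.word_proj(words)      # (B, T, H)
        r = self.region_proj(regions)  # (B, K, H)

        # Word-to-region affinity: every word scores every region.
        affinity = torch.bmm(w, r.transpose(1, 2))   # (B, T, K)
        region_attn = F.softmax(affinity, dim=2)     # per-word region weights
        attended = torch.bmm(region_attn, r)         # (B, T, H) region summary per word

        # Word attention: only salient words contribute to the final summary.
        word_attn = F.softmax(self.word_score(torch.tanh(w)), dim=1)  # (B, T, 1)
        fused = (word_attn * attended).sum(dim=1)    # (B, H)
        return fused, word_attn.squeeze(-1), region_attn
```

A quick usage check with shapes matching a common VQA setup (GloVe-sized word vectors, 36 detector regions) might look like:

```python
model = WordToRegionAttention(word_dim=300, region_dim=2048, hidden_dim=512)
words = torch.randn(2, 14, 300)     # a batch of 14-word questions
regions = torch.randn(2, 36, 2048)  # 36 object-region features per image
fused, word_attn, region_attn = model(words, regions)
```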
