4.6 Article

Co-attention graph convolutional network for visual question answering

Journal

MULTIMEDIA SYSTEMS

Publisher

SPRINGER
DOI: 10.1007/s00530-023-01125-7

Keywords

Visual question answering; Binary relational reasoning; Spatial graph convolution; Attention mechanism


In this work, a combination of a graph convolutional network and a co-attention network is proposed to address the limitations of traditional visual attention models in reasoning about relationships and modeling multimodal interactions. The model uses binary relational reasoning as the graph learner module to capture relationships between visual objects and learns question-specific, spatially aware image representations. Experimental results show that the proposed model achieves an overall accuracy of 68.67% on the test-std set of the benchmark VQA v2.0 dataset, outperforming most existing models.
Visual Question Answering (VQA) is a challenging task that requires a fine-grained understanding of both the visual content of images and the textual content of questions. Conventional visual attention models, designed primarily from the perspective of the attention mechanism, lack the ability to reason about relationships between visual objects and ignore the multimodal interactions between questions and images. In this work, we propose a model that combines a graph convolutional network and a co-attention network to circumvent these problems. The model employs binary relational reasoning as the graph learner module to learn a graph structure representation that captures relationships between visual objects, and it learns an image representation related to the specific question, with awareness of spatial location, via spatial graph convolution. We then perform parallel co-attention learning by passing the image representations and the question word features through a deep co-attention module. Experimental results demonstrate that our model achieves an overall accuracy of 68.67% on the test-std set of the benchmark VQA v2.0 dataset, outperforming most existing models.
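The following minimal sketch illustrates, in PyTorch, how the three components described in the abstract could fit together: a question-conditioned graph learner based on pairwise (binary) relational scoring, a spatially aware graph convolution over the learned graph, and a parallel co-attention step between image nodes and question words. It is an assumption-based illustration of the general technique, not the authors' implementation; all module names, dimensions, and design details (e.g. 2048-d region features, 4-d box geometry, multi-head attention for the co-attention step) are hypothetical.

```python
# Minimal sketch of the pipeline described in the abstract, assuming PyTorch and
# standard bottom-up object features (e.g. 36 regions x 2048-d). Module names,
# dimensions, and design choices below are illustrative assumptions.
import torch
import torch.nn as nn


class GraphLearner(nn.Module):
    """Binary relational reasoning: score every ordered pair of visual objects,
    conditioned on the question, to produce a soft adjacency matrix."""

    def __init__(self, obj_dim=2048, q_dim=512, hid=512):
        super().__init__()
        self.proj = nn.Linear(obj_dim + q_dim, hid)
        self.pair_score = nn.Linear(2 * hid, 1)

    def forward(self, v, q):                          # v: (B, N, obj_dim), q: (B, q_dim)
        B, N, _ = v.shape
        q_tiled = q.unsqueeze(1).expand(-1, N, -1)    # broadcast question to every object
        h = torch.relu(self.proj(torch.cat([v, q_tiled], -1)))
        hi = h.unsqueeze(2).expand(B, N, N, -1)       # node i of each (i, j) pair
        hj = h.unsqueeze(1).expand(B, N, N, -1)       # node j of each (i, j) pair
        logits = self.pair_score(torch.cat([hi, hj], -1)).squeeze(-1)  # (B, N, N)
        return torch.softmax(logits, dim=-1)          # row-normalised adjacency


class SpatialGraphConv(nn.Module):
    """Graph convolution over the learned graph; spatial awareness is approximated
    here by concatenating normalised bounding-box geometry to each node."""

    def __init__(self, obj_dim=2048, box_dim=4, out_dim=512):
        super().__init__()
        self.fc = nn.Linear(obj_dim + box_dim, out_dim)

    def forward(self, v, boxes, adj):                 # boxes: (B, N, 4), adj: (B, N, N)
        nodes = self.fc(torch.cat([v, boxes], -1))    # (B, N, out_dim)
        return torch.relu(torch.bmm(adj, nodes))      # aggregate over weighted neighbours


class CoAttention(nn.Module):
    """Parallel co-attention: image nodes attend to question words and vice versa."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img, words):                    # img: (B, N, dim), words: (B, T, dim)
        img_ctx, _ = self.img_attn(img, words, words) # image attends to words
        txt_ctx, _ = self.txt_attn(words, img, img)   # words attend to image
        return img_ctx, txt_ctx


if __name__ == "__main__":
    # Toy shapes: batch of 2, 36 regions, 14 question tokens (illustrative only).
    v = torch.randn(2, 36, 2048)
    boxes = torch.rand(2, 36, 4)
    q_vec = torch.randn(2, 512)
    words = torch.randn(2, 14, 512)

    adj = GraphLearner()(v, q_vec)                    # (2, 36, 36)
    img_nodes = SpatialGraphConv()(v, boxes, adj)     # (2, 36, 512)
    img_ctx, txt_ctx = CoAttention()(img_nodes, words)
    print(img_ctx.shape, txt_ctx.shape)               # (2, 36, 512) and (2, 14, 512)
```

In a full model, the co-attended image and question representations would be fused and passed to an answer classifier; the actual number of graph layers, the relational scoring function, and the depth of the co-attention module used in the paper may differ from this sketch.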
