Article

Cross-modality co-attention networks for visual question answering

Journal

SOFT COMPUTING
Volume 25, Issue 7, Pages 5411-5421

Publisher

SPRINGER
DOI: 10.1007/s00500-020-05539-7

Keywords

Visual question answering; Cross-modality co-attention; Computer vision

Funding

  1. National Natural Science Foundation of China [61672338, 61873160]


Abstract

Visual question answering (VQA) is an emerging task that combines natural language processing and computer vision. Selecting compelling multi-modal features is at the core of VQA. In multi-modal learning, attention networks provide an effective way to selectively utilize the given visual information. However, most previous VQA models focus only on the relationship between visual and language features and ignore the internal relationships within each modality. To address this issue: (1) we propose a cross-modality co-attention networks (CMCN) framework, which aims to learn both intra-modality and cross-modality relationships. (2) The cross-modality co-attention (CMC) module is the core of the framework and is composed of self-attention blocks and guided-attention blocks. The self-attention block learns relations within a modality, while the guided-attention block models cross-modal interactions between an image and a question. Cascading multiple CMC modules not only improves the fusion of visual and language representations but also captures more representative image and text information. (3) To show that the proposed model improves results, we carried out thorough experimental verification. Evaluations on the VQA 2.0 dataset confirm that CMCN has significant performance advantages over existing methods.
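The self-attention and guided-attention blocks described in the abstract can be sketched in simplified form. This is a minimal illustration of the underlying attention mechanics, not the authors' implementation: it omits the learned projection matrices, multi-head splitting, residual connections, and feed-forward layers a real CMC module would use, and all variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention over 2-D feature matrices.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ V

def self_attention(X):
    # Intra-modality: queries, keys, and values all come from one modality.
    return attention(X, X, X)

def guided_attention(X, Y):
    # Cross-modality: features X (e.g. image regions) attend to
    # features Y (e.g. question words), so Y guides the re-weighting of X.
    return attention(X, Y, Y)

# Toy example: 3 image-region features and 4 word features, dimension 8.
rng = np.random.default_rng(0)
img = rng.standard_normal((3, 8))
txt = rng.standard_normal((4, 8))

img_sa = self_attention(img)            # intra-modal relations among regions
img_ga = guided_attention(img_sa, txt)  # question-guided image features
print(img_ga.shape)                     # (3, 8)
```

Cascading several such modules, as the abstract describes, would simply feed the outputs of one CMC stage into the next.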
