Article

Cross-modality co-attention networks for visual question answering

Journal

SOFT COMPUTING
Volume 25, Issue 7, Pages 5411-5421

Publisher

SPRINGER
DOI: 10.1007/s00500-020-05539-7

Keywords

Visual question answering; Cross-modality co-attention; Computer vision

Funding

  1. National Natural Science Foundation of China [61672338, 61873160]


Abstract

Visual question answering (VQA) is an emerging task that combines natural language processing and computer vision. Selecting compelling multi-modal features is at the core of VQA. In multi-modal learning, attention networks provide an effective way to selectively utilize the given visual information. However, most previous VQA models focus only on the relationship between visual and language features and ignore the internal relationships within each modality. To address this issue: (1) we propose a cross-modality co-attention networks (CMCN) framework, which aims to learn both intra-modality and cross-modality relationships. (2) The cross-modality co-attention (CMC) module is the core of the framework and is composed of self-attention blocks and guided-attention blocks. The self-attention block learns relations within a modality, while the guided-attention block models cross-modal interactions between an image and a question. Cascading multiple CMC modules not only improves the fusion of visual and language representations but also captures more representative image and text information. (3) To show that the proposed model improves results, we carried out thorough experimental verification. Evaluations on the VQA 2.0 dataset confirm that CMCN has significant performance advantages over existing methods.
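The self-attention and guided-attention blocks described in the abstract can be sketched in simplified form. This is a minimal illustration of the underlying attention mechanics, not the authors' implementation: it omits the learned projection matrices, multi-head splitting, residual connections, and feed-forward layers a real CMC module would use, and all variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention over 2-D feature matrices.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ V

def self_attention(X):
    # Intra-modality: queries, keys, and values all come from one modality.
    return attention(X, X, X)

def guided_attention(X, Y):
    # Cross-modality: features X (e.g. image regions) attend to
    # features Y (e.g. question words), so Y guides the re-weighting of X.
    return attention(X, Y, Y)

# Toy example: 3 image-region features and 4 word features, dimension 8.
rng = np.random.default_rng(0)
img = rng.standard_normal((3, 8))
txt = rng.standard_normal((4, 8))

img_sa = self_attention(img)            # intra-modal relations among regions
img_ga = guided_attention(img_sa, txt)  # question-guided image features
print(img_ga.shape)                     # (3, 8)
```

Cascading several such modules, as the abstract describes, would simply feed the outputs of one CMC stage into the next.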
