Article

Multi-modal co-attention relation networks for visual question answering

Journal

VISUAL COMPUTER
Volume 39, Issue 11, Pages 5783-5795

Publisher

SPRINGER
DOI: 10.1007/s00371-022-02695-9

Keywords

Computer vision; Visual question answering; Co-attention; Visual object relation reasoning

Abstract

Current mainstream visual question answering (VQA) models capture only object-level visual representations and ignore the relationships between visual objects. To address this problem, we propose a Multi-Modal Co-Attention Relation Network (MCARN) that combines co-attention with visual object relation reasoning. MCARN models visual representations at both the object level and the relation level, and stacking its visual relation reasoning module further improves accuracy on Number questions. Inspired by MCARN, we propose two further models, RGF-CA and Cos-Sin+CA, which combine co-attention with the relative geometry features of visual objects; the former achieves excellent overall performance, while the latter reaches higher accuracy on Other questions. Extensive experiments and ablation studies on the benchmark dataset VQA 2.0 demonstrate the effectiveness of our models and verify the synergy between co-attention and visual object relation reasoning in the VQA task.
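This record gives no implementation details, but the abstract's pairing of attention with relative geometry features of visual objects can be illustrated with a short sketch. The PyTorch code below is an assumption-laden illustration, not the authors' module: it encodes pairwise bounding-box geometry with the common log-ratio scheme (in the style of Hu et al.'s Relation Networks) and injects it as an additive bias into object self-attention, one plausible form of relation-level reasoning. All names, shapes, and the feature definition are assumptions.

```python
# Hypothetical sketch of relation-level reasoning over detected objects.
# Not taken from the paper: the log-ratio geometry encoding and the
# additive attention bias are common choices, assumed here for illustration.
import torch


def relative_geometry_features(boxes: torch.Tensor) -> torch.Tensor:
    """boxes: (N, 4) tensor of [x1, y1, x2, y2]; returns (N, N, 4) pairwise features."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2                # box centres
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=1e-3)     # box sizes, kept positive
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1e-3)
    # Log-ratio encoding of pairwise centre offsets and size ratios.
    dx = torch.log((cx[:, None] - cx[None, :]).abs().clamp(min=1e-3) / w[:, None])
    dy = torch.log((cy[:, None] - cy[None, :]).abs().clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)


class GeometryBiasedAttention(torch.nn.Module):
    """Object self-attention with an additive geometry bias -- one plausible
    (hypothetical) way to fuse relation-level cues into attention weights."""

    def __init__(self, dim: int, geo_dim: int = 4):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)
        self.geo = torch.nn.Sequential(            # maps (N, N, 4) geometry
            torch.nn.Linear(geo_dim, dim),         # features to a scalar bias
            torch.nn.ReLU(),
            torch.nn.Linear(dim, 1),
        )

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) region features; boxes: (N, 4) matching boxes.
        bias = self.geo(relative_geometry_features(boxes)).squeeze(-1)  # (N, N)
        scores = self.q(feats) @ self.k(feats).T / feats.shape[-1] ** 0.5
        attn = torch.softmax(scores + bias, dim=-1)    # geometry-aware weights
        return attn @ self.v(feats)                    # relation-level features


# Example usage with random inputs for 36 detected objects (typical for VQA).
feats = torch.randn(36, 512)
xy = torch.rand(36, 2) * 0.5
wh = torch.rand(36, 2) * 0.5 + 0.05
boxes = torch.cat([xy, xy + wh], dim=-1)               # valid [x1, y1, x2, y2]
out = GeometryBiasedAttention(dim=512)(feats, boxes)   # (36, 512)
```

Under this reading, the RGF-CA and Cos-Sin+CA variants would differ in how the raw geometry features are embedded before fusion with co-attention (for example, a learned projection versus a sinusoidal cos/sin encoding), though the abstract alone does not fix those details.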
