Article

Multi-modal co-attention relation networks for visual question answering

Journal

VISUAL COMPUTER
Volume 39, Issue 11, Pages 5783-5795

Publisher

SPRINGER
DOI: 10.1007/s00371-022-02695-9

Keywords

Computer vision; Visual question answering; Co-attention; Visual object relation reasoning

Abstract

Current mainstream visual question answering (VQA) models capture only object-level visual representations and ignore the relationships between visual objects. To address this problem, we propose a Multi-Modal Co-Attention Relation Network (MCARN) that combines co-attention with visual object relation reasoning. MCARN models visual representations at both the object level and the relation level, and stacking its visual relation reasoning module further improves accuracy on Number questions. Inspired by MCARN, we propose two further models, RGF-CA and Cos-Sin+CA, which combine co-attention with the relative geometry features of visual objects; the former achieves excellent overall performance, while the latter reaches higher accuracy on Other questions. Extensive experiments and ablation studies on the benchmark dataset VQA 2.0 demonstrate the effectiveness of our models and verify the synergy between co-attention and visual object relation reasoning in the VQA task.
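This record gives no implementation details, but the abstract's pairing of attention with relative geometry features of visual objects can be illustrated with a short sketch. The PyTorch code below is an assumption-laden illustration, not the authors' module: it encodes pairwise bounding-box geometry with the common log-ratio scheme (in the style of Hu et al.'s Relation Networks) and injects it as an additive bias into object self-attention, one plausible form of relation-level reasoning. All names, shapes, and the feature definition are assumptions.

```python
# Hypothetical sketch of relation-level reasoning over detected objects.
# Not taken from the paper: the log-ratio geometry encoding and the
# additive attention bias are common choices, assumed here for illustration.
import torch


def relative_geometry_features(boxes: torch.Tensor) -> torch.Tensor:
    """boxes: (N, 4) tensor of [x1, y1, x2, y2]; returns (N, N, 4) pairwise features."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2                # box centres
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=1e-3)     # box sizes, kept positive
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1e-3)
    # Log-ratio encoding of pairwise centre offsets and size ratios.
    dx = torch.log((cx[:, None] - cx[None, :]).abs().clamp(min=1e-3) / w[:, None])
    dy = torch.log((cy[:, None] - cy[None, :]).abs().clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)


class GeometryBiasedAttention(torch.nn.Module):
    """Object self-attention with an additive geometry bias -- one plausible
    (hypothetical) way to fuse relation-level cues into attention weights."""

    def __init__(self, dim: int, geo_dim: int = 4):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)
        self.geo = torch.nn.Sequential(            # maps (N, N, 4) geometry
            torch.nn.Linear(geo_dim, dim),         # features to a scalar bias
            torch.nn.ReLU(),
            torch.nn.Linear(dim, 1),
        )

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) region features; boxes: (N, 4) matching boxes.
        bias = self.geo(relative_geometry_features(boxes)).squeeze(-1)  # (N, N)
        scores = self.q(feats) @ self.k(feats).T / feats.shape[-1] ** 0.5
        attn = torch.softmax(scores + bias, dim=-1)    # geometry-aware weights
        return attn @ self.v(feats)                    # relation-level features


# Example usage with random inputs for 36 detected objects (typical for VQA).
feats = torch.randn(36, 512)
xy = torch.rand(36, 2) * 0.5
wh = torch.rand(36, 2) * 0.5 + 0.05
boxes = torch.cat([xy, xy + wh], dim=-1)               # valid [x1, y1, x2, y2]
out = GeometryBiasedAttention(dim=512)(feats, boxes)   # (36, 512)
```

Under this reading, the RGF-CA and Cos-Sin+CA variants would differ in how the raw geometry features are embedded before fusion with co-attention (for example, a learned projection versus a sinusoidal cos/sin encoding), though the abstract alone does not fix those details.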
