Article

Sparse co-attention visual question answering networks based on thresholds

Journal

APPLIED INTELLIGENCE
Volume 53, Issue 1, Pages 586-600

Publisher

SPRINGER
DOI: 10.1007/s10489-022-03559-4

Keywords

Visual question answering; Sparse co-attention; Attention score; Threshold

Abstract

Most existing visual question answering (VQA) models learn the co-attention between the input image and the input question by modeling dense interactions between every image region and every question word. However, correctly answering a natural language question about an image usually requires understanding only a few key words of the question and capturing the visual information contained in only a few regions of the image. The noise generated by interactions involving image regions irrelevant to the question and question words irrelevant to predicting the correct answer distracts VQA models and degrades their performance. To address this problem, we propose a Sparse Co-Attention Visual Question Answering Network (SCAVQAN) based on thresholds. SCAVQAN concentrates the model's attention by setting thresholds on attention scores, retaining only the image features and question features that are most helpful for predicting the correct answer, and thereby improves the overall performance of the model. Experimental results, ablation studies and attention visualizations on two benchmark VQA datasets demonstrate the effectiveness and interpretability of our models.
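
The threshold-based filtering described in the abstract can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration of the general idea, assuming attention weights below a fixed cutoff are discarded and the surviving weights are renormalized; the function name, tensor shapes, and threshold value are hypothetical and do not reproduce the paper's actual SCAVQAN architecture.

```python
import torch
import torch.nn.functional as F

def thresholded_attention(query, key, value, threshold=0.1):
    """Scaled dot-product attention with threshold-based sparsification.

    Attention weights below `threshold` are zeroed out and the surviving
    weights are renormalized, so the output aggregates only the most
    relevant key/value positions. The threshold value is illustrative,
    not the value used in the paper.
    """
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)                        # dense attention weights
    weights = torch.where(weights >= threshold,
                          weights, torch.zeros_like(weights))  # drop weak interactions
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-9)  # renormalize
    return torch.matmul(weights, value), weights

# Toy usage: 36 image-region features attending over 14 question-word features
# (hypothetical shapes, not the paper's configuration).
image_feats = torch.randn(1, 36, 512)
question_feats = torch.randn(1, 14, 512)
attended, attn = thresholded_attention(image_feats, question_feats, question_feats)
print(attended.shape, (attn > 0).float().mean().item())  # fraction of kept weights
```

Applying the cutoff after the softmax keeps every retained weight interpretable as a renormalized probability, and the clamp avoids NaNs when a query row loses all of its weights. Where SCAVQAN actually applies its thresholds and how it renormalizes may differ from this sketch.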
