Article

Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism

IMAGING SCIENCE JOURNAL (2021)

Journal

IMAGING SCIENCE JOURNAL

Volume 69, Issue 1-4, Pages 177-189

Publisher

TAYLOR & FRANCIS LTD

DOI: 10.1080/13682199.2022.2153489

Keywords

Visual question answering; co-attention; transformer; multimodal fusion

Category

Imaging Science & Photographic Technology

AI Summary

This paper proposes a Scene Text VQA model that strengthens the representation of text tokens by combining FastText embeddings with appearance, bounding box, and PHOC features. The model uses a two-way co-attention mechanism to obtain discriminative image features and combines text token position information with the predicted answer embedding to generate the final answer. It reports accuracies of 51.27% (validation) and 52.09% (test) on TextVQA, and ANLS scores of 0.698 (validation) and 0.686 (test) on ST-VQA.

Abstract

Scene Text Visual Question Answering (VQA) requires understanding both the visual content and the text in an image in order to answer a question about that image. Existing Scene Text VQA models predict an answer by choosing a word either from a fixed vocabulary or from the text tokens extracted from the image. In this paper, we strengthen the representation of the text tokens by combining FastText embeddings with appearance, bounding box, and PHOC features. Our model employs two-way co-attention, built from self-attention and guided attention mechanisms, to obtain discriminative image features. We compute the text token position and combine this information with the predicted answer embedding for final answer generation. Our model achieves accuracies of 51.27% and 52.09% on the TextVQA validation and test sets, respectively. On the ST-VQA dataset, it achieves an ANLS score of 0.698 on the validation set and 0.686 on the test set.
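The ST-VQA results above are reported as ANLS (Average Normalized Levenshtein Similarity). The abstract does not spell out the metric, so the following is only a minimal sketch of ANLS as it is commonly defined for the ST-VQA benchmark: each prediction is scored by its best similarity to any ground-truth answer, and similarities whose normalized edit distance reaches a threshold (assumed here to be 0.5) count as zero. The function names `levenshtein` and `anls`, and the lowercasing and whitespace stripping, are illustrative assumptions rather than the authors' evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a set of questions.

    predictions:   list of predicted answer strings, one per question
    ground_truths: list of lists of acceptable answers, one list per question
    threshold:     assumed cutoff; normalized distances >= threshold score 0
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must align")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        pred_norm = pred.strip().lower()
        best = 0.0
        for answer in answers:
            ans_norm = answer.strip().lower()
            longest = max(len(pred_norm), len(ans_norm))
            # Normalized distance in [0, 1]; two empty strings are identical.
            nl = levenshtein(pred_norm, ans_norm) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition a near-miss caused by an OCR error (for example "stap" for "stop") still earns partial credit, while an unrelated answer earns none, which is why ANLS is reported as a score between 0 and 1 rather than as exact-match accuracy.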
