Article

Image captioning using transformer-based double attention network

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.engappai.2023.106545

Keywords

Self-attention; Transformer; Image captioning; Encoder-decoder

Abstract

Image captioning generates a human-like description for a query image and has attracted considerable attention recently. The most widely used model for image description is the encoder-decoder structure, where the encoder extracts the visual information of the image and the decoder generates a textual description of it. Transformers have significantly enhanced the performance of image description models. However, a single attention structure in transformers cannot capture the more complex relationships between key and query vectors. Furthermore, attention weights are assigned to all candidate vectors under the assumption that all of them are relevant. In this paper, a new double-attention framework is presented that improves the encoder-decoder structure for the image captioning problem. To this end, a local generator module and a global generator module are designed to predict textual descriptions collaboratively. The proposed approach improves Self-Attention (SA) in two ways to enhance the performance of image description. First, a Masked Self-Attention module is presented to attend to the most relevant information. Second, to avoid a single shallow attention distribution and build deeper internal relations, a Hybrid Weight Distribution (HWD) module is proposed that extends SA to exploit the relations between key and query vectors efficiently. Experiments on the Flickr30k and MS-COCO datasets show that the proposed approach achieves desirable performance on different evaluation measures compared with state-of-the-art frameworks.
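
As a rough illustration (not the authors' code), the masked self-attention idea from the abstract can be sketched in PyTorch: standard scaled dot-product self-attention whose low-scoring candidates are masked out so each query attends only to its most relevant keys, rather than spreading weight over all candidate vectors. The top_k parameter, the thresholding rule, and all identifiers below are assumptions for illustration; the abstract does not specify how the mask is built, and the HWD module is not sketched since its weighting scheme is not detailed there.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttention(nn.Module):
    """Sketch of a masked self-attention head (assumed design): scaled
    dot-product attention where, for each query, all but the top-k
    highest-scoring keys are masked out before the softmax."""

    def __init__(self, d_model: int, top_k: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.top_k = top_k
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), e.g. region features from the encoder
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # (B, L, L)

        # Keep only the top-k scores per query; set the rest to -inf so the
        # softmax assigns them zero weight instead of spreading probability
        # mass over every candidate vector.
        k_eff = min(self.top_k, scores.size(-1))
        topk_vals, _ = scores.topk(k_eff, dim=-1)   # sorted descending
        threshold = topk_vals[..., -1, None]        # k-th largest score
        masked = scores.masked_fill(scores < threshold, float("-inf"))

        weights = F.softmax(masked, dim=-1)
        return torch.matmul(weights, v)

if __name__ == "__main__":
    attn = MaskedSelfAttention(d_model=512, top_k=8)
    regions = torch.randn(2, 36, 512)   # e.g. 36 region features per image
    print(attn(regions).shape)          # torch.Size([2, 36, 512])
```

Masking to -inf before the softmax preserves the relative weights of the retained scores while driving the discarded candidates to zero weight, which matches the abstract's stated goal of attending only to the most relevant information.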
