Article

Image captioning using transformer-based double attention network

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.engappai.2023.106545

Keywords

Self-attention; Transformer; Image captioning; Encoder-decoder

Abstract

Image captioning generates a human-like description for a query image and has attracted considerable attention recently. The most widely used model for image description is the encoder-decoder structure, where the encoder extracts the visual information of the image and the decoder generates a textual description of it. Transformers have significantly enhanced the performance of image description models. However, a single attention structure in transformers cannot capture the more complex relationships between key and query vectors. Furthermore, attention weights are assigned to all candidate vectors under the assumption that all of them are relevant. In this paper, a new double-attention framework is presented that improves the encoder-decoder structure for the image captioning problem. To this end, a local generator module and a global generator module are designed to predict textual descriptions collaboratively. The proposed approach improves Self-Attention (SA) in two ways to enhance the performance of image description. First, a Masked Self-Attention module is presented to attend to the most relevant information. Second, to avoid a single shallow attention distribution and build deeper internal relations, a Hybrid Weight Distribution (HWD) module is proposed that extends SA to exploit the relations between key and query vectors efficiently. Experiments on the Flickr30k and MS-COCO datasets show that the proposed approach achieves desirable performance on different evaluation measures compared with state-of-the-art frameworks.
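
As a rough illustration (not the authors' code), the masked self-attention idea from the abstract can be sketched in PyTorch: standard scaled dot-product self-attention whose low-scoring candidates are masked out so each query attends only to its most relevant keys, rather than spreading weight over all candidate vectors. The top_k parameter, the thresholding rule, and all identifiers below are assumptions for illustration; the abstract does not specify how the mask is built, and the HWD module is not sketched since its weighting scheme is not detailed there.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttention(nn.Module):
    """Sketch of a masked self-attention head (assumed design): scaled
    dot-product attention where, for each query, all but the top-k
    highest-scoring keys are masked out before the softmax."""

    def __init__(self, d_model: int, top_k: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.top_k = top_k
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), e.g. region features from the encoder
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # (B, L, L)

        # Keep only the top-k scores per query; set the rest to -inf so the
        # softmax assigns them zero weight instead of spreading probability
        # mass over every candidate vector.
        k_eff = min(self.top_k, scores.size(-1))
        topk_vals, _ = scores.topk(k_eff, dim=-1)   # sorted descending
        threshold = topk_vals[..., -1, None]        # k-th largest score
        masked = scores.masked_fill(scores < threshold, float("-inf"))

        weights = F.softmax(masked, dim=-1)
        return torch.matmul(weights, v)

if __name__ == "__main__":
    attn = MaskedSelfAttention(d_model=512, top_k=8)
    regions = torch.randn(2, 36, 512)   # e.g. 36 region features per image
    print(attn(regions).shape)          # torch.Size([2, 36, 512])
```

Masking to -inf before the softmax preserves the relative weights of the retained scores while driving the discarded candidates to zero weight, which matches the abstract's stated goal of attending only to the most relevant information.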
