☆ 4.7 Article

Transformer-based local-global guidance for image captioning

EXPERT SYSTEMS WITH APPLICATIONS (2023)

期刊

EXPERT SYSTEMS WITH APPLICATIONS

卷 223, 期 -, 页码 -

出版社

PERGAMON-ELSEVIER SCIENCE LTD

DOI: 10.1016/j.eswa.2023.119774

关键词

Attention; Transformer; Encoder -decoder; Image captioning; Deep learning

类别

Computer Science, Artificial Intelligence Engineering, Electrical & Electronic Operations Research & Management Science

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

智能总结 New
摘要

This paper proposes a transformer-based image captioning structure to address the challenges and limitations in image captioning. By designing a generator network and a selector network, textual descriptions are generated collaboratively and the text-image relation is learned. Experimental results demonstrate that the proposed approach outperforms state-of-the-art models on COCO and Flickr datasets.

Image captioning is a difficult problem for machine learning algorithms to compress huge amounts of images into descriptive languages. The recurrent models are popularly used as the decoder to extract the caption with sig-nificant performance, while these models have complicated and inherently sequential overtime issues. Recently, transformers provide modeling long dependencies and support parallel processing of sequences compared to recurrent models. However, recent transformer-based models assign attention weights to all candidate vectors based on the assumption that all vectors are relevant and ignore the intra-object relationships. Besides, the complex relationships between key and query vectors cannot be provided using a single attention mechanism. In this paper, a new transformer-based image captioning structure without recurrence and convolution is proposed to address these issues. To this end, a generator network and a selector network to generate textual descriptions collaboratively are designed. Our work contains three main steps: (1) Design a transformer-based generator network as word-level guidance to generate next words based on the current state. (2) Train a latent space to learn the mapping of captions and images into the same embedding space to learn the text-image relation. (3) Design a selector network as sentence-level guidance to evaluate next words by assigning fitness scores to the partial captions through the embedding space. Compared with the architecture of existing methods, the proposed approach contains an attention mechanism without the dependencies of time. It executes each state to select the next best word using local-global guidance. In addition, the proposed model maintains dependencies between the sequences, and can be trained in parallel. Several experiments on the COCO and Flickr datasets demonstrate that the proposed approach can outperform various state-of-the-art models over well-known evaluation measures.

Transformer-based local-global guidance for image captioning

期刊

EXPERT SYSTEMS WITH APPLICATIONS

出版社

PERGAMON-ELSEVIER SCIENCE LTD

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Transformer-based local-global guidance for image captioning

期刊

EXPERT SYSTEMS WITH APPLICATIONS

出版社

PERGAMON-ELSEVIER SCIENCE LTD

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文