Article

Image caption generation using a dual attention mechanism

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.engappai.2023.106112

Keywords

Image captioning; Inception v3; CNN; BI-LSTM; SI-EFO optimization; Meteor score

Existing image captioning models can be classified into retrieval-oriented and generation-oriented schemes. This article introduces a new image captioning model that includes three main phases: feature extraction, dual attention generation, and caption generation. The model utilizes CNN for visual attention, LSTM for textual attention, and BI-LSTM to combine both modalities. The weights of BI-LSTM are optimized using the SI-EFO algorithm. Experimental results show improvement over several other models.
Generating a sentence that accurately captures the main idea of an ambiguous image is a significant and demanding task. Conventional image captioning schemes are categorized into two classes: retrieval-oriented schemes and generation-oriented schemes. An image caption generation system should produce precise, fluent, natural, and informative sentences while accurately identifying the content of the image, such as the scene, objects, relationships, and object properties. However, accurately expressing an image's content when creating captions can be challenging because not all visual information can be used. This article introduces a new image captioning model with three main phases: (1) extraction of Inception V3 features, (2) dual (visual and textual) attention generation, and (3) image caption generation. A Convolutional Neural Network (CNN) generates visual attention after the initial Inception V3 features are derived. In parallel, the input texts for the associated images are analyzed and fed to an LSTM to create textual attention. A Bidirectional LSTM (BI-LSTM) then combines the textual and visual attention to generate image captions. In particular, the Self-Improved Electric Fish Optimization (SI-EFO) algorithm is used to optimize the weights of the BI-LSTM. Finally, several measures confirm that the implemented system has improved: the adopted model is 35.21%, 33.76%, 39.52%, 29.69%, 30.12%, 21.49%, and 31.71% better than the GAN-RL, LSTM, GRU, EC + GOA, EC + CMBO, EC + DA, and EC + EFO models, respectively.
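The dual-attention fusion step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the additive-attention scoring, the feature shapes, the decoder query vector, and the concatenation-based fusion are all assumptions made for illustration; in the paper the visual branch comes from a CNN over Inception V3 features and the textual branch from an LSTM, with the fused attention feeding a BI-LSTM whose weights are tuned by SI-EFO.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(features, query, W):
    # Additive-style attention (an illustrative choice): score each feature
    # vector against a query state, then return the weighted context vector.
    scores = np.tanh(features @ W) @ query   # (n,) relevance scores
    weights = softmax(scores)                # attention distribution over n items
    return weights @ features, weights       # context vector (d,), weights (n,)

rng = np.random.default_rng(0)
d = 8                                        # hypothetical feature dimension
visual = rng.standard_normal((49, d))        # e.g. 7x7 CNN feature-map regions
textual = rng.standard_normal((12, d))       # e.g. LSTM hidden states per word
query = rng.standard_normal(d)               # hypothetical decoder state
W = rng.standard_normal((d, d))              # shared attention projection

v_ctx, v_w = attention_context(visual, query, W)
t_ctx, t_w = attention_context(textual, query, W)

# Fused input for the caption decoder: concatenated visual + textual contexts.
fused = np.concatenate([v_ctx, t_ctx])
print(fused.shape)   # prints (16,)
```

Each attention branch reduces a variable-length set of features to one fixed-size context vector, so the fused vector has a constant dimension regardless of image-region count or sentence length, which is what lets it be fed to a fixed-width BI-LSTM decoder step.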
