Article

Image caption generation using a dual attention mechanism

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.engappai.2023.106112

Keywords

Image captioning; Inception v3; CNN; BI-LSTM; SI-EFO optimization; Meteor score

Existing image captioning models can be classified into retrieval-oriented and generation-oriented schemes. This article introduces a new image captioning model that includes three main phases: feature extraction, dual attention generation, and caption generation. The model utilizes CNN for visual attention, LSTM for textual attention, and BI-LSTM to combine both modalities. The weights of BI-LSTM are optimized using the SI-EFO algorithm. Experimental results show improvement over several other models.
Generating a sentence that accurately captures the main idea of an ambiguous image is a significant and demanding task. Conventional image captioning schemes are categorized into two classes: retrieval-oriented schemes and generation-oriented schemes. An image caption generation system should produce precise, fluent, natural, and informative sentences while accurately identifying the content of the image, such as the scene, objects, relationships, and object properties. However, accurately expressing an image's content when creating captions can be challenging because not all visual information can be used. This article introduces a new image captioning model with three main phases: (1) extraction of Inception V3 features, (2) dual (visual and textual) attention generation, and (3) image caption generation. A Convolutional Neural Network (CNN) generates visual attention after the initial Inception V3 features are derived. In parallel, the input texts for the associated images are analyzed and fed to an LSTM to create textual attention. A Bidirectional LSTM (BI-LSTM) then combines the textual and visual attention to generate image captions. In particular, the Self-Improved Electric Fish Optimization (SI-EFO) algorithm is used to optimize the weights of the BI-LSTM. Finally, several measures confirm that the implemented system has improved: the adopted model is 35.21%, 33.76%, 39.52%, 29.69%, 30.12%, 21.49%, and 31.71% better than the GAN-RL, LSTM, GRU, EC + GOA, EC + CMBO, EC + DA, and EC + EFO models, respectively.
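The dual-attention fusion step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the additive-attention scoring, the feature shapes, the decoder query vector, and the concatenation-based fusion are all assumptions made for illustration; in the paper the visual branch comes from a CNN over Inception V3 features and the textual branch from an LSTM, with the fused attention feeding a BI-LSTM whose weights are tuned by SI-EFO.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(features, query, W):
    # Additive-style attention (an illustrative choice): score each feature
    # vector against a query state, then return the weighted context vector.
    scores = np.tanh(features @ W) @ query   # (n,) relevance scores
    weights = softmax(scores)                # attention distribution over n items
    return weights @ features, weights       # context vector (d,), weights (n,)

rng = np.random.default_rng(0)
d = 8                                        # hypothetical feature dimension
visual = rng.standard_normal((49, d))        # e.g. 7x7 CNN feature-map regions
textual = rng.standard_normal((12, d))       # e.g. LSTM hidden states per word
query = rng.standard_normal(d)               # hypothetical decoder state
W = rng.standard_normal((d, d))              # shared attention projection

v_ctx, v_w = attention_context(visual, query, W)
t_ctx, t_w = attention_context(textual, query, W)

# Fused input for the caption decoder: concatenated visual + textual contexts.
fused = np.concatenate([v_ctx, t_ctx])
print(fused.shape)   # prints (16,)
```

Each attention branch reduces a variable-length set of features to one fixed-size context vector, so the fused vector has a constant dimension regardless of image-region count or sentence length, which is what lets it be fed to a fixed-width BI-LSTM decoder step.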
