4.7 Article

Image caption generation using a dual attention mechanism

Journal

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.engappai.2023.106112

Keywords

Image captioning; Inception v3; CNN; BI-LSTM; SI-EFO optimization; Meteor score

Ask authors/readers for more resources

Existing image captioning models can be classified into retrieval-oriented and generation-oriented schemes. This article introduces a new image captioning model that includes three main phases: feature extraction, dual attention generation, and caption generation. The model utilizes CNN for visual attention, LSTM for textual attention, and BI-LSTM to combine both modalities. The weights of BI-LSTM are optimized using the SI-EFO algorithm. Experimental results show improvement over several other models.
In order to create a statement that accurately captures the main idea of an ambiguous visual, which is said to be a significant and demanding task? Conventional image captioning schemes are categorized into 2 classes: retrieval-oriented schemes and generation-oriented schemes. The image caption generating system should provide precise, fluid, natural, and informative phrases as well as accurately identify the content of the image, such as scene, object, relationship, and properties of the object in the image. However, it can be challenging to accurately express the image's content when creating image captions because not all visual information can be used. In this article, a new image captioning model is introduced that includes 3 main phases like (1) Extraction of Inception V3 features (2) Dual (Visual and Textual) attention generation and (3) generation of image caption. Convolutional Neural Network (CNN) is used to generate visual attention after first deriving initial V3 features. The input texts for the associated images, on the other hand, are analyzed and given to LSTM for the creation of textual attention. To create image captions, Bidirectional LSTM (BI-LSTM) is used to combine textual and visual attention. The Self Improved Electric Fish Optimization (SI-EFO) algorithm is used in particular to optimize the weights of the BI-LSTM. In the end, several measures confirm that the implemented system has improved. The adopted model is 35.21%, 33.76%, 39.52%, 29.69%, 30.12%, 21.49%, and 31.71% better than GAN-RL, LSTM, GRU, EC + GOA, EC + CMBO, EC + DA, EC + EFO models.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available