Article

Label-attention transformer with geometrically coherent objects for image captioning

Journal

INFORMATION SCIENCES
Volume 623, Pages 812-831

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.ins.2022.12.018

Keywords

Image captioning; Transformers; Self-attention; Label-attention; Geometrically coherent proposals; Memory-augmented-attention


This work applies the transformer to encoder-decoder-based image captioning and investigates two previously unexplored ideas: an object-focused label attention module (LAM) and a geometrically coherent proposal (GCP) module. These modules enforce objects' relevance and explore the effectiveness of learning the association between vision and language constructs. Experimental results show that the proposed framework, LATGeO, generates improved and meaningful captions.
Encoder-decoder-based image captioning techniques are generally used to describe the meaningful information present in an image. In this work, we investigate two unexplored ideas for image captioning using the transformer: 1) an object-focused label attention module (LAM), and 2) a geometrically coherent proposal (GCP) module that focuses on the scale and position of objects to give the transformer model better image perception. These modules enforce objects' relevance in the surrounding environment. Furthermore, they explore the effectiveness of learning an explicit association between vision and language constructs. LAM and GCP tolerate the variation in object classes and their association with labels in multi-label classification. The proposed framework, label-attention transformer with geometrically coherent objects (LATGeO), acquires proposals of geometrically coherent objects using a deep neural network (DNN) and generates captions by investigating their relationships using LAM. LAM associates the extracted object classes with the available dictionary using self-attention layers. Object coherence is acquired in the GCP module using the localized ratio of the proposals' geometrical features. In this study, experiments are performed on the MSCOCO dataset. The evaluation of LATGeO on MSCOCO indicates that objects' relevance in their surroundings, together with the binding of their visual features to geometrically localized ratios and associated labels, generates improved and meaningful captions. (c) 2022 Published by Elsevier Inc.
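
As a rough illustration of what "scale and position" features for an object proposal could look like, the sketch below expresses a bounding box as ratios of the image size. This is an assumption made for illustration only: the abstract does not give the GCP formula, and the function name `geometric_ratios`, the box layout, and the choice of five ratios are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def geometric_ratios(box: Box, image_w: float, image_h: float) -> List[float]:
    """Describe a box's position and scale as ratios of the image size.

    Illustrative only; not the formula used by the GCP module.
    """
    x_min, y_min, x_max, y_max = box
    box_w = x_max - x_min
    box_h = y_max - y_min
    return [
        (x_min + box_w / 2) / image_w,          # horizontal centre as a fraction of width
        (y_min + box_h / 2) / image_h,          # vertical centre as a fraction of height
        box_w / image_w,                        # relative width
        box_h / image_h,                        # relative height
        (box_w * box_h) / (image_w * image_h),  # relative area
    ]


# Two made-up proposals in a 640x480 image.
proposals = [(0.0, 0.0, 320.0, 240.0), (160.0, 120.0, 480.0, 360.0)]
features = [geometric_ratios(b, image_w=640.0, image_h=480.0) for b in proposals]
```

Ratios of this kind do not depend on image resolution, which is one plausible reason to prefer them over raw pixel coordinates when such features are combined with visual features.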

