Article

Label-attention transformer with geometrically coherent objects for image captioning

INFORMATION SCIENCES (2023)

Journal

INFORMATION SCIENCES

Volume 623, Pages 812-831

Publisher

ELSEVIER SCIENCE INC

DOI: 10.1016/j.ins.2022.12.018

Keywords

Image captioning; Transformers; Self-attention; Label-attention; Geometrically coherent proposals; Memory-augmented-attention

Automated Summary

This work investigates two unexplored ideas for transformer-based encoder-decoder image captioning: an object-focused label attention module (LAM) and a geometrically coherent proposal (GCP) module. These modules enforce objects' relevance and explore the effectiveness of learning the association between vision and language constructs. Experimental results show that the proposed framework, LATGeO, generates improved and meaningful captions.

Abstract

Encoder-decoder-based image captioning techniques are generally used to describe the meaningful information present in an image. In this work, we investigate two unexplored ideas for image captioning using the transformer: 1) an object-focused label attention module (LAM), and 2) a geometrically coherent proposal (GCP) module that focuses on the scale and position of objects to benefit the transformer model by attaining better image perception. These modules enforce objects' relevance in the surrounding environment. Furthermore, they explore the effectiveness of learning an explicit association between vision and language constructs. LAM and GCP tolerate the variation in objects' classes and their association with labels in multi-label classification. The proposed framework, label-attention transformer with geometrically coherent objects (LATGeO), acquires proposals of geometrically coherent objects using a deep neural network (DNN) and generates captions by investigating their relationships using LAM. LAM associates the extracted object classes with the available dictionary using self-attention layers. Object coherence is acquired in the GCP module using the localized ratio of the proposals' geometrical features. Experiments are performed on the MSCOCO dataset. The evaluation of LATGeO on MSCOCO indicates that objects' relevance in their surroundings, together with the binding of their visual features to geometrically localized ratios and associated labels, yields improved and meaningful captions. © 2022 Published by Elsevier Inc.
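The abstract does not spell out how the GCP module computes its "localized ratio" of geometrical features, so the following is only a minimal sketch of one widely used way to encode the relative scale and position of two object proposals: log-ratios of center offsets and box sizes. The function names (`box_geometry`, `relative_geometry`, `pairwise_geometry`), the `(x1, y1, x2, y2)` box format, and the `eps` floor are all assumptions made for illustration and are not taken from the paper.

```python
import math


def box_geometry(box):
    """Return (cx, cy, w, h) for a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1


def relative_geometry(box_i, box_j, eps=1e-3):
    """Hypothetical scale/position ratios of box_j relative to box_i.

    Offsets are normalized by the size of box_i, so the features do not
    change when the whole image is rescaled. eps avoids log(0) when the
    two centers coincide along an axis.
    """
    cxi, cyi, wi, hi = box_geometry(box_i)
    cxj, cyj, wj, hj = box_geometry(box_j)
    return (
        math.log(max(abs(cxj - cxi), eps) / wi),  # horizontal offset ratio
        math.log(max(abs(cyj - cyi), eps) / hi),  # vertical offset ratio
        math.log(wj / wi),                        # width ratio
        math.log(hj / hi),                        # height ratio
    )


def pairwise_geometry(boxes):
    """Build an N x N table of relative geometry tuples for N proposals."""
    return [[relative_geometry(bi, bj) for bj in boxes] for bi in boxes]
```

In a captioning model, features of this kind would typically be embedded and combined with the visual attention weights between proposals; whether LATGeO does exactly this is not stated in the abstract.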
