Article

A Framework for Image Captioning Based on Relation Network and Multilevel Attention Mechanism

Journal

NEURAL PROCESSING LETTERS
Volume 55, Issue 5, Pages 5693-5715

Publisher

SPRINGER
DOI: 10.1007/s11063-022-11106-y

Keywords

Relation network; Semantic; Attention; Encoder-decoder

This paper presents an image captioning algorithm that uses a local relation network and multilevel attention to capture semantic concepts in an image, yielding a richer image representation and better caption generation.
Understanding different semantic concepts, such as objects and their relationships in an image, and integrating them to produce a natural language description is the goal of the image captioning task. It therefore requires an algorithm that understands the visual content of a given image and translates it into a sequence of output words. In this paper, a local relation network is designed over the objects and image regions; it not only discovers the relationships between objects and image regions but also generates significant context-based features for every region in the image. Inspired by the transformer model, we employ multilevel attention, comprising self-attention and guided attention, to focus on a given image region and its related regions, thus enhancing the image representation capability of the proposed method. Finally, a variant of the traditional long short-term memory network with an attention mechanism is employed, which focuses on relevant contextual information, spatial locations, and deep visual features. With these measures, the proposed model encodes an image in an improved way, giving the model significant cues and leading to improved caption generation. Extensive experiments have been performed on three benchmark datasets: Flickr30k, MSCOCO, and Nocaps. On Flickr30k, the obtained evaluation scores are 31.2 BLEU@4, 23.5 METEOR, 51.5 ROUGE, 65.6 CIDEr, and 17.2 SPICE. On MSCOCO, the proposed model attains 42.4 BLEU@4, 29.4 METEOR, 59.7 ROUGE, 125.7 CIDEr, and 23.2 SPICE. The overall CIDEr score achieved on the Nocaps dataset is 114.3. These scores show the superiority of the proposed method over existing methods.
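As a rough illustration of the multilevel attention the abstract describes, the sketch below applies scaled dot-product attention twice: once as self-attention over image-region features, and once as guided attention where object features attend over the regions. The feature dimensions, the absence of learned projections, and the variable names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 512))   # image-region features (assumed 36 regions, dim 512)
objects = rng.normal(size=(10, 512))   # object features (assumed 10 objects)

# Level 1 -- self-attention: each region attends to every other region
self_att = scaled_dot_attention(regions, regions, regions)

# Level 2 -- guided attention: object features guide attention over the regions
guided = scaled_dot_attention(objects, regions, regions)

print(self_att.shape, guided.shape)  # (36, 512) (10, 512)
```

In the full model these attended features would pass through learned projection layers and feed the attention-LSTM decoder; the sketch only shows the attention arithmetic itself.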
