Article

Multilevel attention and relation network based image captioning model

Journal

MULTIMEDIA TOOLS AND APPLICATIONS
Volume 82, Issue 7, Pages 10981-11003

Publisher

SPRINGER
DOI: 10.1007/s11042-022-13793-0

Keywords

Relation network; Semantic; Attention; Encoder-decoder

This paper presents an image captioning method that uses a Local Relation Network (LRN) to capture the semantic concepts of objects and their relationships in an image. The proposed model achieves superior performance on three benchmark datasets, demonstrating its effectiveness in generating natural language descriptions.
The aim of the image captioning task is to understand various semantic concepts in an image, such as objects and their relationships, and to combine them into a natural language description. The task therefore requires an algorithm that understands the visual content of a given image and translates it into a sequence of output words. In this paper, a Local Relation Network (LRN) is designed over the objects and image regions; it not only discovers the relationships between objects and image regions but also generates significant context-based features for every region in the image. In addition, a multilevel attention approach is used to focus on a given image region and its related regions, enhancing the image representation capability of the proposed method. Finally, a variant of the traditional long short-term memory (LSTM) network with an attention mechanism is employed, which focuses on relevant contextual information, spatial locations, and deep visual features. With these measures, the proposed model encodes an image in an improved way, providing the decoder with significant cues and thus leading to improved caption generation. Extensive experiments have been performed on three benchmark datasets: Flickr30k, MSCOCO, and Nocaps. On Flickr30k, the obtained evaluation scores are 31.2 BLEU@4, 23.5 METEOR, 51.5 ROUGE, 65.6 CIDEr, and 17.2 SPICE. On MSCOCO, the proposed model attains 42.4 BLEU@4, 29.4 METEOR, 59.7 ROUGE, 125.7 CIDEr, and 23.2 SPICE. The overall CIDEr score achieved on the Nocaps dataset is 114.3. These scores clearly show the superiority of the proposed method over existing methods.
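The abstract describes the architecture only at a high level. As one illustrative reading, the sketch below assumes a relation-network-style pairwise aggregation for the LRN and a show-attend-tell-style attention LSTM decoding step; all class names, dimensions, and design choices are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the two ideas the abstract
# describes: (1) a relation module that builds context-based features for
# every image region from pairwise region interactions, and (2) an
# attention-augmented LSTM decoding step that attends over those features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalRelationModule(nn.Module):
    """For each region i, aggregate pairwise features g(v_i, v_j) over all
    regions j, yielding a context-enriched feature per region (a generic
    relation-network formulation; the paper's exact LRN may differ)."""
    def __init__(self, dim):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, regions):                        # regions: (B, N, D)
        B, N, D = regions.shape
        vi = regions.unsqueeze(2).expand(B, N, N, D)   # region i, repeated
        vj = regions.unsqueeze(1).expand(B, N, N, D)   # region j, repeated
        pair = torch.cat([vi, vj], dim=-1)             # (B, N, N, 2D)
        rel = self.g(pair).mean(dim=2)                 # aggregate over j
        return regions + rel                           # context-enriched regions

class AttentionLSTMStep(nn.Module):
    """One decoding step: soft attention over region features conditioned
    on the previous hidden state, followed by an LSTM cell update (a common
    attention-LSTM variant; assumed, not taken from the paper)."""
    def __init__(self, dim, embed_dim, hidden_dim):
        super().__init__()
        self.att = nn.Linear(dim + hidden_dim, 1)
        self.cell = nn.LSTMCell(embed_dim + dim, hidden_dim)

    def forward(self, regions, word_emb, h, c):        # regions: (B, N, D)
        B, N, _ = regions.shape
        q = h.unsqueeze(1).expand(B, N, h.size(-1))    # broadcast hidden state
        scores = self.att(torch.cat([regions, q], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)              # attention weights (B, N)
        context = (alpha.unsqueeze(-1) * regions).sum(dim=1)
        h, c = self.cell(torch.cat([word_emb, context], dim=-1), (h, c))
        return h, c, alpha

# Example usage with random features (batch of 2 images, 36 regions):
regions = torch.randn(2, 36, 512)
enriched = LocalRelationModule(512)(regions)
step = AttentionLSTMStep(dim=512, embed_dim=300, hidden_dim=512)
h = c = torch.zeros(2, 512)
h, c, alpha = step(enriched, torch.randn(2, 300), h, c)  # alpha: (2, 36)
```

In this reading, the relation module supplies the "context-based features corresponding to every region" and the attention step supplies the focus on "relevant contextual information, spatial locations, and deep visual features"; the paper's multilevel attention would stack or combine such attention layers.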
