Article

Boosting convolutional image captioning with semantic content and visual relationship

Journal

DISPLAYS
Volume 70, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.displa.2021.102069

Keywords

Image captioning; Generative adversarial network; Graph convolution network

A framework combining a CNN-based generator with conditional GAN (CGAN) training and a multi-modal graph convolution network (MGCN) is proposed for image caption generation; it better exploits visual relationships between objects and generates captions with richer semantic content.
Image captioning aims to automatically generate a natural-language sentence describing an image, an important but challenging task that spans computer vision and natural language processing. The task has been dominated by long short-term memory (LSTM) based solutions. Although much progress has been made with LSTM in recent years, LSTM-based models rely on serialized generation of descriptions, which cannot be parallelized and pays little attention to the hierarchical structure of captions. To address this problem, we propose a framework that uses a CNN-based generation model to produce image captions with the help of conditional generative adversarial training (CGAN). Furthermore, a multi-modal graph convolution network (MGCN) is used to exploit visual relationships between objects so that the generated captions carry semantic meaning; the scene graph serves as a bridge connecting objects, attributes and visual relationship information to produce better captions. Extensive experiments are conducted on the MS COCO dataset, and the results show that our method achieves better or comparable scores compared with state-of-the-art methods. Ablation results show that CGAN and MGCN better capture the visual relationships between objects in an image and thus generate captions with richer semantic content.
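To make the two components in the abstract concrete, here is a minimal PyTorch sketch, not the authors' implementation: one graph-convolution step that propagates scene-graph node features (objects, attributes, relationships), and one masked 1-D convolutional decoder layer of the kind used in convolutional (non-LSTM) captioning models. Class names, dimensions, and the placeholder adjacency matrix are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneGraphConv(nn.Module):
    """Hypothetical single GCN layer: X' = ReLU(A_hat @ X @ W),
    mixing each node's features with those of its scene-graph neighbors."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x:   (num_nodes, dim) node features (objects/attributes/relations)
        # adj: (num_nodes, num_nodes) normalized scene-graph adjacency
        return torch.relu(adj @ self.linear(x))

class CausalConvDecoderLayer(nn.Module):
    """Masked temporal convolution: each caption position only sees
    earlier words, so all positions can be trained in parallel."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1          # left-pad only => causal
        self.conv = nn.Conv1d(dim, dim, kernel_size)

    def forward(self, w):
        # w: (batch, dim, seq_len) embedded caption prefix
        return torch.relu(self.conv(F.pad(w, (self.pad, 0))))

# Toy usage with made-up sizes.
nodes = torch.randn(5, 128)               # 5 scene-graph nodes
adj = torch.eye(5)                         # placeholder adjacency
ctx = SceneGraphConv(128)(nodes, adj)      # relation-aware node features
words = torch.randn(2, 128, 10)            # batch of caption embeddings
out = CausalConvDecoderLayer(128)(words)   # all 10 positions in parallel
```

The causal convolution illustrates the abstract's contrast with LSTM decoding: because the mask is enforced by left-padding rather than by recurrence, all caption positions are computed in one forward pass during training. In the paper's setup the CGAN discriminator would additionally score generated captions against the image condition; that component is omitted here.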
