Article

Image captioning via proximal policy optimization

Journal

IMAGE AND VISION COMPUTING
Volume 108, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.imavis.2021.104126

Keywords

Image captioning; Reinforcement learning; Proximal policy optimization

Funding

  1. Fundamental Research Funds for the Central Universities [328201904]
  2. National Key Research Program of China [2017YFB0801803]

Abstract

Image captioning involves generating captions for images in natural language. Applying the PPO algorithm to a state-of-the-art architecture such as X-Transformer yields improvements in system performance. Experimental results suggest that combining PPO with dropout regularization may decrease performance, possibly because of the KL-divergence behaviour of the RL policies. Using a word-level baseline instead of a sentence-level baseline in the policy-gradient estimator leads to better results.
Image captioning is the task of generating captions for images in natural language. Training typically consists of two phases: first minimizing the XE (cross-entropy) loss, and then optimizing with RL (reinforcement learning) over CIDEr scores. Although there have been many innovations in neural architectures, far fewer works address the RL phase. Motivated by the recent state-of-the-art architecture X-Transformer [Pan et al., CVPR 2020], we apply PPO (Proximal Policy Optimization) to it to achieve a further improvement. However, naively applying the vanilla policy-gradient objective with the clipping form of PPO does not improve the result, so we introduce certain modifications. We show that PPO is capable of enforcing trust-region constraints effectively. We also observe experimentally that performance decreases when PPO is combined with dropout regularization, and we analyze the possible reason in terms of the KL-divergence of the RL policies. The baseline adopted in the RL policy-gradient estimator is generally sentence-level, so all words in the same sentence share the same baseline. We instead use a word-level baseline obtained via Monte-Carlo estimation, so that different words can have different baseline values. With all these modifications, by fine-tuning a pre-trained X-Transformer, we train a single model achieving a competitive result of 133.3% on the MSCOCO Karpathy test set. Source code is available at https://github.com/lezhang-thu/xtransformer-ppo. © 2021 Elsevier B.V. All rights reserved.
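
The clipped PPO surrogate and the word-level baseline described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch snippet, not code from the linked repository; names such as `ppo_clip_loss`, `word_level_advantages`, and `clip_eps` are assumptions made here for illustration. It shows how a per-word importance ratio is clipped, and how a sentence-level CIDEr reward can be combined with per-word baselines so that different words receive different advantage values.

```python
# Illustrative sketch only; function and variable names are NOT taken from the paper's repository.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, mask, clip_eps=0.1):
    """Clipped PPO surrogate averaged over valid caption tokens.

    logp_new   : (batch, T) log-probs of sampled words under the current policy
    logp_old   : (batch, T) log-probs of the same words under the rollout policy
    advantages : (batch, T) per-word advantages, e.g. CIDEr reward minus a
                 word-level baseline (different words may get different baselines)
    mask       : (batch, T) 1 for real words, 0 for padding
    """
    ratio = torch.exp(logp_new - logp_old)  # importance ratio per word
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO takes the pessimistic minimum of the two surrogates; we maximize it,
    # i.e. minimize its negative, averaged over non-padding tokens.
    surrogate = torch.min(unclipped, clipped)
    return -(surrogate * mask).sum() / mask.sum().clamp(min=1.0)

def word_level_advantages(sentence_reward, word_baselines, mask):
    """Every word shares its sentence-level reward (e.g. CIDEr of the caption),
    but each position subtracts its own baseline, e.g. a Monte-Carlo estimate
    obtained from sampled continuations of the prefix up to that position."""
    # sentence_reward: (batch,)   word_baselines, mask: (batch, T)
    return (sentence_reward.unsqueeze(1) - word_baselines) * mask

if __name__ == "__main__":
    B, T = 2, 5
    logp_old = torch.randn(B, T)
    logp_new = logp_old + 0.05 * torch.randn(B, T)
    mask = torch.ones(B, T)
    reward = torch.tensor([1.2, 0.8])        # e.g. CIDEr of each sampled caption
    baselines = torch.full((B, T), 1.0)      # hypothetical per-word baselines
    adv = word_level_advantages(reward, baselines, mask)
    print(ppo_clip_loss(logp_new, logp_old, adv, mask).item())
```

In this sketch the only difference from a sentence-level setup is that `word_baselines` varies across positions; with a constant baseline per sentence it reduces to the usual self-critical form.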
