Article

Image captioning via proximal policy optimization

Journal

IMAGE AND VISION COMPUTING
Volume 108, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.imavis.2021.104126

Keywords

Image captioning; Reinforcement learning; Proximal policy optimization

Funding

  1. Fundamental Research Funds for the Central Universities [328201904]
  2. National Key Research Program of China [2017YFB0801803]

Abstract

Image captioning involves generating captions for images in natural language. Applying the PPO algorithm to a state-of-the-art architecture such as X-Transformer yields improvements in system performance. Experimental results suggest that combining PPO with dropout regularization may decrease performance, possibly because of the KL-divergence behaviour of the RL policies. Using a word-level baseline instead of a sentence-level baseline in the policy-gradient estimator leads to better results.
Image captioning is the task of generating captions for images in natural language. Training typically consists of two phases: first minimizing the XE (cross-entropy) loss, and then optimizing with RL (reinforcement learning) over CIDEr scores. Although there have been many innovations in neural architectures, far fewer works address the RL phase. Motivated by the recent state-of-the-art architecture X-Transformer [Pan et al., CVPR 2020], we apply PPO (Proximal Policy Optimization) to it to achieve a further improvement. However, naively applying the vanilla policy-gradient objective with the clipping form of PPO does not improve the result, so we introduce certain modifications. We show that PPO is capable of enforcing trust-region constraints effectively. We also observe experimentally that performance decreases when PPO is combined with dropout regularization, and we analyze the possible reason in terms of the KL-divergence of the RL policies. The baseline adopted in the RL policy-gradient estimator is generally sentence-level, so all words in the same sentence share the same baseline. We instead use a word-level baseline obtained via Monte-Carlo estimation, so that different words can have different baseline values. With all these modifications, by fine-tuning a pre-trained X-Transformer, we train a single model achieving a competitive result of 133.3% on the MSCOCO Karpathy test set. Source code is available at https://github.com/lezhang-thu/xtransformer-ppo. © 2021 Elsevier B.V. All rights reserved.
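
The clipped PPO surrogate and the word-level baseline described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch snippet, not code from the linked repository; names such as `ppo_clip_loss`, `word_level_advantages`, and `clip_eps` are assumptions made here for illustration. It shows how a per-word importance ratio is clipped, and how a sentence-level CIDEr reward can be combined with per-word baselines so that different words receive different advantage values.

```python
# Illustrative sketch only; function and variable names are NOT taken from the paper's repository.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, mask, clip_eps=0.1):
    """Clipped PPO surrogate averaged over valid caption tokens.

    logp_new   : (batch, T) log-probs of sampled words under the current policy
    logp_old   : (batch, T) log-probs of the same words under the rollout policy
    advantages : (batch, T) per-word advantages, e.g. CIDEr reward minus a
                 word-level baseline (different words may get different baselines)
    mask       : (batch, T) 1 for real words, 0 for padding
    """
    ratio = torch.exp(logp_new - logp_old)  # importance ratio per word
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO takes the pessimistic minimum of the two surrogates; we maximize it,
    # i.e. minimize its negative, averaged over non-padding tokens.
    surrogate = torch.min(unclipped, clipped)
    return -(surrogate * mask).sum() / mask.sum().clamp(min=1.0)

def word_level_advantages(sentence_reward, word_baselines, mask):
    """Every word shares its sentence-level reward (e.g. CIDEr of the caption),
    but each position subtracts its own baseline, e.g. a Monte-Carlo estimate
    obtained from sampled continuations of the prefix up to that position."""
    # sentence_reward: (batch,)   word_baselines, mask: (batch, T)
    return (sentence_reward.unsqueeze(1) - word_baselines) * mask

if __name__ == "__main__":
    B, T = 2, 5
    logp_old = torch.randn(B, T)
    logp_new = logp_old + 0.05 * torch.randn(B, T)
    mask = torch.ones(B, T)
    reward = torch.tensor([1.2, 0.8])        # e.g. CIDEr of each sampled caption
    baselines = torch.full((B, T), 1.0)      # hypothetical per-word baselines
    adv = word_level_advantages(reward, baselines, mask)
    print(ppo_clip_loss(logp_new, logp_old, adv, mask).item())
```

In this sketch the only difference from a sentence-level setup is that `word_baselines` varies across positions; with a constant baseline per sentence it reduces to the usual self-critical form.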
