Article

Image captioning via proximal policy optimization

Journal

IMAGE AND VISION COMPUTING
Volume 108

Publisher

ELSEVIER
DOI: 10.1016/j.imavis.2021.104126

Keywords

Image captioning; Reinforcement learning; Proximal policy optimization

Funding

  1. Fundamental Research Funds for the Central Universities [328201904]
  2. National Key Research Program of China [2017YFB0801803]

Image captioning is the task of generating natural-language captions for images. Applying the PPO algorithm to a state-of-the-art architecture such as X-Transformer improves captioning performance. Experiments also show that combining PPO with dropout regularization degrades performance, which the authors analyze in terms of the KL-divergence of RL policies, and that a word-level baseline in the policy gradient estimator yields better results than the usual sentence-level baseline.
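For context on the clipped objective referenced above, the following PyTorch-style sketch shows the standard PPO clipped surrogate loss applied per sampled caption word; the function name, tensor shapes, and the clipping value are illustrative assumptions, not the paper's released implementation.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.1):
    # logp_new:   log-probabilities of the sampled words under the current policy, shape (T,)
    # logp_old:   log-probabilities of the same words under the (old) sampling policy, shape (T,)
    # advantages: per-word advantage (CIDEr reward minus baseline), shape (T,)
    ratio = torch.exp(logp_new - logp_old)                   # importance-sampling ratio per word
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (lower) bound of the two surrogates, averaged over words; negated for minimization.
    return -torch.min(unclipped, clipped).mean()
```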
Image captioning is the task of generating captions of images in natural language. Training typically consists of two phases: first minimizing the XE (cross-entropy) loss, and then optimizing CIDEr scores with RL (reinforcement learning). Although there are many innovations in neural architectures, fewer works address the RL phase. Motivated by a recent state-of-the-art architecture, X-Transformer [Pan et al., CVPR 2020], we apply PPO (Proximal Policy Optimization) to it to establish a further improvement. However, trivially applying a vanilla policy gradient objective with the clipping form of PPO does not improve the result, so we introduce certain modifications. We show that PPO is capable of enforcing trust-region constraints effectively. We also observe experimentally that performance decreases when PPO is combined with the regularization technique dropout, and we analyze the possible reason in terms of the KL-divergence of RL policies. The baseline adopted in the policy gradient estimator of RL is generally sentence-level, so all words in the same sentence share the same baseline in the gradient estimator. We instead use a word-level baseline via Monte-Carlo estimation, so that different words can have different baseline values. With all these modifications, by fine-tuning a pre-trained X-Transformer, we train a single model achieving a competitive CIDEr score of 133.3% on the MSCOCO Karpathy test set. Source code is available at https://github.com/lezhang-thu/xtransformer-ppo. (c) 2021 Elsevier B.V. All rights reserved.
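The abstract does not spell out how the word-level baseline is computed; the sketch below shows one plausible Monte-Carlo scheme, in which the sampled caption is re-completed from each word position several times and the average reward of those completions serves as that word's baseline. The `sample_completion` and `cider_reward` helpers are hypothetical stand-ins, not functions from the authors' repository.

```python
import torch

@torch.no_grad()
def word_level_baselines(model, image_feats, caption, num_rollouts=5,
                         sample_completion=None, cider_reward=None):
    # caption: list of word ids sampled by the current policy for one image.
    # For each position t, re-sample completions of the prefix caption[:t]
    # and average their CIDEr rewards, giving every word its own baseline
    # instead of a single sentence-level value.
    baselines = []
    for t in range(len(caption)):
        rewards = []
        for _ in range(num_rollouts):
            completed = sample_completion(model, image_feats, caption[:t])  # hypothetical sampler
            rewards.append(cider_reward(completed))                         # hypothetical CIDEr scorer
        baselines.append(sum(rewards) / num_rollouts)
    return torch.tensor(baselines)  # one baseline value per word position
```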
