Article

Re-Caption: Saliency-Enhanced Image Captioning Through Two-Phase Learning

Journal

IEEE TRANSACTIONS ON IMAGE PROCESSING
Volume 29, Pages 694-709

Publisher

IEEE (Institute of Electrical and Electronics Engineers Inc.)
DOI: 10.1109/TIP.2019.2928144

Keywords

Visualization; Semantics; Adaptation models; Computational modeling; Predictive models; Task analysis; Fans; Image captioning; robust estimation; saliency; salient region detection; two-phase learning; visual attribute

Funding

  1. National Natural Science Foundation of China [61572140, 61976057]
  2. Shanghai Municipal R&D Foundation [17DZ1100504, 16JC1420401]
  3. Shanghai Natural Science Foundation [19ZR1417200]
  4. Humanities and Social Sciences Planning Fund of Ministry of Education of China [19YJA630116]
  5. Henry Tippie Endowed Chair Fund from The University of Iowa

Abstract

Visual saliency and semantic saliency are both important for image captioning. However, a single-phase image captioning model benefits little from the limited saliency information available without a saliency predictor. In this paper, a novel saliency-enhanced re-captioning framework based on two-phase learning is proposed to enhance single-phase image captioning. In the framework, both visual and semantic saliency cues are distilled from the first-phase model and fused into the second-phase model for model self-boosting. The visual saliency mechanism generates a saliency map and a saliency mask for an image without learning a saliency predictor. The semantic saliency mechanism sheds light on the properties of caption words tagged as nouns. In addition, a third type of saliency, sample saliency, is proposed to compute the saliency degree of each training sample, which helps make image captioning more robust. How to combine the three types of saliency for a further performance boost is also examined. The framework can treat an image captioning model as a saliency extractor, which may benefit other captioning models and related tasks. Experimental results on both the Flickr30k and MSCOCO datasets show that the saliency-enhanced models obtain promising performance gains.
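The abstract describes distilling a visual saliency map and mask from the first-phase model (without a learned saliency predictor) and a semantic saliency cue based on noun words in the caption. The sketch below is a minimal illustration of how such cues might be combined, assuming per-word spatial attention from a first-phase attention-based captioner; the shapes, the 0.5 threshold, and the function name are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): derive a visual saliency map and
# mask from a first-phase captioning model's per-word attention, weighted by
# a noun-based semantic saliency flag. Shapes, threshold, and random inputs
# below are assumptions for illustration only.
import numpy as np

def distill_saliency(attn, is_noun, thresh=0.5):
    """attn: (T, H, W) spatial attention for each of T generated words.
    is_noun: (T,) flags for caption words POS-tagged as nouns.
    Returns a normalized saliency map and a binary saliency mask,
    without training any separate saliency predictor."""
    weights = is_noun.astype(np.float32)
    if weights.sum() == 0:              # fall back to uniform weights if no nouns
        weights = np.ones_like(weights)
    # Average the attention maps of noun words to form an image-level map.
    sal_map = (attn * weights[:, None, None]).sum(axis=0) / weights.sum()
    # Normalize to [0, 1] and threshold into a saliency mask.
    sal_map = (sal_map - sal_map.min()) / (sal_map.max() - sal_map.min() + 1e-8)
    sal_mask = (sal_map >= thresh).astype(np.float32)
    return sal_map, sal_mask

# Example with random attention over a 7x7 feature grid for a 5-word caption.
attn = np.random.rand(5, 7, 7)
is_noun = np.array([0, 1, 0, 1, 0], dtype=bool)
sal_map, sal_mask = distill_saliency(attn, is_noun)
```

Per the abstract, the distilled map and mask would then be fused into the second-phase model during re-captioning for self-boosting; how that fusion is performed is detailed in the paper itself.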
