Journal
IEEE TRANSACTIONS ON IMAGE PROCESSING
Volume 29, Pages 9627-9640
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TIP.2020.3028651
Keywords
Generators; Decoding; Generative adversarial networks; Training; Computational modeling; Task analysis; Image captioning; ensemble generation-retrieval model; adversarial learning
Funding
- National Natural Science Foundation of China [61906185, 61876208]
- Natural Science Foundation of Guangdong Province of China [2019A1515011705]
- Key-Area Research and Development Program of Guangdong Province [2018B010108002]
- Shenzhen Basic Research Foundation [JCYJ20180302145607677, JCYJ20190808182805919]
- Youth Innovation Promotion Association of CAS
Image captioning, which aims to generate a sentence describing the key content of a query image, is an important but challenging task. Existing image captioning approaches fall into two categories: generation-based methods and retrieval-based methods. Retrieval-based methods describe images by retrieving pre-existing captions from a repository, while generation-based methods synthesize a new sentence that verbalizes the query image. Each approach has its own advantages and drawbacks. In this paper, we propose a novel EnsCaption model, which ensembles retrieval-based and generation-based image captioning through a novel dual-generator generative adversarial network. Specifically, EnsCaption is composed of a caption generation model that synthesizes tailored captions for the query image, a caption re-ranking model that retrieves the best-matching caption from a candidate pool consisting of generated captions and pre-retrieved captions, and a discriminator that learns the multi-level difference between the generated/retrieved captions and the ground-truth captions. During adversarial training, the caption generation model and the caption re-ranking model learn to provide improved synthetic and retrieved candidate captions that earn high ranking scores from the discriminator, while the multi-level ranking discriminator is trained to assign low ranking scores to the generated and retrieved captions. Our model thus absorbs the merits of both generation-based and retrieval-based approaches. We conduct comprehensive experiments to evaluate EnsCaption on two benchmark datasets, MSCOCO and Flickr-30K. Experimental results show that EnsCaption achieves impressive performance compared to strong baseline methods.
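The ensemble step described above, merging generated and retrieved captions into one candidate pool and letting a discriminator-style scorer pick the winner, can be sketched in a few lines. This is a hedged toy illustration, not the authors' implementation: `score_caption` is a hypothetical keyword-overlap stand-in for the learned multi-level ranking discriminator, and the captions and keywords are invented examples.

```python
def score_caption(query_keywords, caption):
    """Toy relevance score: keyword overlap between the query image's
    (hypothetical) keywords and the caption. A placeholder for the
    learned discriminator's ranking score in EnsCaption."""
    return len(set(caption.lower().split()) & set(query_keywords))

def ensemble_rerank(query_keywords, generated, retrieved):
    """Merge generated and retrieved captions into one candidate pool,
    then return the top-scoring caption (the re-ranking step)."""
    pool = list(generated) + list(retrieved)
    return max(pool, key=lambda c: score_caption(query_keywords, c))

# Invented example: keywords assumed to come from a query image.
keywords = {"dog", "frisbee", "grass"}
generated = ["a dog runs on the grass", "a man rides a horse"]
retrieved = ["a dog catches a frisbee on the grass", "a cat sleeps"]
print(ensemble_rerank(keywords, generated, retrieved))
# → a dog catches a frisbee on the grass
```

In the actual model, the scorer is trained adversarially: the generation and re-ranking models improve their candidates to fool it, while it learns to separate them from ground-truth captions.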