4.7 Article

Dual Attention on Pyramid Feature Maps for Image Captioning

Journal

IEEE Transactions on Multimedia
Volume 24, Pages 1775-1786

Publisher

IEEE (Institute of Electrical and Electronics Engineers, Inc.)
DOI: 10.1109/TMM.2021.3072479

Keywords

Visualization; Decoding; Task analysis; Semantics; Feature extraction; Two dimensional displays; Context modeling; Image captioning; dual attention; pyramid attention

This paper proposes applying dual attention to pyramid image feature maps to improve the quality of sentences generated from images. The method achieves impressive results on multiple datasets and is highly modular, so it can be readily inserted into other image captioning models.
Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose to apply dual attention on pyramid image feature maps to fully explore the visual-semantic correlations and improve the quality of generated sentences. Specifically, by fully considering the contextual information provided by the hidden state of the RNN controller, the pyramid attention can better localize the visually indicative and semantically consistent regions in images. In addition, the contextual information can help re-calibrate the importance of feature components by learning channel-wise dependencies, improving the discriminative power of visual features for better content description. We conducted comprehensive experiments on three well-known datasets, Flickr8K, Flickr30K and MS COCO, and achieved impressive results in generating descriptive and smooth natural sentences from images. Using either convolutional visual features or more informative bottom-up attention features, the composite model can boost the performance of image-to-sentence translation with limited computational overhead. The proposed pyramid attention and dual attention methods are highly modular and can be inserted into various image captioning models to further improve performance.
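
To make the idea concrete, the sketch below shows, in PyTorch, one plausible way to combine context-guided spatial attention with channel-wise re-calibration at several pyramid scales, as the abstract describes. This is a minimal illustration based on the abstract only, not the authors' implementation; all names and design choices here (DualAttention, PyramidDualAttention, attn_dim, mean-pooling before the channel gate, concatenation-based fusion) are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualAttention(nn.Module):
    """Spatial plus channel attention guided by the decoder's hidden state."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        # Spatial branch: score each image region against the decoder context.
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.ctx_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        # Channel branch: gate feature channels from pooled features + context.
        self.channel_gate = nn.Sequential(
            nn.Linear(feat_dim + hidden_dim, feat_dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 4, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # feats:  (B, R, C) region features flattened from one feature-map scale
        # hidden: (B, H)    decoder RNN hidden state at the current time step
        # Channel attention: re-calibrate channel importance using the context.
        pooled = feats.mean(dim=1)                                    # (B, C)
        gate = self.channel_gate(torch.cat([pooled, hidden], dim=1))  # (B, C)
        feats = feats * gate.unsqueeze(1)                             # (B, R, C)
        # Spatial attention: weight regions consistent with the current context.
        scores = self.score(torch.tanh(
            self.feat_proj(feats) + self.ctx_proj(hidden).unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)
        return (alpha * feats).sum(dim=1)                             # (B, C)


class PyramidDualAttention(nn.Module):
    """Run dual attention at every pyramid scale and fuse the attended vectors."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_scales: int = 3):
        super().__init__()
        self.scales = nn.ModuleList(
            [DualAttention(feat_dim, hidden_dim) for _ in range(num_scales)])
        self.fuse = nn.Linear(num_scales * feat_dim, feat_dim)

    def forward(self, pyramid, hidden):
        # pyramid: list of (B, R_i, C) tensors, one per feature-map scale
        ctx = [attn(f, hidden) for attn, f in zip(self.scales, pyramid)]
        return self.fuse(torch.cat(ctx, dim=1))  # context vector for the word decoder
```

In such a setup, the caption decoder would call PyramidDualAttention once per decoding step with its current hidden state and feed the fused context vector into the word-prediction RNN; whether the scales are fused by concatenation, as sketched here, or by some other mechanism is a detail the abstract does not specify.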
