Proceedings Paper

CAPFORMER: PURE TRANSFORMER FOR REMOTE SENSING IMAGE CAPTION

Publisher

IEEE
DOI: 10.1109/IGARSS46834.2022.9883199

Keywords

Remote sensing image caption; Transformer

Funding

  1. National Natural Science Foundation of China [42071350, 42171336]
  2. LIESMARS Special Research Funding


This paper proposes a pure Transformer (CapFormer) architecture for accurately describing high-spatial resolution remote sensing images. By adopting a scalable vision Transformer and a Transformer decoder, CapFormer outperforms the state-of-the-art image caption methods in summarizing complex scenes.
Accurately describing high-spatial-resolution remote sensing images requires understanding both the inner attributes of objects and the outer relations between different objects. Existing image caption algorithms lack the ability of global representation and are therefore ill-suited to summarizing complex scenes. To this end, we propose a pure Transformer (CapFormer) architecture for remote sensing image captioning. Specifically, a scalable vision Transformer is adopted for image representation, where the global content is captured with multi-head self-attention layers. A Transformer decoder is designed to successively translate the image features into comprehensive sentences. The decoder explicitly models the previously generated words and interacts with the image features through cross-attention layers. Comprehensive experiments and ablation studies on the RSICD dataset demonstrate that CapFormer outperforms state-of-the-art image caption methods.
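The cross-attention step the abstract describes, where decoder word states query the vision Transformer's image patch features, can be sketched in NumPy. This is an illustrative single-head version with the learned query/key/value projections omitted; the shapes, names, and toy data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def cross_attention(queries, keys_values, d_k):
    """Single-head scaled dot-product cross-attention: decoder word states
    (queries) attend over encoder image patch features (keys = values)."""
    scores = queries @ keys_values.T / np.sqrt(d_k)        # (T_words, N_patches)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over patches
    context = weights @ keys_values                        # (T_words, d_k)
    return context, weights

# Toy example: 5 previously generated word states attend to 49 patch features
# (e.g. a 7x7 grid of ViT patch embeddings); dimensions are hypothetical.
rng = np.random.default_rng(0)
words = rng.standard_normal((5, 64))     # decoder states (queries)
patches = rng.standard_normal((49, 64))  # image patch features (keys/values)
context, attn = cross_attention(words, patches, d_k=64)
print(context.shape)  # (5, 64): one image-conditioned context vector per word
```

Each row of `attn` is a probability distribution over image patches, which is what lets every generated word be grounded in a different region of the scene.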

