Journal
2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022)
Pages 7996-7999
Publisher
IEEE
DOI: 10.1109/IGARSS46834.2022.9883199
Keywords
Remote sensing image caption; Transformer
Funding
- National Natural Science Foundation of China [42071350, 42171336]
- LIESMARS Special Research Funding
This paper proposes a pure Transformer architecture (CapFormer) for accurately describing high-spatial-resolution remote sensing images. By combining a scalable vision Transformer encoder with a Transformer decoder, CapFormer outperforms state-of-the-art image caption methods in summarizing complex scenes.
Accurately describing high-spatial-resolution remote sensing images requires understanding both the inner attributes of objects and the outer relations between different objects. Existing image caption algorithms lack the ability to form global representations and are therefore ill-suited to summarizing complex scenes. To this end, we propose a pure Transformer architecture (CapFormer) for remote sensing image captioning. Specifically, a scalable vision Transformer is adopted for image representation, where the global content can be captured with multi-head self-attention layers. A Transformer decoder is designed to successively translate the image features into comprehensive sentences. The decoder explicitly models the previously generated words and interacts with the image features through cross-attention layers. Comprehensive and ablation experiments on the RSICD dataset demonstrate that CapFormer outperforms state-of-the-art image caption methods.
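The cross-attention step described above, where decoder word features query the encoder's image patch features, can be sketched as follows. This is a minimal single-head illustration with hypothetical dimensions (32-d word features, 48-d patch features), not the paper's actual CapFormer implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(words, patches, d_k=64, seed=0):
    """Single-head cross-attention: decoder word features act as
    queries over encoder image-patch features (keys and values),
    as in a Transformer decoder's cross-attention layer.
    Projection weights are random here for illustration only."""
    rng = np.random.default_rng(seed)
    d_w, d_p = words.shape[-1], patches.shape[-1]
    W_q = rng.standard_normal((d_w, d_k)) / np.sqrt(d_w)
    W_k = rng.standard_normal((d_p, d_k)) / np.sqrt(d_p)
    W_v = rng.standard_normal((d_p, d_k)) / np.sqrt(d_p)
    Q = words @ W_q            # (T, d_k) queries from decoded words
    K = patches @ W_k          # (N, d_k) keys from image patches
    V = patches @ W_v          # (N, d_k) values from image patches
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (T, N) attention map
    return attn @ V, attn

# toy example: 5 decoded words attending over 49 image patches
words = np.random.default_rng(1).standard_normal((5, 32))
patches = np.random.default_rng(2).standard_normal((49, 48))
out, attn = cross_attention(words, patches)
```

Each row of `attn` is a distribution over image patches, so every generated word can draw on the whole image rather than a local window, which is the global-representation property the abstract emphasizes.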