Article

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

Journal

REMOTE SENSING
Volume 15, Issue 3, Article 579

Publisher

MDPI
DOI: 10.3390/rs15030579

Keywords

remote sensing image captioning; cross-modal interaction; attention mechanism; semantic information; encoder-decoder

Abstract

The aim of remote sensing image captioning (RSIC) is to describe a given remote sensing image (RSI) in coherent sentences. Most existing attention-based methods model this coherence through an LSTM-based decoder, which dynamically infers a word vector from the preceding sentences. However, these methods are guided only indirectly and suffer from confusion among attentive regions, as (1) the weighted averaging in the attention mechanism distracts the word vector from the pertinent visual regions, and (2) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a multi-source interactive stair attention mechanism that separately models the semantics of the preceding sentences and the visual regions of interest. Specifically, the multi-source interaction takes previous semantic vectors as queries and applies an attention mechanism to the regional features to acquire the next word vector, which reduces immediate hesitation by taking linguistic context into account. The stair attention divides the attentive weights into three levels, namely the core region, the surrounding region, and other regions, so that all regions in the search scope receive differentiated attention. A CIDEr-based reward reinforcement learning scheme is then devised to enhance the quality of the generated sentences. Comprehensive experiments on widely used benchmarks (i.e., the Sydney-Captions, UCM-Captions, and RSICD data sets) demonstrate the superiority of the proposed model over state-of-the-art models in terms of coherence, while maintaining high accuracy.
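To make the mechanism concrete, the following PyTorch sketch gives one minimal, hypothetical reading of the abstract, not the authors' implementation. It assumes regional features from a CNN encoder of shape (batch, regions, dim), uses the previous decoder state as the semantic query (the multi-source interaction), and forms the three stair levels by boosting the top-scoring (core) region, moderately weighting its k runner-ups (surrounding), and damping the rest before renormalizing; the module name, the level gains, and k are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StairAttention(nn.Module):
    """Attention whose weights are grouped into three 'stair' levels.

    Assumption: levels are core / surrounding / other, formed by ranking
    the base attention weights; the gains below are illustrative.
    """

    def __init__(self, feat_dim, hidden_dim, attn_dim, k_surround=3,
                 level_gains=(1.0, 0.5, 0.1)):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.query_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        self.k_surround = k_surround
        self.level_gains = level_gains  # gains for core / surrounding / other

    def forward(self, regions, query):
        # regions: (B, N, feat_dim); query: (B, hidden_dim) -- the previous
        # semantic vector used as the attention query.
        e = self.score(torch.tanh(
            self.feat_proj(regions) + self.query_proj(query).unsqueeze(1)
        )).squeeze(-1)                       # (B, N) raw scores
        alpha = F.softmax(e, dim=-1)         # (B, N) base attention weights

        # Assign a gain per region according to its rank: core gets the
        # highest gain, the next k_surround get the middle gain, the rest
        # get the lowest gain.
        order = alpha.argsort(dim=-1, descending=True)
        gains = torch.full_like(alpha, self.level_gains[2])       # "other"
        surround = order[:, 1:1 + self.k_surround]
        gains.scatter_(1, surround,
                       torch.full_like(surround, self.level_gains[1],
                                       dtype=alpha.dtype))
        core = order[:, :1]
        gains.scatter_(1, core,
                       torch.full_like(core, self.level_gains[0],
                                       dtype=alpha.dtype))

        stair = alpha * gains
        stair = stair / stair.sum(dim=-1, keepdim=True)           # renormalize
        context = torch.bmm(stair.unsqueeze(1), regions).squeeze(1)
        return context, stair

# Usage with hypothetical shapes: 7x7 grid features from a CNN encoder and a
# 512-dimensional previous decoder state.
attn = StairAttention(feat_dim=2048, hidden_dim=512, attn_dim=512)
regions = torch.randn(2, 49, 2048)
h_prev = torch.randn(2, 512)
ctx, weights = attn(regions, h_prev)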
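For the CIDEr-based reward, the abstract does not spell out the training procedure; a common instantiation of such a reward is self-critical sequence training (SCST), where the reward is the CIDEr score of a sampled caption minus that of a greedy-decoded baseline. The sketch below follows that pattern and should be read as an assumption about the setup, with cider_score a named placeholder rather than a real CIDEr implementation (in practice one would call, e.g., the coco-caption toolkit).

import torch

def cider_score(captions, references):
    """Placeholder returning one score per caption; replace with real CIDEr."""
    return torch.rand(len(captions))

def scst_loss(log_probs, sampled_caps, greedy_caps, references):
    # log_probs: (B,) summed log-probabilities of the sampled captions.
    with torch.no_grad():
        reward = cider_score(sampled_caps, references)    # sampled reward
        baseline = cider_score(greedy_caps, references)   # greedy baseline
        advantage = reward - baseline                     # self-critical term
    # REINFORCE: maximize the advantage-weighted log-likelihood.
    return -(advantage * log_probs).mean()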
