Article

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

Journal

REMOTE SENSING
Volume 15, Issue 3, Article 579

Publisher

MDPI
DOI: 10.3390/rs15030579

Keywords

remote sensing image captioning; cross-modal interaction; attention mechanism; semantic information; encoder-decoder

Abstract

The aim of remote sensing image captioning (RSIC) is to describe a given remote sensing image (RSI) in coherent sentences. Most existing attention-based methods model this coherence through an LSTM-based decoder, which dynamically infers a word vector from the preceding sentences. However, these methods are guided only indirectly and suffer from confusion among attentive regions, as (1) the weighted averaging in the attention mechanism distracts the word vector from the pertinent visual regions, and (2) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a multi-source interactive stair attention mechanism that separately models the semantics of the preceding sentences and the visual regions of interest. Specifically, the multi-source interaction takes previous semantic vectors as queries and applies an attention mechanism to the regional features to acquire the next word vector, which reduces immediate hesitation by taking linguistic context into account. The stair attention divides the attentive weights into three levels, namely the core region, the surrounding region, and other regions, so that all regions in the search scope receive differentiated attention. A CIDEr-based reward reinforcement learning scheme is then devised to enhance the quality of the generated sentences. Comprehensive experiments on widely used benchmarks (i.e., the Sydney-Captions, UCM-Captions, and RSICD data sets) demonstrate the superiority of the proposed model over state-of-the-art models in terms of coherence, while maintaining high accuracy.
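To make the mechanism concrete, the following PyTorch sketch gives one minimal, hypothetical reading of the abstract, not the authors' implementation. It assumes regional features from a CNN encoder of shape (batch, regions, dim), uses the previous decoder state as the semantic query (the multi-source interaction), and forms the three stair levels by boosting the top-scoring (core) region, moderately weighting its k runner-ups (surrounding), and damping the rest before renormalizing; the module name, the level gains, and k are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StairAttention(nn.Module):
    """Attention whose weights are grouped into three 'stair' levels.

    Assumption: levels are core / surrounding / other, formed by ranking
    the base attention weights; the gains below are illustrative.
    """

    def __init__(self, feat_dim, hidden_dim, attn_dim, k_surround=3,
                 level_gains=(1.0, 0.5, 0.1)):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.query_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        self.k_surround = k_surround
        self.level_gains = level_gains  # gains for core / surrounding / other

    def forward(self, regions, query):
        # regions: (B, N, feat_dim); query: (B, hidden_dim) -- the previous
        # semantic vector used as the attention query.
        e = self.score(torch.tanh(
            self.feat_proj(regions) + self.query_proj(query).unsqueeze(1)
        )).squeeze(-1)                       # (B, N) raw scores
        alpha = F.softmax(e, dim=-1)         # (B, N) base attention weights

        # Assign a gain per region according to its rank: core gets the
        # highest gain, the next k_surround get the middle gain, the rest
        # get the lowest gain.
        order = alpha.argsort(dim=-1, descending=True)
        gains = torch.full_like(alpha, self.level_gains[2])       # "other"
        surround = order[:, 1:1 + self.k_surround]
        gains.scatter_(1, surround,
                       torch.full_like(surround, self.level_gains[1],
                                       dtype=alpha.dtype))
        core = order[:, :1]
        gains.scatter_(1, core,
                       torch.full_like(core, self.level_gains[0],
                                       dtype=alpha.dtype))

        stair = alpha * gains
        stair = stair / stair.sum(dim=-1, keepdim=True)           # renormalize
        context = torch.bmm(stair.unsqueeze(1), regions).squeeze(1)
        return context, stair

# Usage with hypothetical shapes: 7x7 grid features from a CNN encoder and a
# 512-dimensional previous decoder state.
attn = StairAttention(feat_dim=2048, hidden_dim=512, attn_dim=512)
regions = torch.randn(2, 49, 2048)
h_prev = torch.randn(2, 512)
ctx, weights = attn(regions, h_prev)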
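For the CIDEr-based reward, the abstract does not spell out the training procedure; a common instantiation of such a reward is self-critical sequence training (SCST), where the reward is the CIDEr score of a sampled caption minus that of a greedy-decoded baseline. The sketch below follows that pattern and should be read as an assumption about the setup, with cider_score a named placeholder rather than a real CIDEr implementation (in practice one would call, e.g., the coco-caption toolkit).

import torch

def cider_score(captions, references):
    """Placeholder returning one score per caption; replace with real CIDEr."""
    return torch.rand(len(captions))

def scst_loss(log_probs, sampled_caps, greedy_caps, references):
    # log_probs: (B,) summed log-probabilities of the sampled captions.
    with torch.no_grad():
        reward = cider_score(sampled_caps, references)    # sampled reward
        baseline = cider_score(greedy_caps, references)   # greedy baseline
        advantage = reward - baseline                     # self-critical term
    # REINFORCE: maximize the advantage-weighted log-likelihood.
    return -(advantage * log_probs).mean()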
