Article

Open-ended remote sensing visual question answering with transformers

Journal

INTERNATIONAL JOURNAL OF REMOTE SENSING
Volume 43, Issue 18, Pages 6809-6823

Publisher

TAYLOR & FRANCIS LTD
DOI: 10.1080/01431161.2022.2145583

Keywords

Visual question answering; remote sensing; open-set dataset; vision transformers; encoder-decoder architecture

Funding

  1. King Saud University [RSP-2021/20]

This paper introduces a new dataset, VQA-TextRS, and an encoder-decoder architecture for open-ended remote sensing visual question answering. Using self-attention, visual and textual cues are extracted from the image and question, then fused through a cross-attention mechanism to generate the final answer.
Visual question answering (VQA) has recently been attracting attention in remote sensing. However, the proposed solutions remain rather limited in the sense that existing VQA datasets address closed-ended question-answer queries, which may not reflect real open-ended scenarios. In this paper, we propose a new dataset, VQA-TextRS, which was built manually with human annotations and covers various forms of open-ended question-answer pairs. Moreover, we propose an encoder-decoder architecture based on transformers, on account of their self-attention property, which allows relational learning across positions of the same sequence without typical recurrence operations. We employed vision and natural language processing (NLP) transformers to draw visual and textual cues from the image and question, respectively. We then applied a transformer decoder, whose cross-attention mechanism fuses the two modalities. The fused vectors drive the answer-generation process to produce the final output. We demonstrate that plausible results can be obtained in open-ended VQA; for instance, the proposed architecture achieves an accuracy of 84.01% on questions related to the presence of objects in the query images.
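The fusion step described in the abstract, where textual cues attend over visual cues via cross-attention, can be illustrated with a minimal sketch. The following is not the authors' implementation; it assumes question embeddings act as queries and image-patch embeddings act as keys and values, and shows single-head scaled dot-product cross-attention in plain Python:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: question-token embeddings (from the text encoder)
    keys/values: image-patch embeddings (from the vision encoder)
    Returns one fused vector per query token.
    """
    d = len(keys[0])  # key dimensionality, used for score scaling
    fused = []
    for q in queries:
        # Similarity of this question token to every image patch
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors -> fused representation
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Toy example: one question token, two image patches.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
question_tokens = [[1.0, 0.0]]
fused = cross_attention(question_tokens, image_tokens, image_tokens)
# fused[0][0] > fused[0][1]: the question token attends mostly
# to the image patch it aligns with.
```

In the paper's architecture this fusion happens inside a transformer decoder (with multiple heads, learned projections, and layer stacking); the sketch above only isolates the cross-attention idea that lets textual queries select relevant visual features before answer generation.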

