Article

Multi-scale relation reasoning for multi-modal Visual Question Answering

Journal

Signal Processing: Image Communication
Volume 96, Article 116319

Publisher

ELSEVIER
DOI: 10.1016/j.image.2021.116319

Keywords

Multi-modal data; Visual Question Answering; Multi-scale relation reasoning; Attention model

Funding

  1. National Key R&D Program of China [2018YFC0407901]
  2. Fundamental Research Funds for the Central Universities, China [B200202177]
  3. Natural Science Foundation of China [61702160]
  4. Natural Science Foundation of Jiangsu Province, China [BK20170892]


This paper proposes a deep neural network for multi-modal relation reasoning that combines a regional attention scheme with multi-scale feature fusion to accurately answer questions about images.
The goal of Visual Question Answering (VQA) is to answer questions about images. For the same picture, there are often completely different types of questions, so the main difficulty of the VQA task lies in how to properly reason about relationships among multiple visual objects according to the type of input question. To address this difficulty, this paper proposes a deep neural network that performs multi-modal relation reasoning at multiple scales and constructs a regional attention scheme to focus on informative, question-related regions for better answering. Specifically, we first design a regional attention scheme that selects regions of interest based on informativeness scores computed by a question-guided soft attention module. Afterwards, the features produced by the regional attention scheme are fused in scaled combinations, generating more distinctive features with scalable information. Owing to the regional attention design and the multi-scale property, the proposed method can describe scaled relationships from multi-modal inputs and offer accurate question-guided answers. Experiments on the VQA v1 and VQA v2 datasets show that the proposed method outperforms most existing methods.
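To make the abstract's pipeline concrete, the PyTorch sketch below shows one plausible way to combine question-guided soft attention, top-k regional selection, and multi-scale fusion. It is a minimal illustration, not the authors' published architecture: all module names, layer sizes, the choice of scales, and the top-k selection rule are assumptions made for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedRegionalAttention(nn.Module):
    """Hypothetical sketch: question-guided soft attention over image
    regions, regional selection, and multi-scale fusion of the attended
    features. Dimensions and scales are illustrative assumptions."""

    def __init__(self, region_dim=2048, question_dim=1024,
                 hidden_dim=512, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        # One fusion head per scale; outputs are concatenated at the end.
        self.fusion = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in scales
        )

    def forward(self, regions, question):
        # regions:  (B, N, region_dim), e.g. detected object features
        # question: (B, question_dim),  e.g. an RNN question encoding
        r = self.region_proj(regions)                    # (B, N, H)
        q = self.question_proj(question).unsqueeze(1)    # (B, 1, H)

        # Question-guided soft attention: score each region against
        # the question, then normalize the scores over regions.
        scores = self.attn_score(torch.tanh(r + q)).squeeze(-1)  # (B, N)
        weights = F.softmax(scores, dim=-1)                      # (B, N)

        # Regional selection at multiple scales: keep the top-k most
        # informative regions per scale, pool them by their attention
        # weights, and pass each pooled vector through its fusion head.
        fused = []
        for scale, head in zip(self.scales, self.fusion):
            k = max(1, regions.size(1) // scale)
            top_w, top_idx = weights.topk(k, dim=-1)             # (B, k)
            top_r = torch.gather(
                r, 1, top_idx.unsqueeze(-1).expand(-1, -1, r.size(-1))
            )                                                    # (B, k, H)
            pooled = (top_w.unsqueeze(-1) * top_r).sum(dim=1)    # (B, H)
            fused.append(head(pooled))
        return torch.cat(fused, dim=-1)  # (B, H * len(scales))

# Example usage with typical VQA feature shapes (36 regions per image):
model = QuestionGuidedRegionalAttention()
regions = torch.randn(8, 36, 2048)
question = torch.randn(8, 1024)
features = model(regions, question)  # shape: (8, 1536)

The concatenated multi-scale output would then feed an answer classifier; the paper's actual fusion and classification details should be taken from the article itself.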
