Article

Depth-Aware and Semantic Guided Relational Attention Network for Visual Question Answering

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 25, Pages 5344-5357

Publisher

IEEE
DOI: 10.1109/TMM.2022.3190686

Keywords

Depth estimation; multi-modal representation; relational reasoning; visual question answering


Summary: Previous visual relationship understanding models struggle with accurate spatial reasoning, so the authors propose DSGANet, a model that represents relations between objects in three-dimensional space and explicitly aligns language and visual relations to address these deficiencies. Experiments show that DSGANet achieves competitive performance on multiple benchmark datasets.
Visual relationship understanding plays an indispensable role in grounded language tasks like visual question answering (VQA), which often requires precise reasoning about relations among the objects referred to in a question. However, prior works generally suffer from two deficiencies: (1) spatial-relation inference ambiguity — it is challenging to accurately estimate the distance between a pair of visual objects in 2D space when their 2D bounding boxes overlap; and (2) missing language-visual relational alignment — a high-quality answer is hard to generate when the language-side and visual-side relations of objects are not aligned during fusion, even with a powerful fusion model like the Transformer. To this end, we first model the spatial relation of a pair of objects in 3D space by augmenting the original 2D bounding box with 1D depth information, and then propose a novel model, the Depth-aware Semantic Guided Relational Attention Network (DSGANet), which explicitly exploits the resulting 3D spatial relations of objects in an intra-/inter-modality manner for precise relational alignment. Extensive experiments on the VQA v2.0 and GQA benchmarks demonstrate that DSGANet achieves competitive performance against both pretrained and non-pretrained models, e.g., 72.7% vs. 74.6% based on learned grid features on VQA v2.0.
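The core geometric idea — lifting a 2D bounding box into 3D with a depth value so that overlapping boxes can still be distinguished — can be sketched as follows. This is a minimal illustration, not the authors' exact formulation: the `[x1, y1, x2, y2]` box format and the scalar per-object depth are assumptions.

```python
import math

def center_3d(box, depth):
    """Lift the center of a 2D box [x1, y1, x2, y2] to 3D using a depth value.

    In DSGANet-style reasoning the depth would come from a monocular
    depth estimator; here it is just a given scalar (an assumption).
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    return (cx, cy, depth)

def spatial_distance(box_a, depth_a, box_b, depth_b):
    """Euclidean distance between two objects in the augmented 3D space.

    Two boxes that overlap heavily in 2D (near-zero 2D center distance)
    can still be far apart once their depth difference is accounted for,
    which resolves the spatial-relation ambiguity described above.
    """
    ax, ay, az = center_3d(box_a, depth_a)
    bx, by, bz = center_3d(box_b, depth_b)
    return math.sqrt((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2)

# Two boxes with identical 2D extents but different (normalized) depths:
box = [10.0, 10.0, 50.0, 50.0]
print(spatial_distance(box, 0.2, box, 0.9))  # nonzero, thanks to the depth term
```

In 2D the two boxes above are indistinguishable (distance 0); only the depth term separates them, which is exactly the ambiguity the 3D augmentation is meant to resolve.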
