☆ 4.8 Article

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Journal

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Volume 45, Issue 10, Pages 11624-11641

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TPAMI.2023.3284038

Keywords

Visual question answering; causal inference; cross-modal reasoning; video event understanding

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

This study proposes a framework for cross-modal causal relational reasoning to address the limitations of existing visual question answering methods. By introducing causal intervention operations and combining modules such as causality-aware visual-linguistic reasoning, spatial-temporal transformer, and visual-linguistic feature fusion, the framework is able to discover visual-linguistic causal structures and achieve robust event-level visual question answering.

Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes that fail to capture event temporality, causality, and dynamics spanning over the video. In this work, to address the task of event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, a set of causal intervention operations is introduced to discover the underlying causal structures across visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), involves three modules: i) Causality-aware Visual-Linguistic Reasoning (CVLR) module for collaboratively disentangling the visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) Spatial-Temporal Transformer (STT) module for capturing the fine-grained interactions between visual and linguistic semantics; iii) Visual-Linguistic Feature Fusion (VLFF) module for learning the global semantic-aware visual-linguistic representations adaptively. Extensive experiments on four event-level datasets demonstrate the superiority of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering

Journal

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Publisher

IEEE COMPUTER SOC

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering

Journal

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Publisher

IEEE COMPUTER SOC

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper