Article

Encoder-decoder cycle for visual question answering based on perception-action cycle

Journal

PATTERN RECOGNITION
Volume 144, Issue -, Pages -

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2023.109848

Keywords

Visual question answering; Vision language tasks; Multi-modality fusion; Attention; Bilinear fusion; Brain-inspired frameworks

In this study, we propose a novel encoder-decoder cycle (EDC) framework, inspired by the perception-action cycle of human learning, to tackle challenging problems such as visual question answering (VQA) and visual relationship detection (VRD). EDC treats the understanding of an image's visual features as perception and the act of answering a question about that image as an action. In the perception-action cycle, information is first collected from the environment and passed to sensory structures in the brain to form an understanding of the environment. The acquired knowledge is then passed to motor structures to perform an action on the environment. Next, the sensory structures perceive the altered environment and improve their understanding of the surrounding world. This process of understanding the environment, performing a corresponding action, and then re-evaluating the initial understanding occurs cyclically throughout human life. EDC first mimics this mechanism of introspection by comprehending and refining visual features to acquire the knowledge needed to answer the question. It then decodes the visual and language features into answer features, which are fed back cyclically to the encoder. In the VRD task, EDC decodes visual features to generate predicate features. We evaluate the proposed framework on the TDIUC, VQA 2.0, and VRD datasets; it outperforms state-of-the-art models on TDIUC and VRD.
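To make the cyclic encode-decode-feedback flow described in the abstract more concrete, the sketch below shows one way such a loop could be wired up in PyTorch. This is only a minimal conceptual sketch, not the authors' implementation: the use of standard Transformer encoder/decoder layers, the feature dimension, the linear feedback projection, and the number of cycles are all assumptions made for illustration.

```python
# Minimal conceptual sketch of a cyclic encoder-decoder loop for VQA.
# Assumes pre-extracted image region features and an already-encoded question vector.
# NOT the paper's EDC implementation; module choices and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class EncoderDecoderCycleSketch(nn.Module):
    def __init__(self, dim=512, num_cycles=3, num_answers=3129):
        super().__init__()
        self.num_cycles = num_cycles
        # "Perception": refine visual features, conditioned on fed-back answer features.
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # "Action": decode visual and question features into answer features.
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.feedback = nn.Linear(dim, dim)      # hypothetical projection of answer features back to the encoder
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, visual, question):
        # visual:   (B, num_regions, dim)  pre-extracted object/region features
        # question: (B, 1, dim)            encoded question representation
        answer = question                         # initial "action" seed
        for _ in range(self.num_cycles):
            # Perception: re-encode visual features together with the fed-back answer features.
            enc_in = visual + self.feedback(answer).mean(dim=1, keepdim=True)
            encoded = self.encoder(enc_in)
            # Action: decode answer features by attending from the question to the visual memory.
            answer = self.decoder(question, encoded)
        return self.classifier(answer.squeeze(1))


# Usage with random tensors standing in for real features.
model = EncoderDecoderCycleSketch()
vis = torch.randn(2, 36, 512)   # 36 region features per image, a common VQA setup
qst = torch.randn(2, 1, 512)
logits = model(vis, qst)        # (2, num_answers)
```

The loop mirrors the perception-action cycle at a high level: each iteration re-encodes the visual features in light of the previous answer features (perception refined by introspection) before decoding a new answer representation (action).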
