Article

Encoder-decoder cycle for visual question answering based on perception-action cycle

Journal

PATTERN RECOGNITION
Volume 144, Issue -, Pages -

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2023.109848

Keywords

Visual question answering; Vision language tasks; Multi-modality fusion; Attention; Bilinear fusion; Brain-inspired frameworks

In this study, we propose a novel encoder-decoder cycle (EDC) framework, inspired by the perception-action cycle of human learning, to tackle challenging problems such as visual question answering (VQA) and visual relationship detection (VRD). EDC treats the understanding of an image's visual features as perception and the act of answering a question about that image as an action. In the perception-action cycle, information is first collected from the environment and passed to sensory structures in the brain to form an understanding of the environment. The acquired knowledge is then passed to motor structures to perform an action on the environment. Next, the sensory structures perceive the altered environment and improve their understanding of the surrounding world. This process of understanding the environment, performing a corresponding action, and then re-evaluating the initial understanding occurs cyclically throughout human life. EDC first mimics this mechanism of introspection by comprehending and refining visual features to acquire the knowledge needed to answer the question. It then decodes the visual and language features into answer features, which are fed back cyclically to the encoder. In the VRD task, EDC decodes visual features to generate predicate features. We evaluate the proposed framework on the TDIUC, VQA 2.0, and VRD datasets; it outperforms state-of-the-art models on TDIUC and VRD.
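To make the cyclic encode-decode-feedback flow described in the abstract more concrete, the sketch below shows one way such a loop could be wired up in PyTorch. This is only a minimal conceptual sketch, not the authors' implementation: the use of standard Transformer encoder/decoder layers, the feature dimension, the linear feedback projection, and the number of cycles are all assumptions made for illustration.

```python
# Minimal conceptual sketch of a cyclic encoder-decoder loop for VQA.
# Assumes pre-extracted image region features and an already-encoded question vector.
# NOT the paper's EDC implementation; module choices and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class EncoderDecoderCycleSketch(nn.Module):
    def __init__(self, dim=512, num_cycles=3, num_answers=3129):
        super().__init__()
        self.num_cycles = num_cycles
        # "Perception": refine visual features, conditioned on fed-back answer features.
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # "Action": decode visual and question features into answer features.
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.feedback = nn.Linear(dim, dim)      # hypothetical projection of answer features back to the encoder
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, visual, question):
        # visual:   (B, num_regions, dim)  pre-extracted object/region features
        # question: (B, 1, dim)            encoded question representation
        answer = question                         # initial "action" seed
        for _ in range(self.num_cycles):
            # Perception: re-encode visual features together with the fed-back answer features.
            enc_in = visual + self.feedback(answer).mean(dim=1, keepdim=True)
            encoded = self.encoder(enc_in)
            # Action: decode answer features by attending from the question to the visual memory.
            answer = self.decoder(question, encoded)
        return self.classifier(answer.squeeze(1))


# Usage with random tensors standing in for real features.
model = EncoderDecoderCycleSketch()
vis = torch.randn(2, 36, 512)   # 36 region features per image, a common VQA setup
qst = torch.randn(2, 1, 512)
logits = model(vis, qst)        # (2, num_answers)
```

The loop mirrors the perception-action cycle at a high level: each iteration re-encodes the visual features in light of the previous answer features (perception refined by introspection) before decoding a new answer representation (action).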
