Proceedings Paper

Multi-modal Scene Recognition Based on Global Self-attention Mechanism

Publisher

SPRINGER INTERNATIONAL PUBLISHING AG
DOI: 10.1007/978-3-031-20738-9_14

Keywords

Scene recognition; Multi-modal; Transformer; RGB-D


With the rapid development of deep neural networks and the emergence of multi-modal acquisition devices, multi-modal scene recognition based on deep neural networks has become a research hotspot. Given the variety of objects and the complex spatial layouts in scene images, as well as the complementarity of multi-modal data, this paper proposes an end-to-end trainable network based on the global self-attention mechanism for multi-modal scene recognition. The model, named MSR-Trans, mainly consists of two transformer-based branches that extract features from RGB images and depth data, respectively. A fusion layer then combines the two feature streams for final scene recognition. To further exploit the relationship between the modalities, lateral connections are added between the two branches at selected layers. In addition, a dropout layer is embedded in each transformer block to prevent overfitting. Extensive experiments on the SUN RGB-D and NYUD2 datasets show that the proposed method achieves multi-modal scene recognition accuracies of 69.0% and 74.1%, respectively.
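The architecture described above can be illustrated with a minimal sketch: two self-attention branches (RGB and depth), a lateral connection that injects depth features into the RGB branch, and a fusion layer that concatenates pooled branch features before classification. This is a hypothetical, numpy-only illustration of the general design, not the authors' implementation; all weight names, dimensions, and the single-layer/single-head simplification are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Global self-attention: every token attends to all tokens.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
dim, tokens, classes = 16, 8, 10  # toy sizes (assumptions)

# Hypothetical random weights for each branch and the classifier head
w_rgb = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
w_dep = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
w_cls = rng.standard_normal((2 * dim, classes)) * 0.1

# Stand-ins for patch embeddings of an RGB image and a depth map
rgb_tokens = rng.standard_normal((tokens, dim))
depth_tokens = rng.standard_normal((tokens, dim))

# Two transformer-style branches, one per modality
rgb_feat = self_attention(rgb_tokens, *w_rgb)
dep_feat = self_attention(depth_tokens, *w_dep)

# Lateral connection: depth features are added into the RGB branch
rgb_feat = rgb_feat + dep_feat

# Fusion layer: concatenate pooled branch features, then classify
fused = np.concatenate([rgb_feat.mean(axis=0), dep_feat.mean(axis=0)])
probs = softmax(fused @ w_cls)
print(probs.shape)  # one probability per scene class
```

In the paper the branches are full transformer stacks with dropout inside each block and lateral connections at several depths; the sketch collapses this to one attention layer per branch to show only the data flow.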

