Proceedings Paper

Multi-modal Scene Recognition Based on Global Self-attention Mechanism

Publisher

SPRINGER INTERNATIONAL PUBLISHING AG
DOI: 10.1007/978-3-031-20738-9_14

Keywords

Scene recognition; Multi-modal; Transformer; RGB-D


With the rapid development of deep neural networks and the emergence of multi-modal acquisition devices, multi-modal scene recognition based on deep neural networks has become a research hotspot. Given the variety of objects and the complex spatial layouts in scene images, as well as the complementarity of multi-modal data, this paper proposes an end-to-end trainable network based on the global self-attention mechanism for multi-modal scene recognition. The model, named MSR-Trans, mainly consists of two transformer-based branches that extract features from RGB images and depth data, respectively. A fusion layer then combines the two feature streams for final scene recognition. To further exploit the relationship between the modalities, lateral connections are added between the two branches at selected layers. In addition, a dropout layer is embedded in each transformer block to prevent overfitting. Extensive experiments on the SUN RGB-D and NYUD2 datasets show that the proposed method achieves multi-modal scene recognition accuracies of 69.0% and 74.1%, respectively.
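The architecture described above can be illustrated with a minimal sketch: two self-attention branches (RGB and depth), a lateral connection that injects depth features into the RGB branch, and a fusion layer that concatenates pooled branch features before classification. This is a hypothetical, numpy-only illustration of the general design, not the authors' implementation; all weight names, dimensions, and the single-layer/single-head simplification are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Global self-attention: every token attends to all tokens.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
dim, tokens, classes = 16, 8, 10  # toy sizes (assumptions)

# Hypothetical random weights for each branch and the classifier head
w_rgb = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
w_dep = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
w_cls = rng.standard_normal((2 * dim, classes)) * 0.1

# Stand-ins for patch embeddings of an RGB image and a depth map
rgb_tokens = rng.standard_normal((tokens, dim))
depth_tokens = rng.standard_normal((tokens, dim))

# Two transformer-style branches, one per modality
rgb_feat = self_attention(rgb_tokens, *w_rgb)
dep_feat = self_attention(depth_tokens, *w_dep)

# Lateral connection: depth features are added into the RGB branch
rgb_feat = rgb_feat + dep_feat

# Fusion layer: concatenate pooled branch features, then classify
fused = np.concatenate([rgb_feat.mean(axis=0), dep_feat.mean(axis=0)])
probs = softmax(fused @ w_cls)
print(probs.shape)  # one probability per scene class
```

In the paper the branches are full transformer stacks with dropout inside each block and lateral connections at several depths; the sketch collapses this to one attention layer per branch to show only the data flow.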

