Proceedings Paper

Multi-modal Scene Recognition Based on Global Self-attention Mechanism

Publisher

SPRINGER INTERNATIONAL PUBLISHING AG
DOI: 10.1007/978-3-031-20738-9_14

Keywords

Scene recognition; Multi-modal; Transformer; RGB-D

Abstract

With the rapid development of deep neural networks and the emergence of multi-modal acquisition devices, multi-modal scene recognition based on deep neural networks has become a research hotspot. Given the wide variety of objects and the complex spatial layouts in scene images, as well as the complementarity of multi-modal data, this paper proposes an end-to-end trainable network model based on the global self-attention mechanism for multi-modal scene recognition. The model, named MSR-Trans, consists mainly of two transformer-based branches that extract features from the RGB image and the depth data, respectively; a fusion layer then combines the two features for final scene recognition. To further exploit the relationship between the modalities, lateral connections are added between some layers of the two branches, and a dropout layer is embedded in the transformer block to prevent overfitting. Extensive experiments on the SUN RGB-D and NYUD2 datasets show that the proposed method achieves multi-modal scene recognition accuracies of 69.0% and 74.1%, respectively.
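The paper's own implementation is not reproduced here, but the abstract's architectural description (two transformer branches, lateral connections at selected depths, a fusion layer, dropout inside the transformer block) can be sketched as follows. In this minimal PyTorch sketch, the embedding size, block depth, the positions and additive form of the lateral connections, the concatenate-then-project fusion layer, the three-channel depth encoding, and the 19-way classifier head are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a two-branch, transformer-based RGB-D recognizer in the
# spirit of MSR-Trans. All hyperparameters and wiring choices are assumptions.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Pre-norm self-attention block with an embedded dropout layer,
    mirroring the overfitting countermeasure the abstract mentions."""
    def __init__(self, dim=384, heads=6, p_drop=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=p_drop, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(),
            nn.Dropout(p_drop),                  # dropout embedded in the block
            nn.Linear(dim * 4, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class MSRTransSketch(nn.Module):
    """Two transformer branches (RGB, depth) with additive lateral
    connections at chosen depths, then a fusion layer and classifier."""
    def __init__(self, dim=384, depth=6, lateral_at=(2, 4), n_classes=19):
        super().__init__()
        # 16x16 patch embeddings; depth assumed pre-encoded as 3 channels (e.g. HHA)
        self.embed_rgb = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.embed_dep = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.rgb_blocks = nn.ModuleList([TransformerBlock(dim) for _ in range(depth)])
        self.dep_blocks = nn.ModuleList([TransformerBlock(dim) for _ in range(depth)])
        self.lateral_at = set(lateral_at)
        self.fuse = nn.Linear(2 * dim, dim)      # fusion layer (concat + project)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, rgb, dep):
        r = self.embed_rgb(rgb).flatten(2).transpose(1, 2)  # (B, N, dim) tokens
        d = self.embed_dep(dep).flatten(2).transpose(1, 2)
        for i, (br, bd) in enumerate(zip(self.rgb_blocks, self.dep_blocks)):
            r, d = br(r), bd(d)
            if i in self.lateral_at:             # lateral exchange between branches
                r, d = r + d, d + r
        feat = torch.cat([r.mean(1), d.mean(1)], dim=-1)    # pooled multi-modal feature
        return self.head(self.fuse(feat))


# Example: one forward pass on random 224x224 RGB-D input
model = MSRTransSketch()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 19])
```

The additive exchange used here is only one plausible form of lateral connection; the abstract does not specify whether the authors use summation, concatenation, or an attention-based exchange at those layers.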
