☆ 4.6 Article

Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS (2022)

期刊

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS

卷 18, 期 3, 页码 -

出版社

ASSOC COMPUTING MACHINERY

DOI: 10.1145/3491228

关键词

Action recognition; neural networks; attention; multi-modality; feature fusion

类别

Computer Science, Information Systems Computer Science, Software Engineering Computer Science, Theory & Methods

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

智能总结 New
摘要

Action recognition is a highly relevant topic in computer vision with wide applications in vision systems. In this study, a multi-modality feature fusion network is proposed to combine the modalities of skeleton sequence and RGB frame, which reduces the complexity while retaining complementary information. The network employs a two-stage fusion framework to explore the correspondence between the two modalities. Experimental results on two benchmarks demonstrate competitive performance compared with state-of-the-art methods.

Action recognition has been a heated topic in computer vision for its wide application in vision systems. Previous approaches achieve improvement by fusing the modalities of the skeleton sequence and RGB video. However, such methods pose a dilemma between the accuracy and efficiency for the high complexity of the RGB video network. To solve the problem, we propose a multi-modality feature fusion network to combine the modalities of the skeleton sequence and RGB frame instead of the RGB video, as the key information contained by the combination of the skeleton sequence and RGB frame is close to that of the skeleton sequence and RGB video. In this way, complementary information is retained while the complexity is reduced by a large margin. To better explore the correspondence of the two modalities, a two-stage fusion framework is introduced in the network. In the early fusion stage, we introduce a skeleton attention module that projects the skeleton sequence on the single RGB frame to help the RGB frame focus on the limb movement regions. In the late fusion stage, we propose a cross-attention module to fuse the skeleton feature and the RGB feature by exploiting the correlation. Experiments on two benchmarks, NTU RGB+D and SYSU, show that the proposed model achieves competitive performance compared with the state-of-the-art methods while reducing the complexity of the network.

Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition

期刊

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS

出版社

ASSOC COMPUTING MACHINERY

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition

期刊

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS

出版社

ASSOC COMPUTING MACHINERY

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文