Article

Sign Language Recognition Based on R(2+1)D With Spatial-Temporal-Channel Attention

Journal

IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS
Volume 52, Issue 4, Pages 687-698

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/THMS.2022.3144000

Keywords

Convolution; Feature extraction; Videos; Hidden Markov models; Gesture recognition; Task analysis; Spatiotemporal phenomena; Attention mechanism; R(2+1)D; sign language recognition (SLR)

Funding

  1. National Natural Science Foundation of China [61973187, 61773239]
  2. Guangdong Province Basic and Applied Basic Research Fund Project [2019A1515110175]
  3. Research Grants Council of Hong Kong [21212720]

In this study, a deep R(2+1)D model was adopted for sign language recognition, which improved the optimization process by separating spatial and temporal modeling. Additionally, a lightweight spatial-temporal-channel attention module was proposed to concentrate the network on significant information. Experimental results demonstrated the effectiveness of the proposed method, achieving superior or comparable results to state-of-the-art methods.
Previous work utilized three-dimensional (3-D) convolutional neural networks (CNNs) to model the spatial appearance and temporal evolution concurrently for sign language recognition (SLR) and exhibited impressive performance. However, challenges remain for 3-D CNN-based methods. First, motion information plays a more significant role than spatial content in sign language, so it is questionable whether space and time should be treated equally and modeled jointly by heavy 3-D convolutions in a unified approach. Second, because of interference from the highly redundant information in sign videos, it is still nontrivial to effectively extract discriminative spatiotemporal features related to sign language. In this study, deep R(2+1)D was adopted for separate spatial and temporal modeling, and it was demonstrated that decomposing 3-D convolution filters into independent spatial and temporal convolutions facilitates the optimization process in SLR. A lightweight spatial-temporal-channel attention module, comprising two submodules called channel-temporal attention and spatial-temporal attention, was proposed to make the network concentrate on the significant information along the spatial, temporal, and channel dimensions by combining squeeze-and-excitation attention with self-attention. By embedding this module into R(2+1)D, superior or comparable results to the state-of-the-art methods on the CSL-500, Jester, and EgoGesture datasets were obtained, which demonstrated the effectiveness of the proposed method.
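The (2+1)D decomposition the abstract refers to originates from the R(2+1)D architecture (Tran et al., CVPR 2018): each t x d x d 3-D convolution is replaced by a 1 x d x d spatial convolution into M intermediate channels followed by a t x 1 x 1 temporal convolution, with M chosen so the factorized block's parameter count roughly matches the original. The sketch below, a simplified illustration rather than the authors' implementation, computes that parameter-matching rule:

```python
def r2plus1d_midplanes(n_in: int, n_out: int, t: int = 3, d: int = 3) -> int:
    """Intermediate channel count M used by R(2+1)D (Tran et al., 2018) so the
    factorized (2+1)D block has roughly the same parameters as a full
    t x d x d 3-D convolution from n_in to n_out channels."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_3d(n_in: int, n_out: int, t: int = 3, d: int = 3) -> int:
    # Full 3-D convolution: one t x d x d kernel per (input, output) channel pair.
    return t * d * d * n_in * n_out

def params_2plus1d(n_in: int, n_out: int, t: int = 3, d: int = 3) -> int:
    # Spatial 1 x d x d conv into M channels, then temporal t x 1 x 1 conv to n_out.
    m = r2plus1d_midplanes(n_in, n_out, t, d)
    return d * d * n_in * m + t * m * n_out

if __name__ == "__main__":
    # For a 3 x 3 x 3 kernel with 64 input and 64 output channels, M = 144 and
    # the two parameter counts match exactly.
    print(r2plus1d_midplanes(64, 64), params_3d(64, 64), params_2plus1d(64, 64))
```

Because the intermediate channels sit between a nonlinearity-separated spatial and temporal convolution, the factorization doubles the number of nonlinearities per block at matched capacity, which is one reason the abstract credits it with an easier optimization process.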
