Journal
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
Volume 29, Issue 8, Pages 2405-2415
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCSVT.2018.2864148
Keywords
Action and activity recognition; video understanding; human analysis; visual attention
Funding
- New York State through Snap
- Cheetah Mobile
- NSF Award [1704309]
Abstract
Action recognition with 3D skeleton sequences has become popular due to its speed and robustness. Recently proposed convolutional neural network (CNN)-based methods show good performance in learning spatio-temporal representations for skeleton sequences. Despite the good recognition accuracy achieved by previous CNN-based methods, two problems potentially limit performance. First, previous skeleton representations are generated by chaining joints in a fixed order; the corresponding semantic meaning is unclear, and the structural information among the joints is lost. Second, previous models lack the ability to focus on informative joints. An attention mechanism is important for skeleton-based action recognition because different joints contribute unequally toward correct recognition. To solve these two problems, we propose a novel CNN-based method for skeleton-based action recognition. We first redesign the skeleton representations with a depth-first tree traversal order, which enhances the semantic meaning of skeleton images and better preserves the associated structural information. We then propose a general two-branch attention architecture that automatically focuses on spatio-temporal key stages and filters out unreliable joint predictions. Based on the proposed general architecture, we design a global long-sequence attention network with refined branch structures. Furthermore, to adjust the kernel's spatio-temporal aspect ratios and better capture long-term dependencies, we propose a sub-sequence attention network (SSAN) that takes sub-image sequences as inputs. We show that the two-branch attention architecture can be combined with the SSAN to further improve the performance. Our method outperforms the state of the art on the NTU RGB+D data set and the SBU Kinect Interaction data set. The model is further validated on noisy estimated poses from subsets of the UCF101 data set and the Kinetics data set.
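The depth-first tree traversal ordering mentioned in the abstract can be illustrated with a short sketch. The following is a minimal Python sketch, not the paper's implementation: the simplified joint tree, the joint indices, and the helper names (SKELETON_TREE, dfs_order, to_skeleton_image) are illustrative assumptions rather than the actual NTU RGB+D joint layout. It only shows the general idea that a traversal which revisits each parent while backtracking keeps physically connected joints in neighboring columns of the resulting skeleton image.

```python
import numpy as np

# Hypothetical, simplified 10-joint skeleton tree (parent -> children).
# Indices are illustrative, not the NTU RGB+D joint numbering.
SKELETON_TREE = {
    0: [1, 4, 7],      # spine -> neck, left hip, right hip
    1: [2, 3],         # neck  -> left shoulder, right shoulder
    4: [5], 5: [6],    # left hip -> left knee -> left ankle
    7: [8], 8: [9],    # right hip -> right knee -> right ankle
}

def dfs_order(tree, root=0):
    """Depth-first traversal that records a joint again each time the walk
    backtracks through it, so any two adjacent columns of the skeleton
    image correspond to physically connected joints."""
    order = []
    def visit(j):
        order.append(j)
        for child in tree.get(j, []):
            visit(child)
            order.append(j)  # revisit the parent on the way back
    visit(root)
    return order

def to_skeleton_image(seq, order):
    """seq: (T, J, 3) array of 3D joint coordinates over T frames.
    Returns a (T, len(order), 3) uint8 pseudo-image whose rows are frames
    and whose columns follow the tree-traversal joint order."""
    img = seq[:, order, :]
    # Map each coordinate channel to [0, 255] so it can be treated as pixels.
    mins = img.min(axis=(0, 1), keepdims=True)
    maxs = img.max(axis=(0, 1), keepdims=True)
    return (255 * (img - mins) / (maxs - mins + 1e-6)).astype(np.uint8)
```

A usage example under the same assumptions: for a 30-frame random stand-in sequence, `order = dfs_order(SKELETON_TREE)` yields [0, 1, 2, 1, 3, 1, 0, 4, 5, 6, 5, 4, 0, 7, 8, 9, 8, 7, 0], and `to_skeleton_image(np.random.rand(30, 10, 3), order)` produces a (30, 19, 3) image. Compared with chaining joints in a fixed order, every column boundary here crosses a real bone, which is the structural property the redesigned representation aims to preserve.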