Journal
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
Volume 31, Issue 1, Pages 160-174
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCSVT.2020.2965574
Keywords
Cross-view action recognition; human skeleton; self-similarity; multi-stream neural network; view-invariant representation
Funding
- National Natural Science Foundation of China [61603341, 61976191, 61873220, 61773272, 61876168]
- Zhejiang Provincial Natural Science Foundation of China [LY19F030015]
- Post-doctoral Fellowship from China Scholarship Council
- Mitacs Globalink Early Career Fellowship
This paper proposes a method for creating view-invariant action descriptions from skeletal self-similarities, learned with a multi-stream neural network. By integrating skeletal self-similarities at multiple scales into the network, the method achieves good robustness to view changes.
Existing research in vision-based action recognition generally focuses on recognizing actions from the same viewpoints seen in the training data. A major challenge in action recognition is the large variation in action representations when actions are captured from very different viewpoints. This paper addresses the problem by learning view-invariant representations from skeletal self-similarities of varying scales with a lightweight multi-stream neural network (MSNN). Since human skeletons have proven to be an effective and easily obtained feature modality for action recognition, we first create a view-invariant action description by formulating the skeletal self-similarities at each frame as an image (SSI), which exhibits high structural stability under view changes. Accordingly, an MSNN is designed based on 3D CNN and LSTM units to learn representations from SSIs at multiple scales, where the multi-scale scheme gives our method good robustness to view changes. In addition, owing to its simplicity, we integrate the computation of SSIs into the MSNN by wrapping it as a custom learnable layer, rather than normalizing and transforming skeletons with hand-crafted preprocessing. Extensive experimental evaluations on three challenging cross-view datasets demonstrate the effectiveness of the proposed method, which outperforms state-of-the-art algorithms on cross-view recognition. The source code of this work will be released shortly to facilitate future studies in this field.
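The core intuition behind the SSI descriptor is that pairwise distances between skeleton joints are unchanged by rigid camera rotations and translations. The sketch below illustrates this property with a per-frame self-similarity matrix built from pairwise Euclidean distances; it is an assumption-laden simplification (joint count, distance metric, and the rotation check are illustrative), not the paper's learnable-layer implementation.

```python
import numpy as np

def self_similarity_image(joints):
    """Pairwise Euclidean distances between skeleton joints for one frame.

    `joints` is an (N, 3) array of 3D joint coordinates. The resulting
    N x N matrix is invariant to rigid rotation and translation of the
    skeleton, which is the property the SSI descriptor relies on.
    (Illustrative sketch only; the paper wraps SSI computation in a
    custom learnable layer inside the MSNN.)
    """
    diff = joints[:, None, :] - joints[None, :, :]   # (N, N, 3) pairwise offsets
    return np.linalg.norm(diff, axis=-1)             # (N, N) distance image

# Check: rotating and translating the skeleton leaves the SSI unchanged.
rng = np.random.default_rng(0)
skel = rng.standard_normal((15, 3))                  # hypothetical 15-joint skeleton
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
ssi_a = self_similarity_image(skel)
ssi_b = self_similarity_image(skel @ R.T + np.array([1.0, -2.0, 0.5]))
assert np.allclose(ssi_a, ssi_b)
```

A stack of such per-frame images over time forms the input that the multi-scale 3D CNN/LSTM streams would consume.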