Article

Learning Representations From Skeletal Self-Similarities for Cross-View Action Recognition

Publisher

IEEE - Institute of Electrical and Electronics Engineers, Inc.
DOI: 10.1109/TCSVT.2020.2965574

Keywords

Cross-view action recognition; human skeleton; self-similarity; multi-stream neural network; view-invariant representation

Funding

  1. National Natural Science Foundation of China [61603341, 61976191, 61873220, 61773272, 61876168]
  2. Zhejiang Provincial Natural Science Foundation of China [LY19F030015]
  3. Post-doctoral Fellowship from China Scholarship Council
  4. Mitacs Globalink Early Career Fellowship

Abstract

This paper proposes a method for creating view-invariant action descriptions from skeletal self-similarities, learned with a multi-stream neural network. By integrating skeletal self-similarities of different scales into the network, the method achieves good robustness to view changes.
Existing research in vision-based action recognition generally focuses on recognizing actions from the same views as those seen in the training data. One of the major challenges in action recognition lies in the large variation of action representations when actions are captured from very different viewpoints. This paper addresses this problem by learning view-invariant representations from skeletal self-similarities of varying scales with a lightweight multi-stream neural network (MSNN). Since human skeletons have proven to be an effective and easily obtained feature modality for action recognition, we first create a view-invariant action description by formulating the skeletal self-similarities at each frame as an image (SSI), which exhibits high structural stability under view changes. Accordingly, an MSNN based on 3D CNN and LSTM units is designed to learn representations from SSIs of multiple scales, where the multi-scale scheme gives our method good robustness to view changes. In addition, thanks to its simplicity, we integrate the computation of SSIs into the MSNN by wrapping it as a custom learnable layer, instead of normalizing and transforming skeletons with hand-crafted preprocessing. Extensive experimental evaluations on three challenging cross-view datasets demonstrate the effectiveness of the proposed method, which outperforms state-of-the-art algorithms on cross-view recognition. The source code of this work will be released shortly to facilitate future studies in this field.
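As a rough illustration of the idea, the sketch below is hypothetical code, not the authors' released implementation. It assumes the per-frame SSI is the matrix of pairwise Euclidean distances between 3D joint coordinates (which is invariant to camera rotation and translation) and wires it into a single schematic 3D-CNN + LSTM stream; the names SelfSimilarityImage and SSIStream and all layer sizes are illustrative assumptions, not the paper's MSNN.

import torch
import torch.nn as nn

class SelfSimilarityImage(nn.Module):
    # Maps (batch, frames, joints, 3) skeletons to (batch, 1, frames, joints, joints) SSIs.
    # Assumption: SSI = pairwise Euclidean distances between joints at each frame.
    def forward(self, skeletons):
        b, t, j, _ = skeletons.shape
        flat = skeletons.reshape(b * t, j, 3)
        ssi = torch.cdist(flat, flat)                 # pairwise joint distances per frame
        return ssi.reshape(b, t, j, j).unsqueeze(1)   # add a channel dim for the 3D convs

class SSIStream(nn.Module):
    # One schematic stream: 3D convolutions over the SSI volume, then an LSTM
    # over the temporal axis; a stand-in for one branch of the MSNN, not its exact design.
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        self.ssi = SelfSimilarityImage()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 8, 8)),       # keep the time axis, pool space
        )
        self.lstm = nn.LSTM(16 * 8 * 8, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, skeletons):
        x = self.conv(self.ssi(skeletons))            # (b, 16, t, 8, 8)
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])                    # classify from the last time step

# Toy usage: 25-joint skeletons, 32 frames, 60 action classes.
logits = SSIStream(num_classes=60)(torch.randn(4, 32, 25, 3))
print(logits.shape)                                   # torch.Size([4, 60])

In the full method, several such streams would consume self-similarities at different scales and their outputs would be fused; the exact multi-scale construction and fusion follow the paper, not this sketch.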
