4.7 Article

3D human pose estimation with cross-modality training and multi-scale local refinement

Journal

APPLIED SOFT COMPUTING
Volume 122, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.asoc.2022.108950

Keywords

3D human pose estimation; Deep convolutional neural network; Batch normalization; Normal vector; Multi-scale local refinement

Funding

  1. National Natural Science Foundation of China [61502187, 61876211]
  2. Equipment Pre-research Field Fund of China [61403120405]
  3. National Key Laboratory Open Fund of China [614211318 0211]
  4. Fundamental Research Funds for the Central Universities [2019kfyXKJC024]
  5. National Key R&D Program of China [2018YFB1004600]
  6. SERC Central Research Fund
  7. Singapore government's Research, and Innovation and Enterprise 2020 plan [A18A1b0045]

Ask authors/readers for more resources

This paper aims to address two major research problems in 3D human pose estimation using depth data. Firstly, it proposes a crossmodality CNN training strategy to transfer RGB annotation information to the depth domain. Secondly, it introduces a multi-scale local refinement network to effectively handle the optimal local observation scale.
This paper aims to study two major research problems for 3D human pose estimation using depth data. First we seek an effective way for applying the RGB pre-trained 2D CNN model to 3D pose field, so as to transfer large-scale RGB annotation information to depth domain. In particular, we proposed a crossmodality CNN training strategy, where the key idea is to set a partial Batch Normalization (BN) layer within the RGB pre-trained 2D CNN model to weaken the distribution divergence between the RGB and depth data during training. To involve richer 3D descriptive cues, the raw depth data is appended with the normal vector map. Albeit coarse-to-fine human pose estimation with local refinement is helpful to enhance performance. While the way for setting the optimal local observation scale is not well addressed. Towards this crucial problem, we propose to fuse the multi-scale local information jointly. A multi-scale local refinement network is proposed accordingly, where the small local region focuses on capturing the fine information. On the other hand, the large local region contains richer semantic contextual information. The experiments on two 3D human pose estimation datasets with depth data verify the effectiveness and real-time running capacity of our proposition. (c) 2022 Elsevier B.V. All rights reserved.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available