Article

Pose focus transformer meet inter-part relation

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 240

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2023.122476

Keywords

Human pose estimation; Crowded scene; Inter-part relation; Transformer

Human pose estimation in crowded scenes is challenging because of overlap and occlusion. We propose PFFormer, a transformer-based approach that treats pose estimation as a hierarchical set prediction problem. PFFormer first focuses on human windows and coarsely predicts whole-body poses globally within them. It uses a Windows Clustering Transformer and a global transformer to suppress background interference, and an Inter-Part Relation Module to capture the correlation between body parts. Experimental results demonstrate the robustness of PFFormer to occlusion in crowded scenes.
Human pose estimation in crowded scenes is a challenging task: because of overlap and occlusion, it is difficult to infer pose clues from individual keypoints. We propose PFFormer, a new transformer-based approach that treats pose estimation as a hierarchical set prediction problem, first focusing on human windows and coarsely predicting whole-body poses globally within them. In PFFormer, we design a Windows Clustering Transformer (WCT) that reorganizes the image windows by retaining the attentive windows and fusing the inattentive ones. This lets the transformer focus on the important regions while reducing interference from the complex background; a global transformer then compensates for the resulting loss of information. We then partition the learned body pose into a set of structural parts and apply the Inter-Part Relation Module (IPRM) to capture the correlation between multiple parts. The full-body poses and part features are refined at a finer level by the Part-to-Joint Decoder (PJD). Extensive experiments show that PFFormer performs favorably against its counterparts on challenging datasets, including COCO2017, CrowdPose, and OCHuman. Its performance on crowded scenes, in particular, demonstrates the robustness of the proposed method to occlusion.
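
The window reorganization step described for the WCT (keep the attentive windows, fuse the inattentive ones) can be illustrated with a minimal sketch. The abstract does not say how windows are scored or fused, so everything here is an assumption made for illustration: the function name reorganize_windows, the keep parameter, the use of one non-negative attention score per window, and score-weighted averaging as the fusion rule. It is not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def reorganize_windows(
    tokens: Sequence[Vector],
    scores: Sequence[float],
    keep: int,
) -> Tuple[List[Vector], List[int]]:
    """Keep the `keep` highest-scoring window tokens and fuse the rest into one.

    Hypothetical helper: `scores` is assumed to hold one non-negative
    attention score per window. Returns the reorganized tokens and the
    indices of the windows that were kept.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and the number of tokens")

    # Rank windows by score, then restore spatial order among the kept ones.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    dropped = ranked[keep:]

    out = [list(tokens[i]) for i in kept]
    if dropped:
        # Assumed fusion rule: score-weighted average of the inattentive
        # windows, falling back to a plain mean if all their scores are zero.
        total = sum(scores[i] for i in dropped)
        if total > 0:
            weights = [scores[i] / total for i in dropped]
        else:
            weights = [1.0 / len(dropped)] * len(dropped)
        dim = len(tokens[0])
        fused = [
            sum(w * tokens[i][d] for w, i in zip(weights, dropped))
            for d in range(dim)
        ]
        out.append(fused)
    return out, kept
```

With four windows and keep=2, the output holds the two attentive tokens followed by a single fused token, so later attention layers operate on three tokens instead of four. According to the abstract, a global transformer then compensates for the information lost in this step.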

