Article

What-Where-When Attention Network for video-based person re-identification

Journal

NEUROCOMPUTING
Volume 468, Pages 33-47

Publisher

ELSEVIER
DOI: 10.1016/j.neucom.2021.10.018

Keywords

Person re-identification; What-Where-When Attention; Spatial-temporal feature; Graph attention network; Attribute; Identity

Funding

  1. National Natural Science Foundation of China [61801437, 61871351, 61971381, 61461025, 61871259, 61811530325 (IECnNSFCn170396), 61861024]
  2. Natural Science Foundation of Shanxi Province [201801D221206, 201801D221207]
  3. Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi [2020L0683]
  4. Key research and development plan of Luliang City [2020GXZDYF21]

Abstract

Video-based person re-identification is critical in intelligent video surveillance, and existing methods often use attention mechanisms to address challenging variations. However, these methods mainly focus on occlusion and neglect other informative spatial regions and temporal cues in video frames. This paper proposes a comprehensive attention mechanism, the What-Where-When Attention Network (W3AN), which learns discriminative spatial-temporal features for person re-identification. Experimental results demonstrate the effectiveness of the W3AN model, and the contributions of its major modules are clarified in the discussion.
Video-based person re-identification plays a critical role in intelligent video surveillance by learning temporal correlations from consecutive video frames. Most existing methods aim to handle challenging variations in pose, occlusion, background, and so on by using attention mechanisms. Almost all of them concentrate on occlusion and learn occlusion-invariant video representations by discarding the occluded areas or frames, even though the remaining areas in those frames contain rich spatial information and temporal cues. To overcome these drawbacks, this paper proposes a comprehensive attention mechanism covering what, where, and when to pay attention in discriminative spatial-temporal feature learning, namely the What-Where-When Attention Network (W3AN). Concretely, W3AN designs a spatial attention module that focuses on pedestrian identity and salient attributes through an importance-estimating layer (What and Where), and a temporal attention module that calculates frame-level importance (When), which is embedded into a graph attention network to exploit temporal attention features rather than computing a weighted-average feature over video frames as existing methods do. Moreover, experiments on three widely recognized datasets demonstrate the effectiveness of the proposed W3AN model, and the discussion of the major modules elaborates the contributions of this paper. (c) 2021 Elsevier B.V. All rights reserved.
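The abstract contrasts the temporal attention module's graph-attention aggregation with the weighted-average pooling of earlier methods. A minimal sketch of that idea, using a single-head graph attention layer (in the style of Veličković et al.'s GAT) over per-frame features treated as nodes of a fully connected temporal graph; all shapes, names, and parameters here are illustrative assumptions, not the paper's actual W3AN implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_graph_attention(frame_feats, W, a):
    """Single-head graph-attention aggregation over T frame features.

    frame_feats: (T, D) per-frame features; each frame is a node in a
    fully connected temporal graph. W: (D, Dp) shared projection.
    a: (2*Dp,) attention vector. Instead of a single weighted average
    of frames, every frame attends to every other frame, so each node
    gets its own temporally refined feature.
    """
    h = frame_feats @ W                        # projected nodes, (T, Dp)
    dp = h.shape[1]
    # Pairwise logits e_ij = LeakyReLU(a^T [h_i || h_j]), computed by
    # splitting `a` into its "source" and "target" halves.
    left = h @ a[:dp]                          # (T,)
    right = h @ a[dp:]                         # (T,)
    e = left[:, None] + right[None, :]         # (T, T)
    e = np.where(e > 0, e, 0.2 * e)            # LeakyReLU, slope 0.2
    alpha = softmax(e, axis=1)                 # attention over neighbours
    return alpha @ h                           # refined node features, (T, Dp)

# Usage with random inputs (T=8 frames, D=Dp=16):
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))
W = rng.standard_normal((16, 16))
a = rng.standard_normal(32)
out = temporal_graph_attention(feats, W, a)
```

In a full pipeline the frame-level importance scores ("When") would modulate these attention coefficients before aggregation; the sketch above only shows the graph-attention step that replaces plain weighted averaging.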

