Article

Mask-guided explicit feature modulation for multispectral pedestrian detection

Journal

COMPUTERS & ELECTRICAL ENGINEERING
Volume 103

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.compeleceng.2022.108385

Keywords

Multispectral pedestrian detection; Feature modulation; Mutual attention; Score fusion

Funding

  1. NSF of China [61903164, 62173186]
  2. NSF of Jiangsu Province in China [BK20191427]
  3. National Key Research and Development Program of China [2021YFC2802002]

This paper proposes a novel explicit feature modulation solution for multi-task learning of object detection and box-level segmentation. The solution comprises semantic feature enhancement of the backbone and confidence score enhancement of the detection head. Experimental results show that the modulation improves detection performance, with gains over the state of the art on the KAIST, CVC14 and FLIR datasets.
Multi-task learning of object detection and box-level segmentation is commonly formulated as implicit feature modulation, which suffers from a lack of task interaction. This paper proposes a novel explicit feature modulation solution with two mask infusion methods. The first is semantic feature enhancement of the backbone, achieved by a novel mask-guided mutual attention (MMA) module. The MMA module explicitly guides the feature maps in a more semantically informative direction that focuses on the central region of pedestrians, which significantly improves performance. The second is confidence score enhancement of the detection head, achieved by the proposed mask-guided score fusion (MSF) module. The MSF module collects information from the classification, IoU, and centerness feature maps and from the learned mask, which discriminates true positives from false positives more effectively. Qualitative results show that the modulated feature maps in both the backbone and the detection head become more semantically meaningful and more robust to scale and occlusion. Our method achieves a considerable gain over the state of the art on the KAIST, CVC14 and FLIR datasets. It also runs at 22 FPS in the default setting, making it suitable for many practical scenarios.
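
The two mask infusion ideas can be illustrated with a minimal sketch. The abstract does not give the actual MMA or MSF formulations, so everything below is an assumption made for illustration: the function names `modulate_features` and `fuse_scores` are hypothetical, the residual rescaling by `1 + sigmoid(mask)` stands in for the learned mutual attention, and a weighted geometric mean stands in for the learned score fusion.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a raw logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def modulate_features(features, mask_logits):
    """Rescale each spatial feature by (1 + mask probability).

    features:    H x W nested list of floats (one channel, for brevity)
    mask_logits: H x W nested list of raw box-level mask logits
    ASSUMPTION: this residual form is illustrative, not the paper's MMA.
    Low-mask locations stay close to their original value, while
    high-mask locations are amplified by a factor approaching 2.
    """
    return [
        [f * (1.0 + sigmoid(m)) for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(features, mask_logits)
    ]


def fuse_scores(cls, iou, ctr, mask, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted geometric mean of four per-location confidences.

    ASSUMPTION: a fixed geometric mean replaces the learned MSF module.
    Inputs are expected in (0, 1]; eps guards against log(0).
    """
    eps = 1e-12
    scores = (cls, iou, ctr, mask)
    log_sum = sum(w * math.log(max(s, eps)) for w, s in zip(weights, scores))
    return math.exp(log_sum / sum(weights))


# A candidate box with strong classification but weak mask support
# receives a lower fused score than one the mask agrees with.
supported = fuse_scores(0.9, 0.8, 0.85, 0.9)
unsupported = fuse_scores(0.9, 0.8, 0.85, 0.1)
```

Under these assumptions, a low mask probability pulls the fused confidence down even when the classification score is high, which is the intuition behind using the mask to separate true from false positives.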
