Article

AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions

Journal

IEEE Access
Volume 9, Pages 80500-80510

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/ACCESS.2021.3074797

Keywords

Feature extraction; Visualization; Computer architecture; Task analysis; Convolution; Computational modeling; Spectrogram; Multi-scale architecture; audio-visual model; cascade fusion; crowd counting

Funding

  1. Guangdong Academy of Sciences' (GDAS') Project of Science and Technology Development [2017GDASCX-0115, 2018GDASCX-0115]
  2. Guangdong Academy of Science for the Special Fund of Introducing Doctoral Talent [2021GDASYL-20210103087]
  3. Opening Foundation of Xinjiang Production and Construction Corps Key Laboratory of Modern Agricultural Machinery [BTNJ2021003]

Abstract

Crowd counting is an essential computer vision application; the AVMSN model addresses it by exploiting cross-modal visual and audio information to handle counting tasks under low-quality conditions.
Crowd counting is an essential computer vision application that typically uses a convolutional neural network to model crowd density as a regression task. However, vision-only models struggle to extract reliable features under low-quality conditions. Because humans perceive changes in the physical world largely through visual and auditory media, cross-modal information offers an alternative way to approach the counting task. This paper therefore proposes the Audio-Visual Multi-Scale Network (AVMSN), which models unconstrained visual and audio sources to perform crowd counting. The AVMSN is built from a feature-extraction module and a multi-modal fusion module. To handle objects of various sizes in the crowd scene, the feature-extraction module adopts Sample Convolutional Blocks as its multi-scale vision-end branch to compute a weighted visual feature. On the audio end, the temporal-domain signal is transformed into a spectrogram, from which an audio feature is learned by an audio-VGG network. Finally, the multi-modal fusion module combines the weighted visual and audio features through a cascade fusion architecture to produce the estimated density map. Experimental results show that the proposed AVMSN achieves a lower mean absolute error than other state-of-the-art crowd counting models under low-quality conditions.
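The abstract describes a two-stream design: a multi-scale vision branch producing a weighted visual feature, an audio-VGG branch encoding a spectrogram, and a fusion module regressing the density map. The PyTorch sketch below illustrates that overall shape under loud assumptions: the module names, layer counts, kernel sizes, and single-stage fusion are illustrative simplifications, not the authors' exact AVMSN configuration (the paper's cascade fusion spans multiple stages).

```python
# A minimal sketch of an audio-visual two-stream counting network in the
# spirit of AVMSN. All names and hyper-parameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleVisionBranch(nn.Module):
    """Parallel convolutions with different kernel sizes stand in for the
    paper's multi-scale Sample Convolutional Blocks on the vision end."""

    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)  # assumed receptive-field scales
        ])
        # Learned per-branch weights ("weighted visual feature").
        self.branch_weights = nn.Parameter(torch.ones(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.branch_weights, dim=0)
        feats = [wi * F.relu(b(x)) for wi, b in zip(w, self.branches)]
        return torch.stack(feats, dim=0).sum(dim=0)


class AudioVGG(nn.Module):
    """A small VGG-style stack applied to a (1, F, T) log-mel spectrogram."""

    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.proj(self.features(spec).flatten(1))  # (B, out_dim)


class AVMSNSketch(nn.Module):
    """Broadcasts the audio embedding over the visual feature map and
    regresses a single-channel density map; the paper's multi-stage
    cascade fusion is collapsed to one stage here for brevity."""

    def __init__(self, ch: int = 64):
        super().__init__()
        self.vision = MultiScaleVisionBranch(out_ch=ch)
        self.audio = AudioVGG(out_dim=ch)
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)
        self.head = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, image: torch.Tensor, spec: torch.Tensor) -> torch.Tensor:
        v = self.vision(image)                 # (B, C, H, W)
        a = self.audio(spec)                   # (B, C)
        a = a[:, :, None, None].expand_as(v)   # tile over spatial positions
        fused = F.relu(self.fuse(torch.cat([v, a], dim=1)))
        return F.relu(self.head(fused))        # non-negative density map


if __name__ == "__main__":
    model = AVMSNSketch()
    img = torch.randn(2, 3, 128, 128)   # dummy RGB frames
    spec = torch.randn(2, 1, 64, 128)   # dummy log-mel spectrograms
    density = model(img, spec)
    print(density.shape, density.sum(dim=(1, 2, 3)))  # per-image count estimates
```

Summing the predicted density map over its spatial dimensions yields the estimated head count, which is the quantity behind the mean absolute error reported in the paper.

The abstract also notes that the audio is transformed from the temporal domain into a spectrogram before entering the audio-VGG branch. A minimal preprocessing sketch using torchaudio follows; the file path and mel parameters are assumptions for illustration, not values from the paper.

```python
# Hypothetical audio preprocessing: raw waveform -> log-mel spectrogram
# shaped for the AudioVGG branch in the sketch above.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("scene_audio.wav")  # (channels, samples)
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=64)
spec = torch.log(mel(waveform) + 1e-6)   # log compression for numerical stability
spec = spec.mean(dim=0, keepdim=True)    # mix down to mono: (1, 64, T)
spec = spec.unsqueeze(0)                 # add batch dim for the model: (1, 1, 64, T)
```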
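One design point worth noting: because the audio embedding is a global vector while the visual feature is a spatial map, any fusion scheme must reconcile the two shapes. The sketch above does this by tiling the audio vector across all spatial positions before a 1x1 convolution, which is a common and simple choice; the paper's cascade fusion architecture may combine the modalities differently at each stage.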
