Article

AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions

Journal

IEEE Access
Volume 9, Pages 80500-80510

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/ACCESS.2021.3074797

Keywords

Feature extraction; Visualization; Computer architecture; Task analysis; Convolution; Computational modeling; Spectrogram; Multi-scale architecture; audio-visual model; cascade fusion; crowd counting

Funding

  1. Guangdong Academy of Sciences' (GDAS') Project of Science and Technology Development [2017GDASCX-0115, 2018GDASCX-0115]
  2. Guangdong Academy of Science for the Special Fund of Introducing Doctoral Talent [2021GDASYL-20210103087]
  3. Opening Foundation of Xinjiang Production and Construction Corps Key Laboratory of Modern Agricultural Machinery [BTNJ2021003]

Abstract

Crowd counting is an essential computer vision application; the AVMSN model addresses it by exploiting cross-modal visual and audio information to handle counting tasks under low-quality conditions.
Crowd counting is an essential computer vision application that typically uses a convolutional neural network to model crowd density as a regression task. However, vision-only models struggle to extract reliable features under low-quality conditions. Because humans perceive changes in the physical world largely through visual and auditory media, cross-modal information offers an alternative way to approach the counting task. This paper therefore proposes the Audio-Visual Multi-Scale Network (AVMSN), which models unconstrained visual and audio sources to perform crowd counting. The AVMSN is built from a feature-extraction module and a multi-modal fusion module. To handle objects of various sizes in the crowd scene, the feature-extraction module adopts Sample Convolutional Blocks as its multi-scale vision-end branch to compute a weighted visual feature. On the audio end, the temporal-domain signal is transformed into a spectrogram, from which an audio feature is learned by an audio-VGG network. Finally, the multi-modal fusion module combines the weighted visual and audio features through a cascade fusion architecture to produce the estimated density map. Experimental results show that the proposed AVMSN achieves a lower mean absolute error than other state-of-the-art crowd counting models under low-quality conditions.
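The abstract describes a two-stream design: a multi-scale vision branch producing a weighted visual feature, an audio-VGG branch encoding a spectrogram, and a fusion module regressing the density map. The PyTorch sketch below illustrates that overall shape under loud assumptions: the module names, layer counts, kernel sizes, and single-stage fusion are illustrative simplifications, not the authors' exact AVMSN configuration (the paper's cascade fusion spans multiple stages).

```python
# A minimal sketch of an audio-visual two-stream counting network in the
# spirit of AVMSN. All names and hyper-parameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleVisionBranch(nn.Module):
    """Parallel convolutions with different kernel sizes stand in for the
    paper's multi-scale Sample Convolutional Blocks on the vision end."""

    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)  # assumed receptive-field scales
        ])
        # Learned per-branch weights ("weighted visual feature").
        self.branch_weights = nn.Parameter(torch.ones(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.branch_weights, dim=0)
        feats = [wi * F.relu(b(x)) for wi, b in zip(w, self.branches)]
        return torch.stack(feats, dim=0).sum(dim=0)


class AudioVGG(nn.Module):
    """A small VGG-style stack applied to a (1, F, T) log-mel spectrogram."""

    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.proj(self.features(spec).flatten(1))  # (B, out_dim)


class AVMSNSketch(nn.Module):
    """Broadcasts the audio embedding over the visual feature map and
    regresses a single-channel density map; the paper's multi-stage
    cascade fusion is collapsed to one stage here for brevity."""

    def __init__(self, ch: int = 64):
        super().__init__()
        self.vision = MultiScaleVisionBranch(out_ch=ch)
        self.audio = AudioVGG(out_dim=ch)
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)
        self.head = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, image: torch.Tensor, spec: torch.Tensor) -> torch.Tensor:
        v = self.vision(image)                 # (B, C, H, W)
        a = self.audio(spec)                   # (B, C)
        a = a[:, :, None, None].expand_as(v)   # tile over spatial positions
        fused = F.relu(self.fuse(torch.cat([v, a], dim=1)))
        return F.relu(self.head(fused))        # non-negative density map


if __name__ == "__main__":
    model = AVMSNSketch()
    img = torch.randn(2, 3, 128, 128)   # dummy RGB frames
    spec = torch.randn(2, 1, 64, 128)   # dummy log-mel spectrograms
    density = model(img, spec)
    print(density.shape, density.sum(dim=(1, 2, 3)))  # per-image count estimates
```

Summing the predicted density map over its spatial dimensions yields the estimated head count, which is the quantity behind the mean absolute error reported in the paper.

The abstract also notes that the audio is transformed from the temporal domain into a spectrogram before entering the audio-VGG branch. A minimal preprocessing sketch using torchaudio follows; the file path and mel parameters are assumptions for illustration, not values from the paper.

```python
# Hypothetical audio preprocessing: raw waveform -> log-mel spectrogram
# shaped for the AudioVGG branch in the sketch above.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("scene_audio.wav")  # (channels, samples)
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=64)
spec = torch.log(mel(waveform) + 1e-6)   # log compression for numerical stability
spec = spec.mean(dim=0, keepdim=True)    # mix down to mono: (1, 64, T)
spec = spec.unsqueeze(0)                 # add batch dim for the model: (1, 1, 64, T)
```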
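One design point worth noting: because the audio embedding is a global vector while the visual feature is a spatial map, any fusion scheme must reconcile the two shapes. The sketch above does this by tiling the audio vector across all spatial positions before a 1x1 convolution, which is a common and simple choice; the paper's cascade fusion architecture may combine the modalities differently at each stage.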
