Article

DMMAN: A two-stage audio-visual fusion framework for sound separation and event localization

Journal

NEURAL NETWORKS
Volume 133, Pages 229-239

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.neunet.2020.10.003

Keywords

Two-stage fusion; Audio-visual tasks; Sound source separation; Sound event localization

Funding

  1. National Natural Science Foundation of China [61803107, 81971702]
  2. National Key Research and Development Project [2019YFB1804203, 2019YFB1804204]
  3. Guangdong Province Key Research and Development Project [2019B010154002]
  4. GDAS' Project of Science and Technology Development [2017GDASCX-0115, 2018GDASCX-0115]


This paper introduces the Deep Multi-Modal Attention Network (DMMAN), which tackles sound separation and modal synchronization through a multi-modal separator and a multi-modal matching classifier module. The model fuses sound and visual features in two stages and combines regression and classification losses into the DMMAN loss function; the resulting spectrum masks and attention synchronization scores transfer directly to the sound source separation and event localization tasks.
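The abstract does not give the exact loss forms, so the following is a minimal, hypothetical PyTorch-style sketch of how a regression term on the estimated spectrum masks might be combined with a classification term on the audio-visual matching scores. The function name `dmman_loss`, the L1 regression loss, the cross-entropy classification loss, and the weighting factor `lambda_cls` are all illustrative assumptions, not the authors' stated implementation.

```python
# Hypothetical sketch of a combined regression + classification objective,
# as suggested (but not specified in detail) by the abstract.
import torch
import torch.nn.functional as F

def dmman_loss(pred_mask, target_mask, match_logits, match_labels, lambda_cls=1.0):
    """Combine a mask-regression term with an audio-visual matching
    classification term into a single training objective.

    pred_mask / target_mask : (batch, freq, time) spectrogram masks
    match_logits            : (batch, num_classes) synchronization scores
    match_labels            : (batch,) ground-truth class indices
    lambda_cls              : assumed weight between the two terms
    """
    # Regression loss on the estimated spectrum masks (L1 assumed here).
    loss_reg = F.l1_loss(pred_mask, target_mask)
    # Classification loss on the matching / synchronization scores.
    loss_cls = F.cross_entropy(match_logits, match_labels)
    return loss_reg + lambda_cls * loss_cls
```

In the paper, these two terms are what link the multi-modal separator and the multi-modal matching classifier; the specific loss choices and weighting above are placeholders.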
Videos are a widely used medium through which people perceive physical changes in the world. However, we usually receive a mixture of sounds from multiple sound objects and cannot distinguish or localize each sound as a separate entity in a video. To address this problem, this paper establishes the Deep Multi-Modal Attention Network (DMMAN), which models unconstrained video datasets to perform sound source separation and event localization. Built on a multi-modal separator and a multi-modal matching classifier module, the model addresses the sound separation and modal synchronization problems through a two-stage fusion of sound and visual features. To link the multi-modal separator and the multi-modal matching classifier, regression and classification losses are combined to form the DMMAN loss function. The spectrum masks and attention synchronization scores estimated by the DMMAN generalize readily to the sound source separation and event localization tasks. Quantitative experiments show that the DMMAN not only separates high-quality sound sources, as measured by the Signal-to-Distortion Ratio (SDR) and Signal-to-Interference Ratio (SIR), but also handles mixed sound scenes that were never heard jointly during training. Moreover, the DMMAN achieves better classification accuracy than the contrast baselines on the event localization tasks. (C) 2020 Elsevier Ltd. All rights reserved.
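Since separation quality is reported with SDR and SIR, here is a brief example of how these metrics are commonly computed with the `mir_eval` BSS Eval implementation. The abstract does not state which toolkit the authors used, and the waveform shapes and random data below are purely illustrative.

```python
# Illustrative SDR/SIR computation using mir_eval's BSS Eval routines;
# the abstract reports these metrics but does not name a specific toolkit.
import numpy as np
import mir_eval

# Shapes: (n_sources, n_samples); dummy data stands in for real waveforms.
reference_sources = np.random.randn(2, 16000)   # ground-truth separated sources
estimated_sources = np.random.randn(2, 16000)   # sources recovered by the model

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    reference_sources, estimated_sources)

print("SDR per source (dB):", sdr)   # Signal-to-Distortion Ratio
print("SIR per source (dB):", sir)   # Signal-to-Interference Ratio
```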

