Article

Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention

Journal

Sensors
Volume 22, Issue 18, Article 6818

Publisher

MDPI
DOI: 10.3390/s22186818

Keywords

sound event detection; temporal-frequency attention; feature space attention; convolutional recurrent neural networks; feature aggregation

Funding

  1. National Natural Science Foundation of China [62071135]
  2. Project of Guangxi Technology Base and Talent Special Project [GuiKe AD20159018]
  3. Project of Guangxi Natural Science Foundation [2020GXNSFAA159004]
  4. Fund of Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education [CRKL200104]
  5. Opening Project of Guangxi Key Laboratory of UAV Remote Sensing [WRJ2016KF01]


This paper proposes TFFS-CRNN, a convolutional recurrent neural network built on a temporal-frequency (TF) attention mechanism and a feature space (FS) attention mechanism, which improves feature representation in polyphonic sound event detection. The two attention modules let the model focus on important features, and experiments on DCASE challenge datasets show improved performance over the challenge-winning systems.
The complexity of polyphonic sounds poses numerous challenges for their classification. In real-life recordings in particular, polyphonic sound events are discontinuous and exhibit unstable time-frequency variations. A single traditional acoustic feature cannot characterize the key information of polyphonic sound events, and this deficiency results in poor classification performance. In this paper, we propose a convolutional recurrent neural network model based on a temporal-frequency (TF) attention mechanism and a feature space (FS) attention mechanism (TFFS-CRNN). The TFFS-CRNN model aggregates log-mel spectrograms and MFCCs as inputs and comprises a TF-attention module, a convolutional recurrent neural network (CRNN) module, an FS-attention module, and a bidirectional gated recurrent unit (BGRU) module. In polyphonic sound event detection (SED), the TF-attention module captures critical temporal-frequency features more effectively, while the FS-attention module assigns dynamically learnable weights to the different feature dimensions. Together, the two attention modules let the model focus on semantically relevant time frames, key frequency bands, and important feature dimensions, improving the representation of key information in polyphonic SED. Finally, the BGRU module learns contextual information. Experiments were conducted on the DCASE 2016 Task 3 and DCASE 2017 Task 3 datasets. The results show that the F1-score of the TFFS-CRNN model improved by 12.4% and 25.2%, respectively, over the winning systems of the corresponding DCASE challenges, while the error rate (ER) was reduced by 0.41 and 0.37. The proposed TFFS-CRNN model thus achieves better classification performance and a lower ER in polyphonic SED.
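
Since the abstract only sketches the architecture, a minimal illustrative implementation may help. The following is a hedged PyTorch sketch, not the authors' code: the layer sizes, kernel shapes, the 1x1-convolution form of the TF-attention mask, the softmax form of the FS-attention weights, and the input dimensions (64 mel bands plus 40 MFCCs concatenated along the frequency axis, 6 event classes as in DCASE 2017 Task 3) are all assumptions made for illustration.

```python
# Hypothetical sketch of a TFFS-CRNN-style model (assumed structure,
# not the authors' implementation).
import torch
import torch.nn as nn

class TFAttention(nn.Module):
    """Temporal-frequency attention: a sigmoid mask over the (time, freq) plane."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # scalar score per TF bin

    def forward(self, x):                    # x: (batch, channels, time, freq)
        mask = torch.sigmoid(self.conv(x))   # (batch, 1, time, freq)
        return x * mask                      # re-weight each time-frequency bin

class FSAttention(nn.Module):
    """Feature-space attention: dynamically learnable weights per feature dimension."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (batch, time, dim)
        weights = torch.softmax(self.fc(x), dim=-1)
        return x * weights                   # per-dimension dynamic weighting

class TFFSCRNN(nn.Module):
    def __init__(self, n_freq=104, n_classes=6, channels=64, gru_units=128):
        super().__init__()
        self.tf_att = TFAttention(1)
        self.cnn = nn.Sequential(            # CRNN convolutional front end
            nn.Conv2d(1, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.MaxPool2d((1, 4)),            # pool frequency only, keep time resolution
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        feat_dim = channels * (n_freq // 16)
        self.fs_att = FSAttention(feat_dim)
        self.bgru = nn.GRU(feat_dim, gru_units, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * gru_units, n_classes)

    def forward(self, x):                    # x: (batch, 1, time, freq)
        x = self.tf_att(x)                   # emphasize key frames and frequency bands
        x = self.cnn(x)                      # (batch, channels, time, freq // 16)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = self.fs_att(x)                   # re-weight feature dimensions
        x, _ = self.bgru(x)                  # contextual modeling over frames
        return torch.sigmoid(self.head(x))   # frame-wise multi-label event activity

# Usage: the input is the aggregated feature, i.e. log-mel spectrogram and MFCCs
# stacked along the frequency axis (64 + 40 = 104 bins here, an assumed split).
model = TFFSCRNN()
activity = model(torch.randn(2, 1, 500, 104))  # 2 clips, 500 frames, 104 bins
print(activity.shape)                          # torch.Size([2, 500, 6])
```

Note the sigmoid (rather than softmax) output head: in polyphonic SED several events can be active in the same frame, so each class gets an independent frame-wise activity probability.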


