Article

Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention

Journal

Sensors
Volume 22, Issue 18, Article 6818

Publisher

MDPI
DOI: 10.3390/s22186818

Keywords

sound event detection; temporal-frequency attention; feature space attention; convolutional recurrent neural networks; feature aggregation

Funding

  1. National Natural Science Foundation of China [62071135]
  2. Project of Guangxi Technology Base and Talent Special Project [GuiKe AD20159018]
  3. Project of Guangxi Natural Science Foundation [2020GXNSFAA159004]
  4. Fund of Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education [CRKL200104]
  5. Opening Project of Guangxi Key Laboratory of UAV Remote Sensing [WRJ2016KF01]


This paper proposes TFFS-CRNN, a convolutional recurrent neural network built on a temporal-frequency (TF) attention mechanism and a feature space (FS) attention mechanism, which improves feature representation in polyphonic sound event detection. The two attention modules let the model focus on important features, and experiments on DCASE challenge datasets show improved performance over the challenge-winning systems.
The complexity of polyphonic sounds poses numerous challenges for their classification. In real-life recordings in particular, polyphonic sound events are discontinuous and exhibit unstable time-frequency variations. A single traditional acoustic feature cannot characterize the key information of polyphonic sound events, and this deficiency results in poor classification performance. In this paper, we propose a convolutional recurrent neural network model based on a temporal-frequency (TF) attention mechanism and a feature space (FS) attention mechanism (TFFS-CRNN). The TFFS-CRNN model aggregates log-mel spectrograms and MFCCs as inputs and comprises a TF-attention module, a convolutional recurrent neural network (CRNN) module, an FS-attention module, and a bidirectional gated recurrent unit (BGRU) module. In polyphonic sound event detection (SED), the TF-attention module captures critical temporal-frequency features more effectively, while the FS-attention module assigns dynamically learnable weights to the different feature dimensions. Together, the two attention modules let the model focus on semantically relevant time frames, key frequency bands, and important feature dimensions, improving the representation of key information in polyphonic SED. Finally, the BGRU module learns contextual information. Experiments were conducted on the DCASE 2016 Task 3 and DCASE 2017 Task 3 datasets. The results show that the F1-score of the TFFS-CRNN model improved by 12.4% and 25.2%, respectively, over the winning systems of the corresponding DCASE challenges, while the error rate (ER) was reduced by 0.41 and 0.37. The proposed TFFS-CRNN model thus achieves better classification performance and a lower ER in polyphonic SED.
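
Since the abstract only sketches the architecture, a minimal illustrative implementation may help. The following is a hedged PyTorch sketch, not the authors' code: the layer sizes, kernel shapes, the 1x1-convolution form of the TF-attention mask, the softmax form of the FS-attention weights, and the input dimensions (64 mel bands plus 40 MFCCs concatenated along the frequency axis, 6 event classes as in DCASE 2017 Task 3) are all assumptions made for illustration.

```python
# Hypothetical sketch of a TFFS-CRNN-style model (assumed structure,
# not the authors' implementation).
import torch
import torch.nn as nn

class TFAttention(nn.Module):
    """Temporal-frequency attention: a sigmoid mask over the (time, freq) plane."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # scalar score per TF bin

    def forward(self, x):                    # x: (batch, channels, time, freq)
        mask = torch.sigmoid(self.conv(x))   # (batch, 1, time, freq)
        return x * mask                      # re-weight each time-frequency bin

class FSAttention(nn.Module):
    """Feature-space attention: dynamically learnable weights per feature dimension."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (batch, time, dim)
        weights = torch.softmax(self.fc(x), dim=-1)
        return x * weights                   # per-dimension dynamic weighting

class TFFSCRNN(nn.Module):
    def __init__(self, n_freq=104, n_classes=6, channels=64, gru_units=128):
        super().__init__()
        self.tf_att = TFAttention(1)
        self.cnn = nn.Sequential(            # CRNN convolutional front end
            nn.Conv2d(1, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.MaxPool2d((1, 4)),            # pool frequency only, keep time resolution
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        feat_dim = channels * (n_freq // 16)
        self.fs_att = FSAttention(feat_dim)
        self.bgru = nn.GRU(feat_dim, gru_units, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * gru_units, n_classes)

    def forward(self, x):                    # x: (batch, 1, time, freq)
        x = self.tf_att(x)                   # emphasize key frames and frequency bands
        x = self.cnn(x)                      # (batch, channels, time, freq // 16)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = self.fs_att(x)                   # re-weight feature dimensions
        x, _ = self.bgru(x)                  # contextual modeling over frames
        return torch.sigmoid(self.head(x))   # frame-wise multi-label event activity

# Usage: the input is the aggregated feature, i.e. log-mel spectrogram and MFCCs
# stacked along the frequency axis (64 + 40 = 104 bins here, an assumed split).
model = TFFSCRNN()
activity = model(torch.randn(2, 1, 500, 104))  # 2 clips, 500 frames, 104 bins
print(activity.shape)                          # torch.Size([2, 500, 6])
```

Note the sigmoid (rather than softmax) output head: in polyphonic SED several events can be active in the same frame, so each class gets an independent frame-wise activity probability.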


