
Acoustic scene classification based on three-dimensional multi-channel feature-correlated deep learning networks

SCIENTIFIC REPORTS (2022)

Journal

SCIENTIFIC REPORTS

Volume 12, Issue 1

Publisher

NATURE PORTFOLIO

DOI: 10.1038/s41598-022-17863-z

Automated Summary

This paper proposes a multi-branch three-dimensional (3D) convolutional neural network (CNN) model for accurate acoustic scene classification (ASC). Multiple frequency-domain representations of the signals are formed by drawing on expert knowledge of acoustics and on discrete wavelet transformations (DWT). The proposed 3D CNN architecture, which combines residual connections with squeeze-and-excitation attention (3D-SE-ResNet), captures both long-term and short-term correlations in environmental sounds. An auxiliary supervised branch based on the chromatogram of the original signal is added to reduce the risk of overfitting. Numerical evaluation on a large-scale DCASE 2019 dataset shows performance gains for the proposed multi-input multi-feature 3D-CNN architecture over state-of-the-art methods.

Abstract

As an effective approach to perceiving environments, acoustic scene classification (ASC) has received considerable attention in the past few years. ASC is generally considered a challenging task because the differences between classes of environmental sounds are subtle. In this paper, we propose a novel approach that performs accurate classification by aggregating spatial-temporal features extracted from a multi-branch three-dimensional (3D) convolutional neural network (CNN) model. The novelties of this paper are as follows. First, we form multiple frequency-domain representations of signals by fully utilizing expert knowledge of acoustics and discrete wavelet transformations (DWT). Second, we propose a novel 3D CNN architecture featuring residual connections and squeeze-and-excitation attention (3D-SE-ResNet) to effectively capture both the long-term and short-term correlations inherent in environmental sounds. Third, an auxiliary supervised branch based on the chromatogram of the original signal is incorporated into the proposed architecture to alleviate overfitting risks by providing supplementary information to the model. The performance of the proposed multi-input multi-feature 3D-CNN architecture is numerically evaluated on a typical large-scale dataset from the 2019 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2019) and is shown to obtain noticeable performance gains over state-of-the-art methods in the literature.
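To make the DWT step in the abstract more concrete, the following is a minimal sketch of a multi-level discrete wavelet decomposition, which splits a signal frame into one coarse (low-frequency) band and several detail (higher-frequency) bands. The abstract does not state the wavelet family, the number of decomposition levels, or how the bands are stacked into network inputs, so the Haar wavelet, the two-level depth, the function names `haar_dwt` and `haar_wavedec`, and the toy `frame` below are all assumptions chosen for illustration, not the authors' implementation.

```python
import math


def haar_dwt(signal):
    """One level of the orthonormal Haar DWT (assumed wavelet, for illustration).

    Returns (approx, detail): pairwise scaled sums (low-pass) and
    pairwise scaled differences (high-pass). Requires an even length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_wavedec(signal, levels):
    """Multi-level decomposition.

    Returns [approx_L, detail_L, ..., detail_1], coarsest band first.
    The signal length must be divisible by 2**levels.
    """
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return [approx] + details[::-1]


# Hypothetical 8-sample frame standing in for a short audio segment.
frame = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
bands = haar_wavedec(frame, levels=2)

for name, band in zip(["approx_2", "detail_2", "detail_1"], bands):
    print(name, [round(c, 3) for c in band])
```

Each band covers a different frequency range of the same frame, which is one plausible way to obtain several frequency-domain views of a signal before feeding them to separate network branches. Because the Haar transform used here is orthonormal, the total energy of the coefficients equals the energy of the input frame.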
