Article

DEGramNet: effective audio analysis based on a fully learnable time-frequency representation

Journal

Neural Computing & Applications

Publisher

Springer London Ltd
DOI: 10.1007/s00521-023-08849-7

Keywords

Deep learning; Audio representation learning; Signal processing; Sound event classification; Speaker identification

Abstract

Current state-of-the-art audio analysis algorithms based on deep learning rely on hand-crafted, spectrogram-like audio representations, which are more compact than descriptors obtained from the raw waveform; the latter, in turn, generalize poorly when little training data is available. However, spectrogram-like representations have two main limitations: (1) the parameters of the filters are defined a priori, regardless of the specific audio analysis task; (2) such representations do not perform any denoising of the audio signal, neither in the time domain nor in the frequency domain. To overcome these limitations, we propose a new general-purpose convolutional architecture for audio analysis tasks, called DEGramNet, which is trained on audio samples described with a novel, compact and learnable time-frequency representation, called DEGram. The proposed representation is fully trainable: it learns the frequencies of interest for the specific audio analysis task and, in addition, performs denoising through a custom time-frequency attention module that amplifies the frequency and time components in which the sound is actually located. As a result, the representation can easily adapt to the problem at hand, for instance giving more importance to voice frequencies when the network is used for speaker recognition. DEGramNet achieves state-of-the-art performance on the VGGSound dataset (sound event classification) and accuracy comparable to a complex, special-purpose approach based on neural architecture search on the VoxCeleb dataset (speaker identification). Moreover, we demonstrate that DEGram achieves high accuracy with lightweight neural networks that can run in real time on embedded systems, making the solution suitable for cognitive robotics applications.
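To make the idea concrete, below is a minimal PyTorch sketch of a learnable time-frequency front-end combined with time-frequency attention, in the spirit of what the abstract describes. It is an illustrative approximation, not the authors' implementation: the class name, the random filterbank initialization, the specific attention design, and all hyperparameters (n_fft, hop, n_filters) are assumptions made for this example.

import torch
import torch.nn as nn

class LearnableTFRepresentation(nn.Module):
    """Hypothetical trainable spectrogram-like front-end with
    time-frequency attention (illustrative sketch, not DEGram itself)."""

    def __init__(self, n_fft=512, hop=160, n_filters=64):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # Trainable filterbank: learns which frequencies matter for the
        # downstream task (random init here; a mel init is another option).
        self.filterbank = nn.Parameter(torch.rand(n_filters, n_fft // 2 + 1))
        # Attention over the learned bands (frequency) and over frames (time).
        self.freq_attn = nn.Sequential(nn.Linear(n_filters, n_filters), nn.Sigmoid())
        self.time_attn = nn.Conv1d(n_filters, 1, kernel_size=9, padding=4)

    def forward(self, wave):  # wave: (batch, samples)
        window = torch.hann_window(self.n_fft, device=wave.device)
        spec = torch.stft(wave, self.n_fft, self.hop, window=window,
                          return_complex=True).abs()              # (B, n_fft//2+1, T)
        x = torch.log(torch.relu(self.filterbank) @ spec + 1e-6)  # (B, n_filters, T)
        # Frequency attention: reweight each learned band by its mean energy.
        f = self.freq_attn(x.mean(dim=2)).unsqueeze(-1)           # (B, n_filters, 1)
        # Time attention: amplify the frames where the sound is actually located.
        t = torch.sigmoid(self.time_attn(x))                      # (B, 1, T)
        return x * f * t

# The output is a (batch, n_filters, frames) "image" that a CNN backbone
# can consume, with gradients flowing back into the filterbank and attention.
feats = LearnableTFRepresentation()(torch.randn(2, 16000))
print(feats.shape)  # torch.Size([2, 64, 101])

Because the filterbank and the attention maps are trained jointly with the backbone, both the frequency selectivity and the denoising adapt to the task at hand, which is the key property the abstract attributes to DEGram.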
