4.7 Article

Multi-target ensemble learning based speech enhancement with temporal-spectral structured target

Journal

APPLIED ACOUSTICS
Volume 205, Issue -, Pages -

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.apacoust.2023.109268

Keywords

Speech enhancement; Temporal-spectral structured target; Multi-target ensemble learning; Sparse nonnegative matrix factorization


This paper proposes a novel structured multi-target ensemble learning (SMTEL) framework to improve speech quality and intelligibility. The method captures the basis matrices of clean speech, noise, and the ideal ratio mask (IRM) using sparse nonnegative matrix factorization and co-trains them with a multi-target DNN. A jointly trained single-layer perceptron is then proposed to integrate the two targets and further enhance speech quality and intelligibility. Experimental results show that the proposed method achieves the best enhancement performance in visible nonstationary noise environments, at low network cost and complexity.
Recently, deep neural network (DNN)-based speech enhancement has shown considerable success, with mapping-based and masking-based approaches being the two most commonly used methods. However, these methods do not consider the spectral structure of the signal. In this paper, a novel structured multi-target ensemble learning (SMTEL) framework is proposed, which uses the temporal-spectral structures of the targets to improve speech quality and intelligibility. First, the basis matrices of clean speech, noise, and the ideal ratio mask (IRM), which contain the basic structures of the signal, are captured by sparse nonnegative matrix factorization. Second, the basis matrices are co-trained with a multi-target DNN to estimate the activation matrices instead of directly estimating the targets. Then, a jointly trained single-layer perceptron is proposed to integrate the two targets and further improve speech quality and intelligibility. The sequential floating forward selection method is used to systematically analyze the impact of the integrated targets on enhancement performance and the effect of the target weights on the results. Finally, the proposed method is combined with progressive learning to further improve enhancement performance. Systematic experiments on the UW/NU corpus show that the proposed method achieves the best enhancement performance at low network cost and complexity, especially in visible nonstationary noise environments. Compared with the target integration method that does not use structured targets and the long short-term memory masking method, the proposed method improves speech quality by 25.6% and 29.2% under restaurant noise, and speech intelligibility by 35.5% and 15.8%, respectively. (c) 2023 Elsevier Ltd. All rights reserved.
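
The first step of the abstract, capturing a basis matrix by sparse nonnegative matrix factorization, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The Euclidean cost, the L1 penalty on the activations, the multiplicative update rule, the square-root form of the IRM, and all function and parameter names (`ideal_ratio_mask`, `sparse_nmf`, `sparsity`) are assumptions chosen for illustration only.

```python
import numpy as np


def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-12):
    """One common IRM definition (assumed here): sqrt(S^2 / (S^2 + N^2))."""
    s2 = clean_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))


def sparse_nmf(V, rank, sparsity=0.01, n_iter=200, eps=1e-12, seed=0):
    """Factor a nonnegative matrix V (freq x frames) as W @ H.

    Assumed formulation: Euclidean cost plus an L1 penalty on H,
    minimized with multiplicative updates. Columns of W are kept at
    unit L2 norm so the penalty cannot be dodged by rescaling.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps   # basis matrix
    H = rng.random((rank, n_frames)) + eps  # activation matrix
    for _ in range(n_iter):
        # Update activations; the sparsity term shrinks H toward zero.
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + eps)
        # Update bases with the plain Euclidean rule.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Renormalize W and rescale H so that W @ H is unchanged.
        norms = np.linalg.norm(W, axis=0, keepdims=True) + eps
        W /= norms
        H *= norms.T
    return W, H


if __name__ == "__main__":
    # Synthetic stand-ins for clean-speech and noise magnitude spectrograms.
    rng = np.random.default_rng(1)
    clean = rng.random((20, 4)) @ rng.random((4, 50))
    noise = rng.random((20, 50))

    W_speech, H_speech = sparse_nmf(clean, rank=4)
    W_irm, H_irm = sparse_nmf(ideal_ratio_mask(clean, noise), rank=4)

    # In the framework described above, a DNN would estimate the
    # activations; here the NMF activations are reused as a placeholder.
    mask_hat = np.clip(W_irm @ H_irm, 0.0, 1.0)
    rel_err = np.linalg.norm(clean - W_speech @ H_speech) / np.linalg.norm(clean)
    print(f"speech reconstruction relative error: {rel_err:.3f}")
    print(f"reconstructed mask shape: {mask_hat.shape}")
```

In this sketch the basis matrix `W` stays fixed once learned, so a target is always reconstructed as `W @ H` from its activations. That is the sense in which estimating activations, rather than the targets directly, constrains the output to the learned temporal-spectral structure.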

