☆ 4.6 Article

Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (2023)

期刊

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

卷 31, 期 -, 页码 54-70

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASLP.2022.3205757

关键词

Measurement; Predictive models; Adaptation models; Acoustic distortion; Psychoacoustic models; Speech enhancement; Acoustics; Deep learning; multi-objective learning; non-intrusive speech assessment models; speech enhancement

类别

Acoustics Engineering, Electrical & Electronic

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

智能总结 New
摘要

This study proposes MOSA-Net, a cross-domain multi-objective speech assessment model that can estimate speech quality, intelligibility, and distortion assessment scores simultaneously. Experimental results show that MOSA-Net improves the prediction of speech quality and short-time objective intelligibility compared to existing single-task models. Moreover, MOSA-Net can be effectively adapted to predict subjective quality and intelligibility scores with limited training data. The proposed QIA-SE approach, guided by MOSA-Net's latent representations, also outperforms the baseline SE system in terms of PESQ scores.

This study proposes a cross-domain multi-objective speech assessment model, called MOSA-Net, which can simultaneously estimate the speech quality, intelligibility, and distortion assessment scores of an input speech signal. MOSA-Net comprises a convolutional neural network and bidirectional long short-term memory architecture for representation extraction, and a multiplicative attention layer and a fully connected layer for each assessment metric prediction. Additionally, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned (SSL) models are used as inputs to combine rich acoustic information to obtain more accurate assessments. Experimental results show that in both seen and unseen noise environments, MOSA-Net can improve the linear correlation coefficient (LCC) scores in perceptual evaluation of speech quality (PESQ) prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC scores in short-time objective intelligibility (STOI) prediction, compared to STOI-Net, an existing single-task model for STOI prediction. Moreover, MOSA-Net can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC scores in mean opinion score (MOS) predictions, compared to MOS-SSL, a strong single-task model for MOS prediction. We further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach. Experimental results show that QIA-SE outperforms the baseline SE system with improved PESQ scores in both seen and unseen noise environments over a baseline SE model.

Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

期刊

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

期刊

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文