Article

On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification

NEUROCOMPUTING (2021)

Journal

NEUROCOMPUTING

Volume 456, Pages 49-60

Publisher

ELSEVIER

DOI: 10.1016/j.neucom.2021.05.065

Keywords

Speech intelligibility; LSTM; Attention; Acoustic spectrogram; Modulation spectrogram; Fusion

Category

Computer Science, Artificial Intelligence

Funding

Spanish Government-MinECo [TEC2017-84395-P, TEC201784593C21R]

Summary

This study presents a system for automatic prediction of the speech intelligibility level using LSTM networks with an attention mechanism. It makes two main contributions: using per-frame modulation spectrograms as input features, and exploring two strategies for combining per-frame log-mel and modulation spectrograms in the LSTM framework. Results show that attentional LSTM networks can effectively model modulation spectrograms and that both combination strategies outperform the single-feature systems.

Abstract

Speech intelligibility can be affected by multiple factors, such as noisy environments, channel distortions, or physiological issues. In this work, we address automatic prediction of the speech intelligibility level in the latter case. Building on our previous work, a non-intrusive system based on LSTM networks with an attention mechanism designed for this task, we present two main contributions. First, we propose using per-frame modulation spectrograms as input features, instead of compact representations derived from them that discard important temporal information. Second, we explore two strategies for combining per-frame acoustic log-mel and modulation spectrograms within the LSTM framework: at the decision level (late fusion) and at the utterance level (Weighted-Pooling (WP) fusion). The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. On the one hand, the results show that attentional LSTM networks can adequately model sequences of modulation spectrograms, producing classification rates similar to those obtained with log-mel spectrograms. On the other hand, both combination strategies, late and WP fusion, outperform the single-feature systems, suggesting that per-frame log-mel and modulation spectrograms carry complementary information for speech intelligibility prediction that can be effectively exploited by LSTM-based architectures. The system with the WP fusion strategy and Attention-Pooling achieves the best results. (c) 2021 Elsevier B.V. All rights reserved.
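The abstract names attention pooling and two fusion strategies but does not give their formulas, so the following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes attention pooling is a softmax-weighted sum of per-frame LSTM outputs, that late fusion is a weighted average of the class posteriors of two single-feature systems, and that utterance-level fusion concatenates the pooled vectors of the two streams. The function names, the scoring vectors, the mixing weight `lam`, and the random arrays standing in for LSTM outputs are all hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(h, w):
    # h: (T, D) per-frame outputs; w: (D,) scoring vector (assumed form).
    # Returns the utterance-level vector (D,) and the frame weights (T,).
    alpha = softmax(h @ w)
    return alpha @ h, alpha

def late_fusion(p_a, p_m, lam=0.5):
    # Decision-level fusion (assumed): weighted average of class posteriors.
    return lam * p_a + (1.0 - lam) * p_m

def utterance_fusion(h_a, h_m, w_a, w_m):
    # Utterance-level fusion (assumed): pool each stream, then concatenate.
    u_a, _ = attention_pool(h_a, w_a)
    u_m, _ = attention_pool(h_m, w_m)
    return np.concatenate([u_a, u_m])

# Random stand-ins for the LSTM outputs of the two feature streams.
rng = np.random.default_rng(0)
T, D = 50, 8
h_logmel = rng.normal(size=(T, D))   # hypothetical log-mel stream outputs
h_mod = rng.normal(size=(T, D))      # hypothetical modulation stream outputs
w_logmel = rng.normal(size=D)
w_mod = rng.normal(size=D)

u, alpha = attention_pool(h_logmel, w_logmel)
fused = utterance_fusion(h_logmel, h_mod, w_logmel, w_mod)

# Hypothetical posteriors over three intelligibility classes.
p_logmel = np.array([0.6, 0.3, 0.1])
p_mod = np.array([0.4, 0.4, 0.2])
p_late = late_fusion(p_logmel, p_mod)
```

With an all-zero scoring vector the frame weights become uniform and attention pooling reduces to mean pooling over time, which is a convenient sanity check for the pooling step.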
