☆ 4.7 Article

CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement

INFORMATION FUSION (2020)

Journal

INFORMATION FUSION

Volume 63, Issue -, Pages 273-285

Publisher

ELSEVIER

DOI: 10.1016/j.inffus.2020.04.001

Keywords

Audio-Visual; Speech enhancement; Speech separation; Deep learning; Real noisy audio-visual corpus; Speaker independent; Noise-independent; Language-independent; Multi-modal; Hearing aids

Categories

Computer Science, Artificial Intelligence; Computer Science, Theory & Methods

Funding

Edinburgh Napier University
UK Engineering and Physical Sciences Research Council (EPSRC) [EP/M026981/1]
EPSRC [EP/M026981/1] Funding Source: UKRI

Abstract

Noisy situations cause huge problems for the hearing-impaired, as hearing aids often make speech more audible but do not always restore intelligibility. In noisy settings, humans routinely exploit the audio-visual (AV) nature of speech to selectively suppress background noise and focus on the target speaker. In this paper, we present a novel language-, noise- and speaker-independent AV deep neural network (DNN) architecture, termed CochleaNet, for causal or real-time speech enhancement (SE). The model jointly exploits noisy acoustic cues and noise-robust visual cues to focus on the desired speaker and improve speech intelligibility. The proposed SE framework is evaluated using a first-of-its-kind AV binaural speech corpus, ASPIRE, recorded in real noisy environments, including cafeteria and restaurant settings. We demonstrate superior performance of our approach in terms of both objective measures and subjective listening tests, over state-of-the-art SE approaches, including recent DNN-based SE models. In addition, our work challenges a popular belief that scarcity of a multi-lingual, large-vocabulary AV corpus and a wide variety of noises is a major bottleneck to building robust language-, speaker- and noise-independent SE systems. We show that a model trained on a synthetic mixture of the benchmark GRID corpus (with 33 speakers and a small English vocabulary) and CHiME 3 noises (comprising bus, pedestrian, cafeteria, and street noises) can generalise well, not only on large-vocabulary corpora with a wide variety of speakers and noises, but also on completely unrelated languages such as Mandarin.
