Journal
SPEECH COMMUNICATION
Volume 140, Pages 29-41
Publisher
ELSEVIER
DOI: 10.1016/j.specom.2022.03.004
Keywords
Multi-channel speaker separation; Beamforming; Dereverberation; Speaker identification; Triplet mining
Funding
- Austrian Science Fund (FWF) [P27803-N15]
- Austrian Science Fund (FWF) [P27803]
The paper introduces the BSSD network, which achieves speaker separation, dereverberation, and speaker identification simultaneously. Various techniques like predefined spatial cues, neural beamforming, embedding vectors, and triplet mining are utilized for these tasks. The system is evaluated based on SI-SDR, WER, and EER metrics.
In this paper, we present the Blind Speech Separation and Dereverberation (BSSD) network, which performs simultaneous speaker separation, dereverberation and speaker identification in a single neural network. Speaker separation is guided by a set of predefined spatial cues. Dereverberation is performed by using neural beamforming, and speaker identification is aided by embedding vectors and triplet mining. We introduce a frequency-domain model which uses complex-valued neural networks, and a time-domain variant which performs beamforming in latent space. Further, we propose a block-online mode to process longer audio recordings, as they occur in meeting scenarios. We evaluate our system in terms of Scale Independent Signal to Distortion Ratio (SI-SDR), Word Error Rate (WER) and Equal Error Rate (EER).
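The abstract names SI-SDR as one of its evaluation metrics without stating the formula. As a rough illustration only, the sketch below computes the commonly used scale-invariant form of the metric for a single-channel estimate against a reference signal. The function name `si_sdr`, the mean removal step, and the `eps` guard are assumptions made for this example; it is not the authors' evaluation code.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB (hypothetical helper, common textbook form)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")

    # Assumption: remove the mean of both signals before comparing them.
    mean_est = sum(estimate) / len(estimate)
    mean_ref = sum(reference) / len(reference)
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]

    # Project the estimate onto the reference to get the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Everything in the estimate that is not the scaled target is distortion.
    distortion = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)

    return 10.0 * math.log10((target_energy + eps) / (distortion_energy + eps))


# Usage: a sine reference with a small additive disturbance.
reference = [math.sin(0.1 * n) for n in range(200)]
disturbance = [0.1 * math.cos(0.37 * n) for n in range(200)]
estimate = [r + d for r, d in zip(reference, disturbance)]
score_db = si_sdr(estimate, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate by a constant gain leaves the score unchanged, which is the property the metric is named for.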