4.6 Article

Multi-Target Extractor and Detector for Unknown-Number Speaker Diarization

Journal

IEEE SIGNAL PROCESSING LETTERS
Volume 30, Pages 638-642

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/LSP.2023.3279781

Keywords

Detectors; Training; Feature extraction; Mixers; Hidden Markov models; Oral communication; Data mining; Speaker diarization; speaker representations

This study proposes a neural architecture that extracts speaker representations and detects the presence of each speaker on a frame-by-frame basis, regardless of the number of speakers in a conversation. In tests on the CALLHOME corpus, the model outperforms most previously proposed methods, and in a more challenging case with 2 to 7 simultaneous speakers it achieves 6.4% to 30.9% relative diarization error rate reductions over several typical baselines.
Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis regardless of the number of speakers in a conversation. A speaker representation (called z-vector) extractor and a time-speaker contextualizer, implemented by a residual network and processing data in both temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging case with simultaneous speakers ranging from 2 to 7 show that our model achieves 6.4% to 30.9% relative diarization error rate reductions over several typical baselines.
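
The abstract reports its gains as relative diarization error rate (DER) reductions. As a minimal sketch of how such a figure is usually derived, the snippet below computes the relative reduction of a system over a baseline. The function name `relative_der_reduction` and the input values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_der_reduction(baseline_der: float, system_der: float) -> float:
    """Return the relative DER reduction, in percent, of a system over a baseline.

    Both inputs are DER values expressed in the same unit (e.g. percent).
    """
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - system_der) / baseline_der * 100.0


# Hypothetical DER values (in percent), for illustration only;
# these are NOT numbers reported in the paper.
print(relative_der_reduction(20.0, 15.0))  # 25.0
```

A relative reduction is distinct from an absolute one: in the hypothetical case above, the absolute DER drop is 5 points, while the relative reduction is 25%.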
