4.6 Article

Decomposition and Reorganization of Phonetic Information for Speaker Embedding Learning

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TASLP.2023.3267833

Keywords

Phonetics; Feature extraction; Task analysis; Predictive models; Speech recognition; Data mining; Training; Speaker verification; self-constraint learning; reconstruction learning; phonetic information

In this paper, a novel architecture based on self-constraint learning (SCL) and reconstruction task (RT) is proposed to remove the influence of phonetic information on speaker embedding generation. Experimental results show that the proposed DROP-TDNN system outperforms the state-of-the-art ECAPA-TDNN system on multiple datasets.
Speech content is closely related to the stability of speaker embeddings in speaker verification tasks. In this paper, we propose a novel architecture based on self-constraint learning (SCL) and reconstruction task (RT) to remove the influence of phonetic information on speaker embedding generation. First, SCL is used to reduce the divergence of frame-level features, which can avoid ambiguity between the resulting embeddings of the two utterances being compared. Second, RT is used to further remove phonetic information in frame-level layers, focusing on speaker-discriminative feature transformation. In our experiments, the speaker embedding models were trained on the VoxCeleb2 dataset and evaluated on the VoxCeleb1, Librispeech, SITW and VoxMovies datasets. Experimental results on VoxCeleb1 show that the proposed DROP-TDNN system reduced the EER by 7.5%, compared to the state-of-the-art ECAPA-TDNN system. Furthermore, the proposed DROP-TDNN system also outperformed the ECAPA-TDNN system in the experiments on SITW, Librispeech and VoxMovies under cross-dataset conditions. In the experiments on SITW, the proposed system reduced the EER by 3.4% compared to the ECAPA-TDNN system. In the experiments on Librispeech, the proposed system demonstrated the advantage of removing phonetic information under the clean speech condition, with a significant reduction of 25.5% in EER compared to the ECAPA-TDNN system. In the experiments on VoxMovies, the proposed system reduced the EER by up to 7.9% compared to the ECAPA-TDNN system under different pronunciation and background conditions.
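The results above are all stated as reductions in equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how EER can be approximated from verification trial scores, and of how a percentage reduction between two systems would be computed if the reported figures are relative reductions (an assumption; the abstract does not say). The function names, the toy scores, and the example EER values are hypothetical and are not taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: scores for same-speaker trials (higher means more similar).
    nontarget_scores: scores for different-speaker trials.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: same-speaker trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent (assumed reading of 'reduced by X%')."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


if __name__ == "__main__":
    # Toy trial scores, purely illustrative.
    targets = [0.9, 0.8, 0.7, 0.4]
    nontargets = [0.6, 0.3, 0.2, 0.1]
    print("EER:", equal_error_rate(targets, nontargets))

    # Hypothetical EER values in percent, not figures from the paper.
    print("Relative reduction (%):", relative_reduction(2.0, 1.5))
```

Evaluation toolkits usually interpolate the ROC curve rather than sweeping raw thresholds, so this sketch is best read as an illustration of the metric, not as the evaluation procedure used in the paper.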
