Article

Decomposition and Reorganization of Phonetic Information for Speaker Embedding Learning

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (2023)

Journal

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

Volume 31, Pages 1745-1757

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASLP.2023.3267833

Keywords

Phonetics; Feature extraction; Task analysis; Predictive models; Speech recognition; Data mining; Training; Speaker verification; self-constraint learning; reconstruction learning; phonetic information

Categories

Acoustics; Engineering, Electrical & Electronic

Summary

In this paper, a novel architecture based on self-constraint learning (SCL) and a reconstruction task (RT) is proposed to remove the influence of phonetic information on speaker embedding generation. Experimental results show that the proposed DROP-TDNN system outperforms the state-of-the-art ECAPA-TDNN system on multiple datasets.

Abstract

Speech content is closely related to the stability of speaker embeddings in speaker verification tasks. In this paper, we propose a novel architecture based on self-constraint learning (SCL) and a reconstruction task (RT) to remove the influence of phonetic information on speaker embedding generation. First, SCL is used to reduce the divergence of frame-level features, which can avoid ambiguity between the resulting embeddings of the two utterances being compared. Second, RT is used to further remove phonetic information in the frame-level layers, so that they focus on speaker-discriminative feature transformation. In our experiments, the speaker embedding models were trained on the VoxCeleb2 dataset and evaluated on the VoxCeleb1, LibriSpeech, SITW and VoxMovies datasets. Experimental results on VoxCeleb1 show that the proposed DROP-TDNN system reduced the EER by 7.5% compared to the state-of-the-art ECAPA-TDNN system. The proposed system also outperformed ECAPA-TDNN on SITW, LibriSpeech and VoxMovies under cross-dataset conditions. On SITW, it reduced the EER by 3.4%. On LibriSpeech, it demonstrated the advantage of removing phonetic information under the clean speech condition, with a significant EER reduction of 25.5%. On VoxMovies, it reduced the EER by up to 7.9% under different pronunciation and background conditions.
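All results in the abstract are stated as reductions in EER (equal error rate), the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one common way to estimate EER from a list of trial scores, plus a helper for a relative reduction. It is an illustration only, not the authors' evaluation code: the function names, the toy scores and the threshold sweep are assumptions, and the abstract does not state whether its percentages are relative or absolute, so the relative reading used here is also an assumption.

```python
def compute_eer(scores, labels):
    """Estimate the equal error rate (EER) of a verification system.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Returns the EER as a fraction in [0, 1].
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = None, None
    # Sweep every observed score as a candidate decision threshold.
    for threshold in sorted(set(scores)):
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < threshold for s in targets) / len(targets)
        # False acceptance: a non-target trial scored at or above it.
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent (assumed reading of the abstract)."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical toy trials, not data from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = compute_eer(toy_scores, toy_labels)
```

On real trial lists the two error curves rarely cross exactly at an observed score, so evaluation toolkits usually interpolate between neighbouring thresholds; the closest-gap choice here is a simplification.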
