☆ 4.6 Article

Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages

IEEE ACCESS (2022)

期刊

IEEE ACCESS

卷 10, 期 -, 页码 5895-5911

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/ACCESS.2022.3141200

关键词

Adaptation models; Transfer learning; Training; Task analysis; Deep learning; Data models; Training data; Deep neural network; low-resource; multi-speaker; multilingual; partial network-based deep transfer learning; speaker reconstruction loss; style control; text-to-speech; zero-shot speaker adaptation

类别

Computer Science, Information Systems Engineering, Electrical & Electronic Telecommunications

资金

Pendampingan Publikasi Internasional (PPI) Q1 from Universitas Indonesia [NKB-547/UN2.RST/HKP.05.00/2021]
NVIDIA DGX-1 computing facilities at the Tokopedia-UI AI Center of Excellence

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

智能总结 New
摘要

This study proposes a novel training strategy and speech synthesis model to address the issues of data scarcity in low-resource languages and unsatisfactory performance in zero-shot speaker adaptation. Through the use of multi-stage transfer learning and explicit style control, the proposed model successfully improves the intelligibility of synthesized speech and speaker similarity.

Deep neural network (DNN)-based systems generally require large amounts of training data, so they have data scarcity problems in low-resource languages. Recent studies have succeeded in building zero-shot multi-speaker DNN-based TTS on high-resource languages, but they still have unsatisfactory performance on unseen speakers. This study addresses two main problems: overcoming the problem of data scarcity in the DNN-based TTS on low-resource languages and improving the performance of zero-shot speaker adaptation for unseen speakers. We propose a novel multi-stage transfer learning strategy using a partial network-based deep transfer learning to overcome the low-resource problem by utilizing pre-trained monolingual single-speaker TTS and d-vector speaker encoder on a high-resource language as the source domain. Meanwhile, to improve the performance of zero-shot speaker adaptation, we propose a new TTS model that incorporates an explicit style control from the target speaker for TTS conditioning and an utterance-level speaker reconstruction loss during TTS training. We use publicly available speech datasets for experiments. We show that our proposed training strategy is able to effectively train the TTS models using a limited amount of training data of low-resource target languages. The models trained using the proposed transfer learning successfully produce intelligible natural speech sounds, while in contrast using standard training fails to make the models synthesize understandable speech. We also demonstrate that our proposed style encoder network and speaker reconstruction loss significantly improves speaker similarity in zero-shot speaker adaptation task compared to the baseline model. Overall, our proposed TTS model and training strategy has succeeded in increasing the speaker cosine similarity of the synthesized speech on the unseen speakers test set by 0.468 and 0.279 in native and foreign languages respectively.

Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages

期刊

IEEE ACCESS

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages

期刊

IEEE ACCESS

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文