Article

Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages

Journal

IEEE ACCESS
Volume 10, Pages 5895-5911

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ACCESS.2022.3141200

Keywords

Adaptation models; Transfer learning; Training; Task analysis; Deep learning; Data models; Training data; Deep neural network; low-resource; multi-speaker; multilingual; partial network-based deep transfer learning; speaker reconstruction loss; style control; text-to-speech; zero-shot speaker adaptation

Funding

  1. Pendampingan Publikasi Internasional (PPI) Q1 from Universitas Indonesia [NKB-547/UN2.RST/HKP.05.00/2021]
  2. NVIDIA DGX-1 computing facilities at the Tokopedia-UI AI Center of Excellence

This study proposes a novel training strategy and speech synthesis model to address the issues of data scarcity in low-resource languages and unsatisfactory performance in zero-shot speaker adaptation. Through the use of multi-stage transfer learning and explicit style control, the proposed model successfully improves the intelligibility of synthesized speech and speaker similarity.
Deep neural network (DNN)-based systems generally require large amounts of training data, so they suffer from data scarcity in low-resource languages. Recent studies have succeeded in building zero-shot multi-speaker DNN-based TTS on high-resource languages, but their performance on unseen speakers remains unsatisfactory. This study addresses two main problems: overcoming data scarcity in DNN-based TTS on low-resource languages and improving the performance of zero-shot speaker adaptation for unseen speakers. We propose a novel multi-stage transfer learning strategy using partial network-based deep transfer learning to overcome the low-resource problem by utilizing a pre-trained monolingual single-speaker TTS and a d-vector speaker encoder on a high-resource language as the source domain. Meanwhile, to improve the performance of zero-shot speaker adaptation, we propose a new TTS model that incorporates explicit style control from the target speaker for TTS conditioning and an utterance-level speaker reconstruction loss during TTS training. We use publicly available speech datasets for experiments. We show that our proposed training strategy is able to effectively train the TTS models using a limited amount of training data in low-resource target languages. The models trained using the proposed transfer learning successfully produce intelligible, natural-sounding speech, whereas standard training fails to produce models that synthesize understandable speech. We also demonstrate that our proposed style encoder network and speaker reconstruction loss significantly improve speaker similarity in the zero-shot speaker adaptation task compared to the baseline model. Overall, our proposed TTS model and training strategy have succeeded in increasing the speaker cosine similarity of the synthesized speech on the unseen speakers test set by 0.468 and 0.279 in native and foreign languages, respectively.
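
The abstract reports its gains as speaker cosine similarity and mentions an utterance-level speaker reconstruction loss. As a rough illustration only, the Python sketch below computes cosine similarity between two speaker embeddings (such as d-vectors) and derives a loss as one minus that similarity. The function names, the toy three-dimensional vectors, and the `1 - cosine` form of the loss are assumptions made for illustration; the abstract does not state the exact formulation the authors use.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speaker_reconstruction_loss(
    reference: Sequence[float], synthesized: Sequence[float]
) -> float:
    """Hypothetical utterance-level loss: 1 - cosine similarity.

    Assumption: the loss compares one embedding of the reference
    utterance with one embedding of the synthesized utterance.
    """
    return 1.0 - cosine_similarity(reference, synthesized)


# Toy embeddings (real d-vectors typically have hundreds of dimensions).
reference_embedding = [1.0, 0.0, 0.0]
synthesized_embedding = [1.0, 1.0, 0.0]

similarity = cosine_similarity(reference_embedding, synthesized_embedding)
loss = speaker_reconstruction_loss(reference_embedding, synthesized_embedding)
print(f"similarity={similarity:.4f} loss={loss:.4f}")
```

Under this assumed form, a synthesized utterance whose embedding points in the same direction as the reference speaker's embedding gives a loss near 0, and the loss grows as the two embeddings diverge.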
