Proceedings Paper

FCH-TTS: Fast, Controllable and High-quality Non-Autoregressive Text-to-Speech Synthesis

2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) (2022)

Journal

2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)

Publisher

IEEE

DOI: 10.1109/IJCNN55064.2022.9892512

Keywords

Text-to-speech; non-autoregressive; fast; controllable

Categories

Computer Science, Artificial Intelligence; Computer Science, Hardware & Architecture; Engineering, Electrical & Electronic; Neurosciences

Funding

Major Scientific Research Project of the State Language Commission in the 13th Five-Year Plan [WT135-38, 2020AAA0107904]

Summary

Inspired by the success of FastSpeech, this paper proposes FCH-TTS, a fast, controllable, and universal neural text-to-speech model that can generate high-quality spectrograms. Unlike FastSpeech, FCH-TTS uses a simpler attention-based soft alignment mechanism to improve its adaptability to different languages. It also introduces a fusion module to better model speaker features and ensure the desired timbre. Experimental results demonstrate that FCH-TTS achieves the fastest inference speed and the best speech quality compared to baseline models.

Abstract

Inspired by the success of the non-autoregressive speech synthesis model FastSpeech, we propose FCH-TTS, a fast, controllable and universal neural text-to-speech (TTS) model capable of generating high-quality spectrograms. The basic architecture of FCH-TTS is similar to that of FastSpeech, but FCH-TTS replaces the complex teacher model in FastSpeech with a simple yet effective attention-based soft alignment mechanism, allowing the model to adapt better to different languages. In addition to controlling voice speed and prosody, a fusion module is designed to better model speaker features in order to obtain the desired timbre. Meanwhile, several special loss functions are applied to ensure the quality of the output mel-spectrogram. Experimental results on the LJSpeech dataset show that FCH-TTS achieves the fastest inference speed compared to all baseline models, while also achieving the best speech quality. In addition, the controllability of the model with respect to prosody, voice speed and timbre is validated on several datasets, and its good performance on a low-resource Tibetan dataset demonstrates the universality of the model.
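To make the idea of soft alignment with speed control more concrete, the following is a minimal sketch in plain Python. It is not the FCH-TTS implementation: the abstract does not describe how the attention scores are computed, so the position-based scores used here are a hypothetical stand-in for learned scores, and the names `soft_align`, `frames_for_speed` and `sharpness` are illustrative assumptions. The sketch only shows the general pattern: each output frame takes a softmax-weighted mixture of the text encodings, and voice speed is changed by choosing how many output frames to generate.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_align(text_enc, num_frames, sharpness=4.0):
    """Expand N token encodings to num_frames frames using soft attention.

    text_enc is a list of N token vectors (lists of floats). Each frame
    attends to every token; the score favours the token whose relative
    position is closest to the frame's relative position. This position
    prior is an assumption standing in for learned attention scores.

    Returns (weights, frames): weights has shape num_frames x N with each
    row summing to 1, frames has shape num_frames x D.
    """
    n = len(text_enc)
    dim = len(text_enc[0])
    weights, frames = [], []
    for t in range(num_frames):
        frame_pos = (t + 0.5) / num_frames
        scores = [
            -sharpness * n * abs(frame_pos - (i + 0.5) / n) for i in range(n)
        ]
        w = softmax(scores)
        weights.append(w)
        # Each frame is the attention-weighted average of the token vectors.
        frames.append(
            [sum(w[i] * text_enc[i][d] for i in range(n)) for d in range(dim)]
        )
    return weights, frames


def frames_for_speed(base_frames, speed):
    """Return the frame count for a speed factor (speed > 1 is faster)."""
    return max(1, round(base_frames / speed))


if __name__ == "__main__":
    # Three toy token encodings of dimension 2 (made-up values).
    text_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights, frames = soft_align(text_enc, frames_for_speed(12, 1.5))
    for row in weights:
        print(" ".join(f"{x:.2f}" for x in row))
```

Because the alignment is a weighted mixture rather than a hard repetition of tokens, changing the number of frames stretches or compresses the whole utterance smoothly. In a trained model the scores would come from the network rather than from a fixed position prior.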
