Journal
Publisher
IEEE
DOI: 10.1109/IJCNN55064.2022.9892512
Keywords
Text-to-speech; non-autoregressive; fast; controllable
Category
Funding
- Major Scientific Research Project of the State Language Commission in the 13th Five-Year Plan [WT135-38, 2020AAA0107904]
Inspired by the success of FastSpeech, this paper proposes FCH-TTS, a fast, controllable, and universal neural text-to-speech model that can generate high-quality spectrograms. Unlike FastSpeech, FCH-TTS uses a simpler attention-based soft alignment mechanism to improve its adaptability to different languages. It also introduces a fusion module to better model speaker features and ensure the desired timbre. Experimental results demonstrate that FCH-TTS achieves the fastest inference speed and the best speech quality compared to baseline models.
Inspired by the success of the non-autoregressive speech synthesis model FastSpeech, we propose FCH-TTS, a fast, controllable and universal neural text-to-speech (TTS) model capable of generating high-quality spectrograms. The basic architecture of FCH-TTS is similar to that of FastSpeech, but FCH-TTS replaces the complex teacher model in FastSpeech with a simple yet effective attention-based soft alignment mechanism, which lets the model adapt more easily to different languages. Beyond controlling voice speed and prosody, FCH-TTS includes a fusion module that models speaker features more accurately in order to obtain the desired timbre. Several special loss functions are also applied to ensure the quality of the output mel-spectrogram. Experimental results on the LJSpeech dataset show that FCH-TTS achieves the fastest inference speed of all the models compared, while also achieving the best speech quality. In addition, the controllability of the model with respect to prosody, voice speed and timbre was validated on several datasets, and its good performance on a low-resource Tibetan dataset demonstrates the universality of the model.
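The abstract does not spell out how alignment and speed control work in FCH-TTS, so the following is only a minimal sketch of the general idea as it appears in FastSpeech-style models: per-token durations are read from an attention matrix, and token states are repeated according to those durations, scaled by a speed factor. The function names, the toy attention matrix, and the `speed` parameter are all assumptions made for illustration, not the paper's implementation.

```python
def durations_from_attention(attn):
    """Count, for each text token, how many mel frames attend to it most.

    attn[t][n] is the attention weight of mel frame t on text token n
    (a hypothetical, already-computed soft alignment).
    """
    n_tokens = len(attn[0])
    durations = [0] * n_tokens
    for frame in attn:
        best = max(range(n_tokens), key=lambda n: frame[n])
        durations[best] += 1
    return durations


def length_regulate(token_states, durations, speed=1.0):
    """Repeat each token state round(d / speed) times.

    Assumed convention: speed > 1 shortens the output (faster speech),
    speed < 1 lengthens it (slower speech).
    """
    frames = []
    for state, d in zip(token_states, durations):
        frames.extend([state] * max(0, round(d / speed)))
    return frames


# Toy alignment: 8 mel frames over 3 text tokens.
attn = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.6, 0.4],
    [0.0, 0.3, 0.7],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
]
tokens = ["a", "b", "c"]

durations = durations_from_attention(attn)
normal = length_regulate(tokens, durations)
slow = length_regulate(tokens, durations, speed=0.5)
```

With this toy matrix the three tokens receive 2, 3 and 3 frames, so `normal` has 8 entries and `slow` has 16. A real model would expand hidden-state vectors rather than strings and would predict the durations at inference time instead of reading them from a stored alignment.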