Proceedings Paper

FCH-TTS: Fast, Controllable and High-quality Non-Autoregressive Text-to-Speech Synthesis

Publisher

IEEE
DOI: 10.1109/IJCNN55064.2022.9892512

Keywords

Text-to-speech; non-autoregressive; fast; controllable

Funding

  1. Major Scientific Research Project of the State Language Commission in the 13th Five-Year Plan [WT135-38, 2020AAA0107904]


Abstract

Inspired by the success of the non-autoregressive speech synthesis model FastSpeech, we propose FCH-TTS, a fast, controllable and universal neural text-to-speech (TTS) model capable of generating high-quality spectrograms. The basic architecture of FCH-TTS is similar to that of FastSpeech, but FCH-TTS uses a simple yet effective attention-based soft alignment mechanism in place of FastSpeech's complex teacher model, allowing it to adapt better to different languages. In addition to control over voice speed and prosody, a fusion module is designed to better model speaker features and obtain the desired timbre, and several specialized loss functions are applied to ensure the quality of the output mel-spectrogram. Experimental results on the LJSpeech dataset show that FCH-TTS achieves the fastest inference speed of all baseline models while also achieving the best speech quality. The model's controllability with respect to prosody, voice speed and timbre was validated on several datasets, and its good performance on a low-resource Tibetan dataset demonstrates its universality.
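The abstract's key idea is replacing FastSpeech's teacher-derived hard alignments with an attention-based soft alignment between text tokens and spectrogram frames. The paper's exact formulation is not given here, so the sketch below is only a minimal, hypothetical illustration of the general mechanism: each mel-frame query attends over the encoder's text states via scaled dot-product attention, and soft per-token durations fall out as the attention mass each token receives.

```python
import numpy as np

def soft_alignment(text_states, mel_queries, temperature=1.0):
    """Hypothetical attention-based soft alignment (not the authors' exact
    formulation): each mel-frame query attends over the text encoder states,
    yielding a soft text-to-spectrogram alignment with no external teacher.

    text_states: (T_text, d) encoder outputs, one row per input token
    mel_queries: (T_mel, d) decoder-side queries, one row per mel frame
    Returns a (T_mel, T_text) attention matrix; each row sums to 1.
    """
    d = text_states.shape[1]
    scores = mel_queries @ text_states.T / (np.sqrt(d) * temperature)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights

# Toy example: 5 text tokens, 20 mel frames, 16-dim states.
rng = np.random.default_rng(0)
align = soft_alignment(rng.normal(size=(5, 16)), rng.normal(size=(20, 16)))

# Soft per-token durations: expected number of frames assigned to each token.
durations = align.sum(axis=0)
```

A lower temperature sharpens the alignment toward a near-hard monotonic assignment, which is one plausible way such a model could expose voice-speed control at inference time.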
