Proceedings Paper

FCH-TTS: Fast, Controllable and High-quality Non-Autoregressive Text-to-Speech Synthesis

Publisher

IEEE
DOI: 10.1109/IJCNN55064.2022.9892512

Keywords

Text-to-speech; non-autoregressive; fast; controllable

Funding

  1. Major Scientific Research Project of the State Language Commission in the 13th Five-Year Plan [WT135-38, 2020AAA0107904]


Abstract

Inspired by the success of the non-autoregressive speech synthesis model FastSpeech, we propose FCH-TTS, a fast, controllable and universal neural text-to-speech (TTS) model capable of generating high-quality spectrograms. The basic architecture of FCH-TTS is similar to that of FastSpeech, but FCH-TTS uses a simple yet effective attention-based soft alignment mechanism in place of FastSpeech's complex teacher model, allowing it to adapt better to different languages. In addition to control over voice speed and prosody, a fusion module is designed to better model speaker features and obtain the desired timbre, and several specialized loss functions are applied to ensure the quality of the output mel-spectrogram. Experimental results on the LJSpeech dataset show that FCH-TTS achieves the fastest inference speed of all baseline models while also achieving the best speech quality. The model's controllability with respect to prosody, voice speed and timbre was validated on several datasets, and its good performance on a low-resource Tibetan dataset demonstrates its universality.
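The abstract's key idea is replacing FastSpeech's teacher-derived hard alignments with an attention-based soft alignment between text tokens and spectrogram frames. The paper's exact formulation is not given here, so the sketch below is only a minimal, hypothetical illustration of the general mechanism: each mel-frame query attends over the encoder's text states via scaled dot-product attention, and soft per-token durations fall out as the attention mass each token receives.

```python
import numpy as np

def soft_alignment(text_states, mel_queries, temperature=1.0):
    """Hypothetical attention-based soft alignment (not the authors' exact
    formulation): each mel-frame query attends over the text encoder states,
    yielding a soft text-to-spectrogram alignment with no external teacher.

    text_states: (T_text, d) encoder outputs, one row per input token
    mel_queries: (T_mel, d) decoder-side queries, one row per mel frame
    Returns a (T_mel, T_text) attention matrix; each row sums to 1.
    """
    d = text_states.shape[1]
    scores = mel_queries @ text_states.T / (np.sqrt(d) * temperature)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights

# Toy example: 5 text tokens, 20 mel frames, 16-dim states.
rng = np.random.default_rng(0)
align = soft_alignment(rng.normal(size=(5, 16)), rng.normal(size=(20, 16)))

# Soft per-token durations: expected number of frames assigned to each token.
durations = align.sum(axis=0)
```

A lower temperature sharpens the alignment toward a near-hard monotonic assignment, which is one plausible way such a model could expose voice-speed control at inference time.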
