Article

Jointly Trained Conversion Model With LPCNet for Any-to-One Voice Conversion Using Speaker-Independent Linguistic Features

Journal

IEEE ACCESS
Volume 10, Pages 134029-134037

Publisher

IEEE (Institute of Electrical and Electronics Engineers Inc.)
DOI: 10.1109/ACCESS.2022.3226350

Keywords

Automatic speech recognition; conversion model; joint training; neural vocoder; voice conversion

This study proposes a joint training scheme for an any-to-one voice conversion system with LPCNet to enhance the naturalness, speaker similarity, and intelligibility of converted speech. By incorporating speaker-independent features derived from an automatic speech recognition model, the conversion model accurately captures the linguistic contents of the given utterance and maps them to the acoustic representations used by LPCNet. Experimental results demonstrate that the proposed model enables real-time voice conversion and outperforms existing state-of-the-art approaches.
We propose a joint training scheme for an any-to-one voice conversion (VC) system with LPCNet to improve the naturalness, speaker similarity, and intelligibility of the converted speech. Recent advances in neural vocoders, such as LPCNet, have enabled the production of more natural and clearer speech. However, the other components of a typical VC system, such as the conversion model, are often designed independently. Each component is therefore trained with a separate strategy whose objective does not directly correlate with that of the vocoder, preventing the full potential of LPCNet from being exploited. This problem is addressed by jointly training the conversion model and LPCNet. To accurately capture the linguistic content of a given utterance, we use speaker-independent (SI) features derived from an automatic speech recognition (ASR) model trained on a mixed-language speech corpus. A conversion model then maps the SI features to the acoustic representations used as input features to LPCNet. The possibility of synthesizing cross-language speech with the proposed approach is also explored. Experimental results show that the proposed model achieves real-time VC, unlocking the full potential of LPCNet and outperforming the state of the art.
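The core idea described in the abstract, namely training the conversion model and the vocoder under one combined objective rather than separately, can be sketched numerically. In the toy example below, a linear map stands in for the conversion model (SI features to acoustic features) and a single linear layer stands in for the vocoder front-end; the feature dimensions, the linear architectures, and the loss weighting `lam` are illustrative assumptions, not the paper's actual model. The point of the sketch is only the gradient flow: in the joint loss, the conversion weights also receive gradients backpropagated through the vocoder term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not the paper's): 256-dim ASR
# bottleneck (SI) features -> 20-dim LPCNet-style acoustic features,
# over T frames.
D_SI, D_AC, T = 256, 20, 100

W_conv = rng.normal(scale=0.01, size=(D_AC, D_SI))  # conversion model (linear sketch)
W_voc = rng.normal(scale=0.01, size=(1, D_AC))      # stand-in for the vocoder

si_feats = rng.normal(size=(D_SI, T))    # speaker-independent features (from ASR)
ac_target = rng.normal(size=(D_AC, T))   # target speaker's acoustic features
wav_target = rng.normal(size=(1, T))     # toy stand-in for waveform targets


def joint_loss(W_conv, W_voc, lam=0.5):
    """Joint objective: acoustic-feature loss + weighted vocoder loss,
    both computed on the *converted* features."""
    ac_pred = W_conv @ si_feats
    return (np.mean((ac_pred - ac_target) ** 2)
            + lam * np.mean((W_voc @ ac_pred - wav_target) ** 2))


def joint_step(W_conv, W_voc, lr=1e-3, lam=0.5):
    """One gradient step on the joint loss. Constant factors of the MSE
    gradients are folded into lr; the key term is W_voc.T @ err_wav,
    which routes the vocoder error back into the conversion model."""
    ac_pred = W_conv @ si_feats
    err_ac = ac_pred - ac_target
    err_wav = W_voc @ ac_pred - wav_target
    g_conv = (err_ac + lam * (W_voc.T @ err_wav)) @ si_feats.T
    g_voc = err_wav @ ac_pred.T
    return W_conv - lr * g_conv / T, W_voc - lr * g_voc / T


l0 = joint_loss(W_conv, W_voc)
for _ in range(200):
    W_conv, W_voc = joint_step(W_conv, W_voc)
l1 = joint_loss(W_conv, W_voc)  # lower than l0: both modules improved together
```

In separate training, `W_conv` would only see the first loss term and the vocoder would be trained on ground-truth acoustic features, so it would never adapt to the conversion model's actual outputs; the joint loss removes that mismatch.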
