Article

Learning and controlling the source-filter representation of speech with a variational autoencoder

SPEECH COMMUNICATION (2023)

Journal

SPEECH COMMUNICATION

Volume 148, Pages 53-65

Publisher

ELSEVIER

DOI: 10.1016/j.specom.2023.02.005

Keywords

Representation learning; Deep generative models; Variational autoencoder; Source-filter model

Categories

Acoustics; Computer Science, Interdisciplinary Applications

Abstract

Understanding and controlling latent representations in deep generative models is a challenging yet important problem. In this work, the source-filter model of speech production is shown to arise naturally as orthogonal subspaces of the VAE latent space. A method is proposed to identify and control the source-filter speech factors within these latent subspaces, along with a robust f0 estimation method.

Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency f0 and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding f0 and the first three formant frequencies, we show that these subspaces are orthogonal, and based on this orthogonality, we develop a method to accurately and independently control the source-filter speech factors within the latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on f0 and the formant frequencies, and which is applied to the transformation of speech signals. Finally, we also propose a robust f0 estimation method that exploits the projection of a speech signal onto the learned latent subspace associated with f0.
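The idea of orthogonal latent subspaces that can be edited independently can be illustrated with a small toy sketch. The abstract does not say how the subspaces are estimated, so the code below assumes plain PCA on latent vectors in which only one factor varies. The random arrays stand in for VAE encodings of synthesized speech, and every name (`fit_subspace`, `overlap`, `move_in_subspace`, `Z_f0`, `Z_formant`) is hypothetical rather than taken from the paper.

```python
import numpy as np

def fit_subspace(Z, k):
    """Return (mean, U); the k orthonormal columns of U span the top-k PCA directions of Z."""
    mu = Z.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Z - mu, full_matrices=False)
    return mu, Vt[:k].T

def overlap(U_a, U_b):
    """Frobenius norm of U_a^T U_b; a value of 0 means the subspaces are orthogonal."""
    return float(np.linalg.norm(U_a.T @ U_b))

def move_in_subspace(z, mu, U, new_coords):
    """Set the coordinates of z inside span(U) to new_coords; the rest of z is untouched."""
    old_coords = U.T @ (z - mu)
    return z + U @ (new_coords - old_coords)

# Toy stand-in for encoder outputs (assumption: no real VAE or speech is involved).
rng = np.random.default_rng(0)
latent_dim = 16
base = rng.normal(size=latent_dim)

# Latent vectors where only the "f0" factor varies (here: coordinates 0 and 1).
Z_f0 = np.tile(base, (200, 1))
Z_f0[:, 0:2] += rng.normal(size=(200, 2))

# Latent vectors where only a "formant" factor varies (here: coordinates 2 and 3).
Z_formant = np.tile(base, (200, 1))
Z_formant[:, 2:4] += rng.normal(size=(200, 2))

mu_f0, U_f0 = fit_subspace(Z_f0, k=2)
mu_formant, U_formant = fit_subspace(Z_formant, k=2)

# Edit the "f0" coordinates of an arbitrary latent vector.
z = rng.normal(size=latent_dim)
z_new = move_in_subspace(z, mu_f0, U_f0, np.array([1.0, -0.5]))
```

In this toy setting the two subspaces are orthogonal by construction, so changing the `f0` coordinates leaves the projection onto the `formant` subspace unchanged. That is the property the abstract relies on for independent control. In a real system, `z_new` would be passed to the VAE decoder to produce the transformed spectrogram.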
