Article

Native-Nonnative Voice Conversion by Residual Warping in a Sparse, Anchor-Based Representation

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (2021)

Journal

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

Volume 29, Issue -, Pages 3040-3051

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASLP.2021.3111568

Keywords

Transforms; Training; Acoustics; Dictionaries; Speech processing; Frequency synthesizers; Frequency conversion; Exemplar; frequency warping; residual; sparse representation; voice conversion

Categories

Acoustics; Engineering, Electrical & Electronic

Funding

NSF [1619212, 1623750]
Direct For Computer & Info Scie & Enginr
Div Of Information & Intelligent Systems [1619212, 1623750] Funding Source: National Science Foundation

Summary

The proposed SABR+Res technique for voice conversion improves synthesis quality by warping the source residual spectrum toward that of the target speaker, and it performs particularly well on native-to-nonnative speaker conversion. It was also preferred in subjective tests.

Abstract

Voice conversion (VC) techniques can resynthesize utterances from second-language learners so that they sound as if spoken with a native accent, providing learners with an ideal target to imitate in pronunciation training. In prior work, we presented a low-resource technique called SABR (Sparse, Anchor-Based Representation of Speech), which uses acoustic anchors (one per English phoneme) to represent an utterance as a sparse linear combination of those anchors with nonnegative weights. SABR produces intelligible speech, but its compact size limits the acoustic quality of the synthesis, in large part because of the significant residual left out by the compact model. In this article, we propose SABR+Res, which uses a linear combination of frequency warp transforms to convert the source residual spectrum to be closer to that of the target speaker and then uses it in synthesis. We evaluate the proposed method on speakers from the ARCTIC and L2-ARCTIC databases and compare it to state-of-the-art exemplar-based and frequency-warping VC methods. We find that SABR+Res had the lowest objective VC error for native-to-nonnative conversion and was preferred in subjective tests. Additionally, when compared to the baseline systems, SABR+Res had a much higher synthesis quality on native-to-nonnative speaker pairs, performing similarly to native-to-native speaker pairs. We discuss the implications of the residual warping system and of applying the residual transform to other exemplar-based systems.
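The idea described in the abstract can be illustrated with a toy, single-frame sketch. This is not the authors' implementation: the record gives no algorithmic detail, so the solver (projected gradient descent with an L1 penalty), the representation of a frequency warp as fractional source-bin positions, and every function and parameter name below are assumptions made purely for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reconstruct(anchors, weights):
    """Weighted sum of anchor spectra."""
    dim = len(anchors[0])
    return [sum(w * a[i] for w, a in zip(weights, anchors)) for i in range(dim)]


def sparse_nonneg_weights(frame, anchors, lam=0.01, step=0.1, iters=500):
    """Assumed solver: projected gradient descent on
    0.5 * ||frame - sum_k w_k * anchor_k||^2 + lam * sum_k w_k, with w_k >= 0."""
    weights = [0.0] * len(anchors)
    for _ in range(iters):
        err = [r - f for r, f in zip(reconstruct(anchors, weights), frame)]
        weights = [max(0.0, w - step * (dot(a, err) + lam))
                   for w, a in zip(weights, anchors)]
    return weights


def apply_warp(spectrum, positions):
    """Resample a spectrum at fractional bin positions (linear interpolation)."""
    last = len(spectrum) - 1
    out = []
    for p in positions:
        p = min(max(p, 0.0), float(last))
        lo = int(p)
        hi = min(lo + 1, last)
        frac = p - lo
        out.append((1.0 - frac) * spectrum[lo] + frac * spectrum[hi])
    return out


def convert_frame(frame, src_anchors, tgt_anchors, anchor_warps, lam=0.01):
    """Convert one spectral frame: swap anchors, then add the warped residual."""
    weights = sparse_nonneg_weights(frame, src_anchors, lam)
    residual = [f - r for f, r in zip(frame, reconstruct(src_anchors, weights))]
    total = sum(weights)
    dim = len(frame)
    if total > 0.0:
        # Hypothetical combination rule: weight-averaged per-anchor warps.
        positions = [sum(w * warp[i] for w, warp in zip(weights, anchor_warps)) / total
                     for i in range(dim)]
    else:
        positions = [float(i) for i in range(dim)]  # fall back to identity warp
    warped = apply_warp(residual, positions)
    converted = [t + r for t, r in zip(reconstruct(tgt_anchors, weights), warped)]
    return converted, weights
```

Calling `convert_frame` with one anchor per phoneme for each speaker and one warp per anchor returns the converted frame together with the weights, which are shared between the source and target sides. A real system would operate on full spectral envelopes and learn the warps from data; here the warps are simply supplied by the caller.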
