4.6 Article

Native-Nonnative Voice Conversion by Residual Warping in a Sparse, Anchor-Based Representation

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TASLP.2021.3111568

Keywords

Transforms; Training; Acoustics; Dictionaries; Speech processing; Frequency synthesizers; Frequency conversion; Exemplar; frequency warping; residual; sparse representation; voice conversion

Funding

  1. NSF [1619212, 1623750]
  2. Directorate for Computer & Information Science & Engineering
  3. Division of Information & Intelligent Systems [1619212, 1623750] (funding source: National Science Foundation)

The proposed SABR+Res technique for voice conversion improves synthesis quality by transforming the source residual spectrum to match that of the target speaker, particularly excelling in native-to-nonnative speaker conversion. Additionally, it received favorable evaluations in subjective tests.
Voice conversion (VC) techniques can be used to resynthesize utterances from second-language learners so that they sound as if spoken with a native accent, providing learners with an ideal target to imitate in pronunciation training. In prior work, we presented a low-resource technique called SABR (Sparse, Anchor-Based Representation of Speech), which uses acoustic anchors (one per English phoneme) to represent an utterance as a sparse linear combination of anchors with nonnegative weights. SABR produces intelligible speech, but its compact size limits the acoustic quality of the synthesis, in large part because of the significant residual left out by the compact model. In this article, we propose SABR+Res, which uses a linear combination of frequency warp transforms to bring the source residual spectrum closer to that of the target speaker and then uses the warped residual in synthesis. We evaluate the proposed method on speakers from the ARCTIC and L2-ARCTIC databases and compare it to state-of-the-art exemplar-based and frequency-warping VC methods. We find that SABR+Res had the lowest objective VC error for native-to-nonnative conversion and was preferred in subjective tests. Additionally, when compared to the baseline systems, SABR+Res had a much higher synthesis quality on native-to-nonnative speaker pairs, performing similarly to native-to-native speaker pairs. We discuss the implications for the residual warping system and for applying the residual transform to other exemplar-based systems.
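The idea in the abstract (sparse nonnegative anchor weights, a leftover residual, and a weight-blended frequency warp applied to that residual) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes nonnegative magnitude-spectrum frames, solves for the weights with a generic multiplicative-update rule rather than whatever solver the paper uses, and represents each anchor's warp as a hypothetical lookup table of source-bin positions. All function and parameter names (`sparse_anchor_weights`, `warp_residual`, `convert_frame`, `anchor_warps`, `sparsity`) are invented for illustration.

```python
import numpy as np


def sparse_anchor_weights(frame, anchors, n_iter=200, sparsity=0.01, eps=1e-12):
    """Estimate nonnegative weights w with anchors @ w ~= frame.

    Assumption: frame and anchors are nonnegative (e.g. magnitude spectra).
    Uses multiplicative updates for least squares with an L1 penalty,
    which keeps the weights nonnegative and encourages sparsity.
    """
    n_anchors = anchors.shape[1]
    w = np.full(n_anchors, 1.0 / n_anchors)
    at_x = anchors.T @ frame
    at_a = anchors.T @ anchors
    for _ in range(n_iter):
        w *= at_x / (at_a @ w + sparsity + eps)
    return w


def warp_residual(residual, weights, anchor_warps):
    """Warp the residual with a weighted blend of per-anchor warps.

    Assumption: anchor_warps has shape (n_anchors, n_bins); row k gives,
    for every output bin, the (fractional) source bin to read from.
    """
    w = weights / (weights.sum() + 1e-12)
    combined = w @ anchor_warps  # blended warp, one position per output bin
    bins = np.arange(residual.shape[0])
    return np.interp(combined, bins, residual)


def convert_frame(frame, src_anchors, tgt_anchors, anchor_warps):
    """Swap source anchors for target anchors and add the warped residual."""
    w = sparse_anchor_weights(frame, src_anchors)
    residual = frame - src_anchors @ w  # what the compact model leaves out
    return tgt_anchors @ w + warp_residual(residual, w, anchor_warps)


# Toy data: 16 frequency bins, 4 anchors (a real system would use one per phoneme).
rng = np.random.default_rng(0)
n_bins, n_anchors = 16, 4
src_anchors = rng.uniform(0.1, 1.0, size=(n_bins, n_anchors))
tgt_anchors = rng.uniform(0.1, 1.0, size=(n_bins, n_anchors))
frame = src_anchors @ np.array([0.7, 0.0, 0.3, 0.0]) + rng.uniform(0.0, 0.05, n_bins)
identity_warps = np.tile(np.arange(n_bins, dtype=float), (n_anchors, 1))
converted = convert_frame(frame, src_anchors, tgt_anchors, identity_warps)
```

With identity warps and identical source and target anchors, `convert_frame` returns the input frame essentially unchanged, which is a convenient sanity check before plugging in learned warps.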
