Article

The limits of the Mean Opinion Score for speech synthesis evaluation

Journal

COMPUTER SPEECH AND LANGUAGE
Volume 84, Issue -, Pages -

Publisher

ACADEMIC PRESS LTD- ELSEVIER SCIENCE LTD
DOI: 10.1016/j.csl.2023.101577

Keywords

Speech synthesis evaluation; Absolute Category Rating; Mean Opinion Score; Blizzard Challenge

The release of WaveNet and Tacotron has greatly impacted the speech synthesis field by significantly improving the quality of synthetic speech. However, the evaluation protocol currently used to measure this quality, Absolute Category Rating (ACR) with its Mean Opinion Score (MOS), has sparked controversy. To determine the reliability of MOS, a series of experiments was conducted, examining the stability of MOS over time, the influence of lower-quality systems on MOS, the influence of modern technologies on past system scores, and the evolution of MOS for modern technologies in isolation. The results suggest the need for new evaluation protocols better suited for analyzing modern speech synthesis technologies.
The release of WaveNet and Tacotron has forever transformed the speech synthesis landscape. Thanks to these game-changing innovations, the quality of synthetic speech has reached unprecedented levels. However, to measure this leap in quality, an overwhelming majority of studies still rely on the Absolute Category Rating (ACR) protocol and compare systems using its output, the Mean Opinion Score (MOS). This protocol is not without controversy, and as the current state-of-the-art synthesis systems now produce outputs remarkably close to human speech, it is now vital to determine how reliable this score is.

To do so, we conducted a series of four experiments replicating and following the 2013 edition of the Blizzard Challenge. With these experiments, we asked four questions about the MOS: How stable is the MOS of a system across time? How do the scores of lower-quality systems influence the MOS of higher-quality systems? How does the introduction of modern technologies influence the scores of past systems? How does the MOS of modern technologies evolve in isolation?

The results of our experiments are manifold. Firstly, we verify the superiority of modern technologies in comparison to historical synthesis. Then, we show that despite its origin as an absolute category rating, MOS is a relative score. While minimal variations are observed during the replication of the 2013-EH2 task, these variations can still lead to different conclusions for the intermediate systems. Our experiments also illustrate the sensitivity of MOS to the presence or absence of lower and higher anchors. Overall, our experiments suggest that we may have reached the end of a cul-de-sac by only evaluating the overall quality with MOS. We must embark on a new road and develop different evaluation protocols better suited to the analysis of modern speech synthesis technologies.
