Article

The limits of the Mean Opinion Score for speech synthesis evaluation

COMPUTER SPEECH AND LANGUAGE (2024)

Journal

COMPUTER SPEECH AND LANGUAGE

Volume 84, Issue -, Pages -

Publisher

ACADEMIC PRESS LTD- ELSEVIER SCIENCE LTD

DOI: 10.1016/j.csl.2023.101577

Keywords

Speech synthesis evaluation; Absolute Category Rating; Mean Opinion Score; Blizzard Challenge

Category

Computer Science, Artificial Intelligence

Summary

The release of WaveNet and Tacotron has significantly improved the quality of synthetic speech. However, the prevailing evaluation protocol, Absolute Category Rating (ACR), and the Mean Opinion Score (MOS) it produces remain controversial as a measure of this quality. To determine the reliability of MOS, a series of experiments examined the stability of MOS over time, the influence of lower-quality systems on MOS, the influence of modern technologies on the scores of past systems, and the evolution of MOS for modern technologies in isolation. The results suggest the need for new evaluation protocols better suited to analyzing modern speech synthesis technologies.

Abstract

The release of WaveNet and Tacotron has transformed the speech synthesis landscape. Thanks to these innovations, the quality of synthetic speech has reached unprecedented levels. However, to measure this leap in quality, an overwhelming majority of studies still rely on the Absolute Category Rating (ACR) protocol and compare systems using its output, the Mean Opinion Score (MOS). This protocol is not without controversy, and because current state-of-the-art synthesis systems produce outputs remarkably close to human speech, it is vital to determine how reliable this score is.

To do so, we conducted a series of four experiments replicating and following the 2013 edition of the Blizzard Challenge. With these experiments, we asked four questions about the MOS: How stable is the MOS of a system across time? How do the scores of lower-quality systems influence the MOS of higher-quality systems? How does the introduction of modern technologies influence the scores of past systems? How does the MOS of modern technologies evolve in isolation?

The results of our experiments are manifold. First, we verify the superiority of modern technologies over historical synthesis. Second, we show that despite its origin as an absolute category rating, MOS is a relative score. While only minimal variations are observed during the replication of the 2013-EH2 task, these variations can still lead to different conclusions for the intermediate systems. Our experiments also illustrate the sensitivity of MOS to the presence or absence of lower and higher anchors. Overall, our experiments suggest that we may have reached the end of a cul-de-sac by only evaluating the overall quality with MOS. We must embark on a new road and develop different evaluation protocols better suited to the analysis of modern speech synthesis technologies.
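
For readers unfamiliar with the score under discussion, the following is a minimal sketch of how a MOS is commonly derived from ACR ratings: the ratings a system receives are simply averaged. The sketch assumes the usual five-point ACR scale (1 to 5) and a normal-approximation 95% confidence interval; neither detail is stated in the abstract. The function name `mean_opinion_score` and the ratings are hypothetical, invented for illustration, and are not data from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes integer ACR ratings on a 1-5 scale (an assumption, not taken
    from the paper) and uses a normal approximation for the interval.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")
    mos = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings, invented purely for illustration.
ratings_by_system = {
    "system_A": [4, 5, 4, 4, 5, 3, 4, 5],
    "system_B": [3, 3, 4, 2, 3, 4, 3, 3],
}

for name, ratings in ratings_by_system.items():
    mos, hw = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {hw:.2f}")
```

Because each rating is given by a listener who has heard the other systems in the same test, an average computed this way can shift when the set of systems changes, which is the kind of relativity the abstract describes.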
