Article

Towards End-to-End Synthetic Speech Detection

Journal

IEEE SIGNAL PROCESSING LETTERS
Volume 28, Pages 1265-1269

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/LSP.2021.3089437

Keywords

Feature extraction; Speech synthesis; Training; Mel frequency cepstral coefficient; Task analysis; Standards; Neural networks; Synthetic speech detection; speech forensics; ASVspoof2019; ASVspoof2015; cross-dataset testing; end-to-end

Funding

  1. 2020-2021 International Scholar Exchange Fellowship (ISEF) Program at the Chey Institute for Advanced Studies, South Korea
  2. National Natural Science Foundation of China [61802284]

Abstract

This paper introduces a new synthetic speech detection approach, the TSSDNet model, which uses an end-to-end DNN and eliminates the need for hand-crafted feature extraction. Experimental results show significant performance improvement, demonstrating the potential and advantages of this model in synthetic speech detection.
The constant Q transform (CQT) has been shown to be one of the most effective speech signal pre-transforms to facilitate synthetic speech detection, followed by either hand-crafted (subband) constant Q cepstral coefficient (CQCC) feature extraction and a back-end binary classifier, or a deep neural network (DNN) directly for further feature extraction and classification. Despite the rich literature on such a pipeline, we show in this paper that the pre-transform and hand-crafted features could simply be replaced by end-to-end DNNs. Specifically, we experimentally verify that by only using standard components, a light-weight neural network could outperform the state-of-the-art methods for the ASVspoof2019 challenge. The proposed model is termed Time-domain Synthetic Speech Detection Net (TSSDNet), having ResNet- or Inception-style structures. We further demonstrate that the proposed models also have attractive generalization capability. Trained on ASVspoof2019, they could achieve promising detection performance when tested on disjoint ASVspoof2015, significantly better than the existing cross-dataset results. This paper reveals the great potential of end-to-end DNNs for synthetic speech detection, without hand-crafted features.
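The abstract describes a time-domain detector built from standard components: 1-D convolutions applied directly to the raw waveform, with ResNet-style residual blocks and a small classification head. The following NumPy sketch illustrates that idea only; it is not the authors' actual TSSDNet, and all layer sizes, weights, and names are illustrative assumptions.

```python
import numpy as np

def conv1d(x, w, b):
    """'Same'-padded 1-D convolution. x: (C_in, T), w: (C_out, C_in, K), b: (C_out,)."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    out = np.empty((c_out, T))
    for t in range(T):
        seg = xp[:, t:t + k]  # (C_in, K) window
        out[:, t] = np.tensordot(w, seg, axes=([1, 2], [0, 1])) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def res_block(x, w1, b1, w2, b2):
    """Two convolutions plus an identity skip connection (ResNet-style)."""
    return relu(x + conv1d(relu(conv1d(x, w1, b1)), w2, b2))

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 1600))            # mono waveform, 0.1 s at 16 kHz
w0 = rng.standard_normal((8, 1, 7)) * 0.1     # front-end conv: 1 -> 8 channels
h = relu(conv1d(x, w0, np.zeros(8)))
w1 = rng.standard_normal((8, 8, 3)) * 0.1
w2 = rng.standard_normal((8, 8, 3)) * 0.1
h = res_block(h, w1, np.zeros(8), w2, np.zeros(8))
logits = h.mean(axis=1) @ rng.standard_normal((8, 2))  # global pooling + linear head
print(logits.shape)                           # two scores: bona fide vs. spoofed
```

Note that no CQT or CQCC front end appears anywhere in the pipeline: the first convolution consumes raw samples, which is the "end-to-end, no hand-crafted features" point the abstract makes.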

