☆ 3.8 Proceedings Paper

REPLAY AND SYNTHETIC SPEECH DETECTION WITH RES2NET ARCHITECTURE

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) (2021)

Journal

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)

Volume -, Issue -, Pages 6354-6358

Publisher

IEEE

DOI: 10.1109/ICASSP39728.2021.9413828

Keywords

ASV anti-spoofing; replay detection; synthetic speech detection; Res2Net; multi-scale feature

Funding

HKSAR Government's Research Grants Council General Research Fund [14208718]

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

By introducing the Res2Net model structure and multi-scale mechanism, the generalizability of the anti-spoofing countermeasure has been improved, and the model size has been reduced. Experimental results show that the performance of Res2Net in the ASVspoof 2019 corpus is significantly better than other models, especially excelling in physical access and logical access scenarios.

Existing approaches for replay and synthetic speech detection still lack generalizability to unseen spoofing attacks. This work proposes to leverage a novel model structure, so-called Res2Net, to improve the anti-spoofing countermeasure's generalizability. Res2Net mainly modifies the ResNet block to enable multiple feature scales. Specifically, it splits the feature maps within one block into multiple channel groups and designs a residual-like connection across different channel groups. Such connection increases the possible receptive fields, resulting in multiple feature scales. This multiple scaling mechanism significantly improves the countermeasure's generalizability to unseen spoofing attacks. It also decreases the model size compared to ResNet-based models. Experimental results show that the Res2Net model consistently outperforms ResNet34 and ResNet50 by a large margin in both physical access (PA) and logical access (LA) of the ASVspoof 2019 corpus. Moreover, integration with the squeeze-and-excitation (SE) block can further enhance performance. For feature engineering, we investigate the generalizability of Res2Net combined with different acoustic features, and observe that the constant-Q transform (CQT) achieves the most promising performance in both PA and LA scenarios. Our best single system outperforms other state-of-the-art single systems in both PA and LA of the ASVspoof 2019 corpus.

REPLAY AND SYNTHETIC SPEECH DETECTION WITH RES2NET ARCHITECTURE

Journal

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)

Publisher

IEEE

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

REPLAY AND SYNTHETIC SPEECH DETECTION WITH RES2NET ARCHITECTURE

Journal

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)

Publisher

IEEE

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper