4.6 Article

Voice Activity Detection Optimized by Adaptive Attention Span Transformer

Journal

IEEE Access
Volume 11, Pages 31238-31243

Publisher

IEEE (Institute of Electrical and Electronics Engineers Inc.)
DOI: 10.1109/ACCESS.2023.3262518

Keywords

Transformers; Feature extraction; Mel frequency cepstral coefficient; Filter banks; Voice activity detection; Speech recognition; Spectrogram; adaptive attention span transformer; voice biometrics; voice command recognition

Abstract

Voice Activity Detection (VAD) is a widely used technique for separating vocal regions from audio signals, with applications in voice language coding, noise reduction, and other domains. Various strategies such as ACAM, DCU-10, and Tr-VAD have been proposed to improve VAD performance, but they share common limitations: they handle long audio poorly and are time-consuming. To address these issues, a new method called AAT-VAD is proposed, which integrates an adaptive-width attention learning mechanism into the classic transformer framework. The approach extracts Mel-frequency cepstral coefficients (MFCCs) in the Mel-scale frequency domain, adds a masking function to each transformer attention head, and feeds the features produced by the transformer encoder layers into a classifier. Experimental results indicate that the method achieves an F1-score 12.8% higher than DCU-10 and 0.6% higher than Tr-VAD under different noise interferences. Furthermore, its average detection cost function (DCF) value is only 14.3% of that of DCU-10 and 92.4% of that of Tr-VAD, and the test time of AAT-VAD is only 37.4% of that of Tr-VAD on the same noisy vocal mixed audio.
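The abstract outlines a pipeline of MFCC extraction, a transformer encoder whose attention heads each carry a learnable masking function (an adaptive attention span), and a frame-level classifier. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the abstract does not give the exact masking function, so the mask here follows the common adaptive attention span formulation (a clamped linear ramp over relative frame distance with a learnable span), and all sizes (n_mfcc, d_model, max_span, ramp) are illustrative assumptions.

# Minimal sketch of an adaptive-span attention mask and a frame-level VAD head.
# Not the authors' code; the mask formula and all hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveSpanMask(nn.Module):
    # Soft mask m_z(d) = clamp((ramp + z - d) / ramp, 0, 1), where d is the
    # relative distance between query and key frames and z is a learnable span.
    def __init__(self, max_span, ramp=32, init_ratio=0.5):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp
        self.span_ratio = nn.Parameter(torch.tensor(float(init_ratio)))

    def forward(self, attn_scores):
        # attn_scores: (..., query_len, key_len) raw dot-product scores of one head
        q_len, k_len = attn_scores.shape[-2], attn_scores.shape[-1]
        pos_q = torch.arange(q_len, device=attn_scores.device).unsqueeze(1)
        pos_k = torch.arange(k_len, device=attn_scores.device).unsqueeze(0)
        dist = (pos_q - pos_k).abs().float()                    # (q_len, k_len)
        z = self.span_ratio.clamp(0.0, 1.0) * self.max_span     # learned span width
        mask = ((self.ramp + z - dist) / self.ramp).clamp(0.0, 1.0)
        weights = F.softmax(attn_scores, dim=-1) * mask         # restrict attention
        return weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)


class FrameVAD(nn.Module):
    # MFCC frames -> transformer encoder -> per-frame speech / non-speech logits.
    # A stock encoder stands in here for brevity; in the described method each
    # head would additionally apply an adaptive-span mask like the one above.
    def __init__(self, n_mfcc=40, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, 2)

    def forward(self, mfcc):            # mfcc: (batch, frames, n_mfcc)
        h = self.encoder(self.proj(mfcc))
        return self.classifier(h)       # (batch, frames, 2)

In this reading, a sequence of MFCC frames (e.g. computed with librosa.feature.mfcc and transposed to shape (frames, n_mfcc)) would be batched and passed to FrameVAD, and thresholding the softmax of the per-frame logits yields the speech/non-speech decision; the shorter effective spans learned by the masks are what the abstract credits for the reduced test time on long noisy audio.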

