CFDRN: A Cognition-Inspired Feature Decomposition and Recombination Network for Dysarthric Speech Recognition

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TASLP.2023.3319276

Keywords

Adaptation; automatic speech recognition; dysarthria

This article proposes a cognition-inspired feature decomposition and recombination network (CFDRN) for dysarthric ASR. CFDRN uses slow- and rapid-varying temporal processors to decompose features into stable and changeable features, and a gated fusion module for selective recombination. The study also utilizes an adaptation approach based on unsupervised pre-training techniques. Experimental results show significant word error rate reductions compared to baseline methods on the TORGO and UASpeech datasets.
As an essential technology in human-computer interaction, automatic speech recognition (ASR) makes life more convenient for healthy people; however, people with speech disorders, who truly need support from such a technology, have difficulty using ASR. Disordered ASR is challenging because of the large variability in disordered speech. Humans tend to process different spectro-temporal features of speech separately in the left and right hemispheres of the brain, and they perceive speech, especially disordered speech, significantly better than machines do. Inspired by human speech processing, this article proposes a cognition-inspired feature decomposition and recombination network (CFDRN) for dysarthric ASR. In the CFDRN, slow- and rapid-varying temporal processors decompose features into stable and changeable features, respectively, and a gated fusion module selectively recombines the decomposed features. Moreover, this study uses an adaptation approach based on unsupervised pre-training techniques to alleviate data scarcity in dysarthric ASR: CFDRNs are added to the layers of the pre-trained model, and the entire model is adapted from normal speech to disordered speech. The effectiveness of the proposed method was validated on the widely used TORGO and UASpeech dysarthria datasets under three popular unsupervised pre-training techniques: wav2vec 2.0, HuBERT, and data2vec. Compared to the baseline methods, the proposed CFDRN with the three pre-training techniques achieved word error rate reductions of 13.73% to 16.23% on TORGO and 4.50% to 13.20% on UASpeech. Furthermore, this study clarifies several major factors affecting dysarthric ASR performance.
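
The decompose-then-recombine idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the abstract does not specify how the temporal processors or the gate are built, so the moving-average "slow" processor, the residual "rapid" processor, the single linear gate, and all names here (`slow_component`, `gated_fusion`, `W`, `b`, `window`) are assumptions chosen only to show the data flow.

```python
import numpy as np


def slow_component(x, window=5):
    """Assumed stand-in for a slow-varying processor: a moving average
    over the time axis. x has shape (T, D); window must be odd."""
    pad = window // 2
    # Edge padding keeps the output length equal to T.
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    cols = [np.convolve(padded[:, d], kernel, mode="valid")
            for d in range(x.shape[1])]
    return np.stack(cols, axis=1)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_fusion(stable, changeable, W, b):
    """Assumed gate: a sigmoid over a linear map of both streams,
    used to mix them element-wise. W has shape (2D, D), b has shape (D,)."""
    gate = sigmoid(np.concatenate([stable, changeable], axis=1) @ W + b)
    return gate * stable + (1.0 - gate) * changeable


rng = np.random.default_rng(0)
T, D = 20, 8                        # hypothetical frame count and feature size
x = rng.standard_normal((T, D))     # stand-in for one layer's features

stable = slow_component(x)          # slowly varying part
changeable = x - stable             # rapidly varying residual
W = rng.standard_normal((2 * D, D)) * 0.1
b = np.zeros(D)
fused = gated_fusion(stable, changeable, W, b)
```

In this sketch the gate values lie between 0 and 1, so each output element is a convex mix of the stable and changeable streams; in a trained model the gate parameters would be learned rather than drawn at random.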
