4.7 Article

MSQAT: A multi-dimension non-intrusive speech quality assessment transformer utilizing self-supervised representations

Journal

APPLIED ACOUSTICS
Volume 212

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.apacoust.2023.109584

Keywords

Speech quality assessment; Non-intrusive; Self-supervised learning; Transformer


This research proposes a framework called MSQAT, which comprises three modules (ASTB, TAB, and RSTB) that strengthen the interactions between local and global speech regions. Additionally, a two-branch structure is designed for better speech quality evaluation. Experimental results demonstrate that MSQAT achieves state-of-the-art performance on three standard datasets and that a pure attention model can match or surpass the performance of CNN-attention hybrid models.
Convolutional neural networks (CNNs) have been widely used as the main building block of non-intrusive speech quality assessment (NISQA) methods. A recent trend is to add a self-attention mechanism on top of the CNN to better capture long-term global content. However, it is not clear whether a purely attention-based network is sufficient for good performance in NISQA. To this end, a framework named Multi-dimension non-intrusive Speech Quality Assessment Transformer (MSQAT) is proposed. To strengthen the interactions between local and global speech regions, we propose the Audio Spectrogram Transformer Block (ASTB), the Transposed Attention Block (TAB), and the Residual Swin Transformer Block (RSTB); these modules apply attention along the spatial and channel dimensions. Additionally, speech quality varies not only across frames but also across frequencies, so a two-branch structure is designed to evaluate speech quality by weighting each patch's score. Experimental results demonstrate that the proposed MSQAT achieves state-of-the-art performance on three standard datasets (NISQA Corpus, Tencent Corpus, and PSTN Corpus) and indicate that a pure attention model can match or surpass the performance of CNN-attention hybrid models.
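The abstract only outlines these mechanisms, so the PyTorch sketch below is a rough, non-authoritative illustration rather than the authors' implementation: a block that applies self-attention across the channel dimension (in the spirit of the Transposed Attention Block) and a two-branch head that weights each patch's quality score before pooling to a single MOS estimate. All module names, tensor shapes, head counts, and layer choices here are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of two ideas described in the abstract:
#   1. Attention applied across the channel dimension instead of the spatial/token
#      dimension (channel-wise "transposed" attention).
#   2. A two-branch head: one branch scores each patch, the other weights it.
# Shapes and module names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransposedAttentionBlock(nn.Module):
    """Self-attention over channels: attention map is (d_head x d_head), not (n x n)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim)
        b, n, d = x.shape
        qkv = self.qkv(self.norm(x)).reshape(b, n, 3, self.num_heads, d // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)               # each: (b, heads, d_head, n)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)    # (b, heads, d_head, d_head)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(b, n, d)
        return x + self.proj(out)                           # residual connection


class TwoBranchHead(nn.Module):
    """Quality branch scores each patch; weight branch gates its contribution."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))
        self.weight = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) -> weighted average of per-patch scores
        s = self.score(x).squeeze(-1)                        # (batch, num_patches)
        w = self.weight(x).squeeze(-1)                       # (batch, num_patches)
        return (s * w).sum(dim=1) / (w.sum(dim=1) + 1e-8)    # (batch,)


if __name__ == "__main__":
    feats = torch.randn(2, 96, 256)                          # dummy patch embeddings
    feats = TransposedAttentionBlock(256)(feats)
    mos = TwoBranchHead(256)(feats)
    print(mos.shape)                                         # torch.Size([2])
```

The weighted pooling in the head reflects the abstract's observation that quality varies across frames and frequencies: patches judged more informative receive larger weights in the final score. How MSQAT actually combines ASTB, TAB, and RSTB is described in the paper itself.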
