Journal
NEURAL NETWORKS
Volume 141, Issue -, Pages 52-60
Publisher
PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.neunet.2021.03.013
Keywords
Speech emotion recognition; Parallel 2D CNN; Connectionist temporal classification; Residual dilated network; Self-attention
Funding
- National Natural Science Foundation of China [62071330, 61702370]
- National Science Fund for Distinguished Young Scholars, China [61425017]
- Key Program of the National Natural Science Foundation of China [61831022]
- Natural Science Foundation of Tianjin, China [18JCZDJC36300]
- Technology Plan of Tianjin, China [18ZXRHSY00100]
This study proposes an efficient deep neural network architecture for speech emotion recognition, integrating parallel convolutional layers (PCN), Squeeze-and-Excitation, and CTC loss to improve SER performance. Experimental results on the IEMOCAP and FAU-AEC corpora demonstrate the effectiveness of the proposed method.
A challenging issue in the field of the automatic recognition of emotion from speech is the efficient modelling of long temporal contexts. Moreover, when incorporating long-term temporal dependencies between features, recurrent neural network (RNN) architectures are typically employed by default. In this work, we aim to present an efficient deep neural network architecture incorporating Connectionist Temporal Classification (CTC) loss for discrete speech emotion recognition (SER). Moreover, we also demonstrate the existence of further opportunities to improve SER performance by exploiting the properties of convolutional neural networks (CNNs) when modelling contextual information. Our proposed model uses parallel convolutional layers (PCN) integrated with a Squeeze-and-Excitation Network (SENet), a system herein denoted as PCNSE, to extract relationships from 3D spectrograms across timesteps and frequencies; here, we use the log-Mel spectrogram with deltas and delta-deltas as input. In addition, a self-attention Residual Dilated Network (SADRN) with CTC is employed as a classification block for SER. To the best of the authors' knowledge, this is the first time that such a hybrid architecture has been employed for discrete SER. We further demonstrate the effectiveness of our proposed approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and the FAU-Aibo Emotion Corpus (FAU-AEC). Our experimental results reveal that the proposed method is well-suited to the task of discrete SER, achieving a weighted accuracy (WA) of 73.1% and an unweighted accuracy (UA) of 66.3% on IEMOCAP, as well as a UA of 41.1% on the FAU-AEC dataset. (C) 2021 Elsevier Ltd. All rights reserved.
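The Squeeze-and-Excitation recalibration that the PCNSE block relies on can be illustrated in isolation. The following is a minimal NumPy sketch of a generic SE gate (per Hu et al.'s original formulation), not the authors' exact PCNSE implementation; the weight shapes, reduction ratio, and function name are illustrative assumptions.

```python
import numpy as np

def squeeze_excite(feature_map, w1, w2):
    """Generic Squeeze-and-Excitation channel recalibration (sketch).

    feature_map: array of shape (C, H, W), e.g. CNN features over a
                 (time, frequency) spectrogram patch.
    w1: (C // r, C) bottleneck weights; w2: (C, C // r) expansion weights.
    """
    # Squeeze: global average pooling over the spatial dims -> (C,)
    z = feature_map.mean(axis=(1, 2))
    # Excitation: FC -> ReLU -> FC -> sigmoid, yielding per-channel gates
    s = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))
    # Scale: reweight each channel map by its learned gate in (0, 1)
    return feature_map * s[:, None, None]
```

Because each gate lies in (0, 1), the block can only attenuate channels, letting the network emphasise the frequency-time feature maps most informative for emotion.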