DeepCNN: Spectro-temporal feature representation for speech emotion recognition

CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY (2023)

Journal

CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY

Volume 8, Issue 2, Pages 401-417

Publisher

WILEY

DOI: 10.1049/cit2.12233

Keywords

decision making; deep learning

Automated Summary

This article proposes a deep learning-based approach called DeepCNN for speech emotion recognition. By parallelising convolutional neural networks and a convolution layer-based transformer, this method allows for effective feature representation with lower computational cost. Experimental results on the EMO-BD and IEMOCAP datasets demonstrate the superior performance of DeepCNN in emotion recognition accuracy.

Abstract

Speech emotion recognition (SER) is an important research problem in human-computer interaction systems. The representation and extraction of features are significant challenges in SER systems. Despite their promising results, recent studies generally do not leverage progressive fusion techniques for effective feature representation and for increasing receptive fields. To mitigate this problem, this article proposes DeepCNN, which fuses spectral and temporal features of emotional speech by parallelising convolutional neural networks (CNNs) and a convolution layer-based transformer. Two parallel CNNs extract spectral (2D-CNN) and temporal (1D-CNN) feature representations. A 2D-convolution layer-based transformer module extracts spectro-temporal features and concatenates them with the features from the parallel CNNs. The learnt low-level concatenated features are then passed to a deep framework of convolutional blocks, which retrieves high-level feature representations and subsequently categorises the emotional states using an attention gated recurrent unit and a classification layer. This fusion technique results in a deeper hierarchical feature representation at a lower computational cost while simultaneously expanding the filter depth and reducing the feature map. The Berlin Database of Emotional Speech (EMO-BD) and Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets are used in experiments to recognise distinct speech emotions. With efficient spectral and temporal feature representation, the proposed SER model achieves 94.2% accuracy for different emotions on the EMO-BD dataset and 81.1% accuracy on the IEMOCAP dataset. DeepCNN outperforms the baseline SER systems in terms of emotion recognition accuracy on both datasets.
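The abstract's remark about expanding the filter depth while reducing the feature map can be illustrated with simple shape arithmetic. The sketch below is not the authors' architecture: the input size (128 mel bands by 256 frames), the number of blocks, the channel counts, and the names `conv_out`, `Block`, and `trace_shapes` are all hypothetical, chosen only to show how a stack of convolution-plus-pooling blocks trades spatial size for channel depth. It uses the standard output-length formula floor((n + 2p - k) / s) + 1.

```python
from dataclasses import dataclass


def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Standard conv/pool output length: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


@dataclass
class Block:
    """Hypothetical convolutional block: one conv layer followed by max pooling."""
    out_channels: int
    kernel: int = 3   # assumed 3x3 convolution
    padding: int = 1  # assumed "same" padding for a 3x3 kernel
    pool: int = 2     # assumed 2x2 pooling with stride 2


def trace_shapes(channels: int, height: int, width: int, blocks: list) -> list:
    """Return the (channels, height, width) shape after each block."""
    shapes = [(channels, height, width)]
    for b in blocks:
        # Convolution with stride 1 and the block's padding.
        height = conv_out(height, b.kernel, 1, b.padding)
        width = conv_out(width, b.kernel, 1, b.padding)
        # Pooling halves each spatial dimension.
        height = conv_out(height, b.pool, b.pool)
        width = conv_out(width, b.pool, b.pool)
        channels = b.out_channels
        shapes.append((channels, height, width))
    return shapes


# Assumed input: a single-channel spectrogram, 128 mel bands x 256 frames.
shapes = trace_shapes(1, 128, 256, [Block(16), Block(32), Block(64)])
for shape in shapes:
    print(shape)
```

Under these assumed settings the channel depth grows from 1 to 64 while the feature map shrinks from 128 x 256 to 16 x 32, which is the general trade-off the abstract describes.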
