Article

Speech emotion recognition using deep 1D & 2D CNN LSTM networks

Journal

BIOMEDICAL SIGNAL PROCESSING AND CONTROL
Volume 47, Pages 312-323

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.bspc.2018.08.035

Keywords

Speech emotion recognition; CNN LSTM network; Raw audio clips; Log-mel spectrograms

Funding

  1. National Natural Science Foundation of China [61603013]
  2. Fundamental Research Funds for the Central Universities [YWF-18-BJ-Y-181]

Abstract

We aimed at learning deep emotion features for recognizing emotion from speech. Two convolutional neural network and long short-term memory (CNN LSTM) networks, one 1D CNN LSTM network and one 2D CNN LSTM network, were constructed to learn local and global emotion-related features from raw speech and log-mel spectrograms, respectively. The two networks share a similar architecture, each consisting of four local feature learning blocks (LFLBs) and one long short-term memory (LSTM) layer. Each LFLB, which mainly contains one convolutional layer and one max-pooling layer, is built to learn local correlations and extract hierarchical correlations. The LSTM layer is adopted to learn long-term dependencies from the learned local features. The designed networks, combinations of the convolutional neural network (CNN) and LSTM, exploit the strengths of both models while mitigating their shortcomings, and are evaluated on two benchmark databases. The experimental results show that the designed networks achieve excellent performance on the task of recognizing speech emotion; in particular, the 2D CNN LSTM network outperforms the traditional approaches, Deep Belief Network (DBN) and CNN, on the selected databases. The 2D CNN LSTM network achieves recognition accuracies of 95.33% and 95.89% on the Berlin EmoDB in speaker-dependent and speaker-independent experiments, respectively, which compare favourably with the accuracies of 91.6% and 92.9% obtained by traditional approaches; it also yields recognition accuracies of 89.16% and 52.14% on the IEMOCAP database in speaker-dependent and speaker-independent experiments, which are much higher than the accuracies of 73.78% and 40.02% obtained by DBN and CNN. (C) 2018 Elsevier Ltd. All rights reserved.
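
For readers who want a concrete picture of the block structure described in the abstract, the following is a minimal Keras sketch of the 1D CNN LSTM network. It is an illustration under stated assumptions, not the authors' implementation: the filter counts, kernel and pooling sizes, LSTM width, and the batch-normalization and ELU layers inside each LFLB are all assumptions chosen for readability, since the abstract specifies only the four-LFLB-plus-LSTM layout.

    from tensorflow.keras import layers, models

    def lflb(x, filters, kernel_size=3, pool_size=4):
        # Local feature learning block (LFLB): one convolutional layer plus
        # one max-pooling layer, per the abstract; BatchNorm and ELU are
        # added here as assumptions, not taken from the abstract.
        x = layers.Conv1D(filters, kernel_size, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("elu")(x)
        return layers.MaxPooling1D(pool_size)(x)

    def build_1d_cnn_lstm(n_samples, n_classes):
        # Four LFLBs learn local, hierarchical correlations from the raw
        # waveform; one LSTM layer learns long-term dependencies from those
        # local features; a softmax layer outputs the emotion class.
        inp = layers.Input(shape=(n_samples, 1))  # raw audio clip
        x = lflb(inp, 64)
        x = lflb(x, 64)
        x = lflb(x, 128)
        x = lflb(x, 128)
        x = layers.LSTM(256)(x)
        out = layers.Dense(n_classes, activation="softmax")(x)
        return models.Model(inp, out)

    # Hypothetical usage: 3-second clips at 16 kHz, 7 EmoDB emotion classes.
    model = build_1d_cnn_lstm(n_samples=48000, n_classes=7)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])

The 2D variant follows the same pattern with Conv2D/MaxPooling2D blocks operating on a log-mel spectrogram input (computable, for example, with librosa.feature.melspectrogram followed by librosa.power_to_db), with the feature maps reshaped into a time-major sequence before the LSTM layer.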
