Article

Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 16, Issue 8, Pages 2203-2213

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TMM.2014.2360798

Keywords

Affective-salient discriminative feature analysis; convolutional neural networks; feature learning; speech emotion recognition

Funding

  1. National Natural Science Foundation of China [61272211, 61170126]
  2. Six Talent Peaks Foundation of Jiangsu Province [DZXX-027]

As an essential way of understanding human emotional behavior, speech emotion recognition (SER) has attracted a great deal of attention in human-centered signal processing. Accuracy in SER heavily depends on finding good affect-related, discriminative features. In this paper, we propose to learn affect-salient features for SER using convolutional neural networks (CNN). The training of the CNN involves two stages. In the first stage, unlabeled samples are used to learn local invariant features (LIF) using a variant of the sparse auto-encoder (SAE) with reconstruction penalization. In the second stage, the LIF are used as the input to a feature extractor, salient discriminative feature analysis (SDFA), to learn affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality, and discrimination for SER. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and language variation, and environment distortion) and outperforms several well-established SER features.
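To make the first training stage more concrete, the following is a minimal sketch of a textbook sparse auto-encoder objective: squared reconstruction error plus a KL-divergence sparsity penalty on the mean hidden activations. It is not the authors' variant. The abstract does not give the exact form of their reconstruction penalization, so the loss below, the function name `sparse_autoencoder_loss`, the parameters `rho` and `beta`, and the random input patches are all illustrative assumptions. The SDFA objective of the second stage is not sketched because its form is not stated here.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + np.exp(-z))


def sparse_autoencoder_loss(X, W_enc, b_enc, W_dec, b_dec, rho=0.05, beta=3.0):
    """Generic sparse auto-encoder loss (assumed form, not the paper's variant).

    X     : (n_samples, n_inputs) unlabeled input patches
    rho   : target mean activation per hidden unit (hypothetical value)
    beta  : weight of the sparsity penalty (hypothetical value)
    Returns the scalar loss and the hidden activations.
    """
    H = sigmoid(X @ W_enc + b_enc)      # hidden activations, the learned features
    X_hat = H @ W_dec + b_dec           # linear reconstruction of the input
    recon = 0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=1))

    # KL divergence between the target sparsity and the observed mean activation.
    rho_hat = np.clip(H.mean(axis=0), 1e-8, 1.0 - 1e-8)
    kl = np.sum(
        rho * np.log(rho / rho_hat)
        + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    )
    return recon + beta * kl, H


# Usage on random stand-in data (real input would be unlabeled spectrogram patches).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 20))
W_enc = 0.1 * rng.normal(size=(20, 8))
W_dec = 0.1 * rng.normal(size=(8, 20))
loss, H = sparse_autoencoder_loss(X, W_enc, np.zeros(8), W_dec, np.zeros(20))
```

In a two-stage scheme like the one described, hidden activations such as `H` would play the role of the local features passed on to the second, supervised stage.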
