4.6 Article

Emotion classification from speech signal based on empirical mode decomposition and non-linear features

Journal

COMPLEX & INTELLIGENT SYSTEMS
Volume 7, Issue 4, Pages 1919-1934

Publisher

SPRINGER HEIDELBERG
DOI: 10.1007/s40747-021-00295-z

Keywords

Speech signal; Emotion perception; Entropy measures; Linear discriminant analysis; Empirical mode decomposition

Funding

  1. Scientific Research Grant of Shantou University, China [NTF17016]


This paper investigates the recognition of seven emotional states from speech signals using entropy-based feature extraction and various classifiers, reaching a peak balanced accuracy of 93.3% with a Linear Discriminant Analysis classifier on the Toronto Emotional Speech dataset.
Emotion recognition from speech is a widely researched topic in the design of Human-Computer Interface (HCI) systems, since it offers insight into the mental state of the user; an HCI often needs to identify the speaker's emotional condition as cognitive feedback. This paper investigates the recognition of seven emotional states from speech signals: sad, angry, disgust, happy, surprise, pleasant, and neutral. The proposed method quantifies the non-linearity of the signal through a randomness measure, known as the entropy feature, for the detection of emotions. The speech signals are first decomposed into Intrinsic Mode Functions (IMFs) by Empirical Mode Decomposition, and the IMFs are grouped into dominant frequency bands: high frequency, mid frequency, and base frequency. Entropy measures are computed directly from the high-frequency band in the IMF domain, whereas for the mid- and base-frequency bands the corresponding IMFs are first averaged and the entropy measures are then computed. The entropy measures of all the emotional signals are assembled into a feature vector that captures their randomness. This feature vector is used to train several state-of-the-art classifiers: Linear Discriminant Analysis (LDA), Naive Bayes, K-Nearest Neighbor, Support Vector Machine, Random Forest, and Gradient Boosting Machine. Tenfold cross-validation on the publicly available Toronto Emotional Speech dataset shows that the LDA classifier achieves a peak balanced accuracy of 93.3%, an F1 score of 87.9%, and an area-under-the-curve value of 0.995 in recognizing emotions from the speech of native English speakers.
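As an illustration of the pipeline described in the abstract, the sketch below extracts band-wise entropy features via Empirical Mode Decomposition and scores an LDA classifier with tenfold cross-validation. This is a minimal sketch, not the authors' implementation: it assumes the PyEMD package (pip install EMD-signal) and scikit-learn, a single Shannon entropy stands in for the paper's set of entropy measures, and the exact assignment of IMFs to the high/mid/base bands is an assumption.

    import numpy as np
    from PyEMD import EMD  # pip install EMD-signal (assumed EMD implementation)
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    def shannon_entropy(x, bins=64):
        """Shannon entropy of the amplitude histogram; a stand-in for the
        paper's entropy measures, which are not specified in the abstract."""
        counts, _ = np.histogram(x, bins=bins)
        p = counts[counts > 0] / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    def entropy_features(signal):
        """One entropy feature per band: high (first IMF), mid and base
        (averaged IMFs). The grouping below is an assumption; it expects
        the decomposition to yield at least three IMFs."""
        imfs = EMD().emd(np.asarray(signal, dtype=float))
        split = (len(imfs) + 1) // 2
        high = imfs[0]                    # highest-frequency oscillations
        mid = imfs[1:split].mean(axis=0)  # averaged mid-band IMFs
        base = imfs[split:].mean(axis=0)  # averaged base-band IMFs
        return np.array([shannon_entropy(b) for b in (high, mid, base)])

    def evaluate(X, y):
        """Tenfold CV of LDA; X holds one row of entropy features per
        utterance and y the emotion labels (placeholders for a corpus)."""
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        scores = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                                 cv=cv, scoring="balanced_accuracy")
        return scores.mean()

Balanced accuracy is used as the scoring metric to match the figure reported in the abstract; any of the other listed classifiers can be swapped in for LinearDiscriminantAnalysis without changing the cross-validation loop.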


