Article

Speech emotion recognition using hybrid spectral-prosodic features of speech signal/glottal waveform, metaheuristic-based dimensionality reduction, and Gaussian elliptical basis function network classifier

Journal

APPLIED ACOUSTICS
Volume 166, Issue -, Pages -

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.apacoust.2020.107360

Keywords

Speech emotion recognition; Quantum-behaved particle swarm optimization; Gaussian elliptical basis function

In this paper, a hybrid system consisting of three stages of feature extraction, dimensionality reduction, and feature classification is proposed for speech emotion recognition (SER). At the feature extraction stage, an information-rich spectral-prosodic hybrid feature vector, comprising the perceptual-spectral features of mel-frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLPC), and perceptual minimum variance distortionless response (PMVDR) coefficients, along with the prosodic feature of pitch (F0), is extracted for each frame. This feature vector is extracted from both the speech signal and its glottal waveform. The first- and second-order derivatives are then appended, yielding a high-dimensional hybrid feature vector. At the next stage, the dimensionality of this feature vector is reduced using a newly proposed quantum-behaved particle swarm optimization (QPSO)-based approach. Specifically, a new QPSO variant (termed pQPSO) is presented that uses a truncated Laplace distribution (TLD) to generate new particles, so that all candidate solutions remain within the valid range of the problem (unlike in standard QPSO). The contraction-expansion (CE) factor of the proposed pQPSO is also selected adaptively. Using this algorithm, an optimal discriminative dimensionality-reduction (projection) matrix is estimated, with emotion classification accuracy serving as the class-discriminative criterion. At the final stage, the reduced feature vectors are fed into a Gaussian elliptical basis function (GEBF) neural network classifier to recognize the speech emotion. To accelerate the training of the GEBF classifier, a fast scaled conjugate gradient (SCG) algorithm is employed, which does not require tuning a learning rate.
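The key idea behind pQPSO's particle generation, as described above, is sampling from a Laplace distribution truncated to the problem's valid range, so no particle falls outside the search bounds. The following is a minimal illustrative sketch of truncated-Laplace sampling via inverse-CDF sampling; the function name, parameters, and the surrounding QPSO update are assumptions, not the paper's exact formulation.

```python
import math
import random


def sample_truncated_laplace(mu, b, lo, hi):
    """Draw one sample from Laplace(mu, b) truncated to [lo, hi].

    Inverse-CDF sampling: draw u uniformly between CDF(lo) and CDF(hi),
    then map it back through the inverse CDF, so every sample is
    guaranteed to lie inside the valid range (the property the paper
    attributes to pQPSO, in contrast to standard QPSO).
    """
    def cdf(x):
        # Piecewise Laplace CDF.
        if x < mu:
            return 0.5 * math.exp((x - mu) / b)
        return 1.0 - 0.5 * math.exp(-(x - mu) / b)

    def inv_cdf(u):
        # Piecewise inverse of the Laplace CDF.
        if u < 0.5:
            return mu + b * math.log(2.0 * u)
        return mu - b * math.log(2.0 * (1.0 - u))

    u = random.uniform(cdf(lo), cdf(hi))
    return inv_cdf(u)
```

In a pQPSO-style update, `mu` would be placed near a particle's local attractor and `b` scaled by the adaptive contraction-expansion factor, with `lo`/`hi` taken from the variable's feasible bounds.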
Finally, the proposed method is evaluated on three standard emotional speech databases: the Berlin Database of Emotional Speech (EMODB), Surrey Audio-Visual Expressed Emotion (SAVEE), and Interactive Emotional Dyadic Motion Capture (IEMOCAP). The experimental results showed that the proposed method was more accurate than state-of-the-art methods at detecting speech emotions. (C) 2020 Elsevier Ltd. All rights reserved.
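The GEBF classifier stage described in the abstract assigns each hidden unit an axis-aligned elliptical Gaussian response, i.e. a separate width per feature dimension rather than a single spherical width. A minimal forward-pass sketch is shown below; the function name, array shapes, and the plain linear output layer are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np


def gebf_forward(X, centers, widths, W, b):
    """Forward pass of a Gaussian elliptical basis function network.

    X:       (n, d) input feature vectors (after dimensionality reduction)
    centers: (h, d) basis-function centers
    widths:  (h, d) per-dimension widths -- the "elliptical" part:
             each unit stretches differently along each axis
    W, b:    (h, k) output weights and (k,) biases of a linear readout
    Returns: (n, k) class scores.
    """
    # Per-dimension normalized distance to every center: (n, h, d).
    diff = (X[:, None, :] - centers[None, :, :]) / widths[None, :, :]
    # Elliptical Gaussian activation of each hidden unit: (n, h).
    act = np.exp(-0.5 * np.sum(diff ** 2, axis=2))
    # Linear output layer: (n, k).
    return act @ W + b
```

In training, the centers, widths, and output weights would be fit with a scaled conjugate gradient routine, which (as the abstract notes) avoids hand-tuning a learning rate.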
