Article

1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features

Journal

CMC-COMPUTERS MATERIALS & CONTINUA
Volume 67, Issue 3, Pages 4039-4059

Publisher

TECH SCIENCE PRESS
DOI: 10.32604/cmc.2021.015070

Keywords

Affective computing; one-dimensional dilated convolutional neural network; emotion recognition; gated recurrent unit; raw audio clips

Funding

  1. National Research Foundation of Korea - Korean Government through the Ministry of Science and ICT [NRF-2020R1F1A1060659]
  2. 2020 Faculty Research Fund of Sejong University

Abstract

Researchers developed a new speech emotion recognition technique that combines a one-dimensional dilated convolutional neural network with a bidirectional gated recurrent unit to improve recognition accuracy. The model achieved 72.75%, 91.14%, and 78.01% accuracy on the IEMOCAP, EMO-DB, and RAVDESS benchmark datasets, respectively.
Emotion recognition from speech data is an active and emerging area of research that plays an important role in numerous applications, such as robotics, virtual reality, behavior assessment, and emergency call centers. Recently, researchers have developed many techniques in this field to improve accuracy by utilizing several deep learning approaches, but the recognition rate is still not convincing. Our main aim is to develop a new technique that increases the recognition rate at a reasonable computational cost. In this paper, we propose a new technique, a one-dimensional dilated convolutional neural network (1D-DCNN) for speech emotion recognition (SER), that utilizes hierarchical features learning blocks (HFLBs) with a bi-directional gated recurrent unit (BiGRU). We designed a one-dimensional CNN network that enhances the speech signals using spectral analysis and extracts the hidden patterns from the speech signals, which are fed into a stacked one-dimensional dilated network called the HFLBs. Each HFLB contains one dilated convolution layer (DCL), one batch normalization (BN) layer, and one leaky ReLU layer, which together extract the emotional features using a hierarchical correlation strategy. Furthermore, the learned emotional features are fed into a BiGRU to adjust the global weights and to recognize the temporal cues. The final state of the deep BiGRU is passed to a softmax classifier to produce the probabilities of the emotions. The proposed model was evaluated on three benchmark datasets, IEMOCAP, EMO-DB, and RAVDESS, achieving 72.75%, 91.14%, and 78.01% accuracy, respectively.
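
For concreteness, the following is a minimal sketch in PyTorch of the pipeline the abstract describes: a 1D-CNN front-end over raw audio, a stack of HFLBs (dilated Conv1d + batch normalization + leaky ReLU), a BiGRU whose final states feed a softmax classifier. The layer widths, kernel sizes, strides, number of blocks, and dilation schedule are assumptions for illustration; the abstract specifies only the block composition, and the choice of PyTorch is likewise an assumption, not the authors' stated framework.

# Minimal sketch of the 1D-DCNN + HFLB + BiGRU architecture described above.
# All hyperparameters (widths, kernel sizes, dilations, block count) are assumed.
import torch
import torch.nn as nn


class HFLB(nn.Module):
    """Hierarchical features learning block: dilated Conv1d + BN + leaky ReLU."""

    def __init__(self, in_ch: int, out_ch: int, dilation: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=3,
                      dilation=dilation, padding=dilation),  # preserves length
            nn.BatchNorm1d(out_ch),
            nn.LeakyReLU(),
        )

    def forward(self, x):
        return self.block(x)


class SER1DDCNN(nn.Module):
    def __init__(self, n_emotions: int = 4):
        super().__init__()
        # Front-end 1D CNN over the raw waveform (input channel = 1).
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, stride=4, padding=3),
            nn.BatchNorm1d(32),
            nn.LeakyReLU(),
        )
        # Stacked HFLBs; growing dilation widens the temporal receptive field.
        self.hflbs = nn.Sequential(
            HFLB(32, 64, dilation=1),
            HFLB(64, 64, dilation=2),
            HFLB(64, 128, dilation=4),
        )
        self.bigru = nn.GRU(input_size=128, hidden_size=128,
                            batch_first=True, bidirectional=True)
        # Linear layer produces logits; softmax (or CrossEntropyLoss in
        # training) turns them into emotion probabilities.
        self.classifier = nn.Linear(2 * 128, n_emotions)

    def forward(self, wav):                    # wav: (batch, samples)
        x = self.frontend(wav.unsqueeze(1))    # (batch, 32, T)
        x = self.hflbs(x)                      # (batch, 128, T)
        x = x.transpose(1, 2)                  # (batch, T, 128) for the GRU
        _, h = self.bigru(x)                   # h: (2, batch, 128)
        h = torch.cat([h[0], h[1]], dim=1)     # final forward/backward states
        return self.classifier(h)              # logits over the emotion classes

# Usage: logits = SER1DDCNN()(torch.randn(8, 16000))  # eight 1 s clips at 16 kHz

The key design point the abstract emphasizes is that dilation lets the stacked convolutions cover long temporal spans of the signal cheaply, while the BiGRU consumes the resulting feature sequence to capture temporal cues in both directions before classification.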
