Article

Audio-visual emotion fusion (AVEF): A deep efficient weighted approach

Journal

INFORMATION FUSION
Volume 46, Issue -, Pages 184-192

Publisher

ELSEVIER
DOI: 10.1016/j.inffus.2018.06.003

Keywords

Multi-modality emotion recognition; Deep learning; Transfer learning

Funding

  1. National Natural Science Foundation of China [61672246, 61272068, 61572220]
  2. Fundamental Research Funds for the Central Universities [HUST:2016YXMS018]
  3. Slovenian Research Agency [P2-0246]
  4. Hubei Provincial Key Project [2017CFA051]
  5. Applied Basic Research Program through Wuhan Science and Technology Bureau [2017010201010118]

Abstract

Multi-modal emotion recognition lacks an explicit mapping between emotion states and audio/visual features, so extracting effective emotion information from audio-visual data remains a challenging problem. In addition, noise and data redundancy are not modeled well, so emotion recognition models often suffer from low efficiency. Deep neural networks (DNNs) excel at feature extraction and highly non-linear feature fusion, and cross-modal noise modeling has great potential for tackling data pollution and data redundancy. Motivated by these observations, this paper proposes a deep weighted fusion method for audio-visual emotion recognition. First, we conduct cross-modal noise modeling on the audio and video data, which eliminates most of the data pollution in the audio channel and the data redundancy in the visual channel: the noise modeling is implemented by voice activity detection (VAD), and the redundancy in the visual data is removed by aligning the speech regions of the audio and visual streams. Then, we extract audio emotion features and facial expression features with two feature extractors: the audio extractor, audio-net, is a 2D CNN that accepts image-based Mel-spectrograms as input, while the facial expression extractor, visual-net, is a 3D CNN fed with facial expression image sequences. To train the two convolutional neural networks efficiently on a small data set, we adopt a transfer learning strategy. Next, we employ a deep belief network (DBN) for the highly non-linear fusion of the multi-modal emotion features, training the feature extractors and the fusion network jointly. Finally, the emotion class is obtained by a support vector machine applied to the output of the fusion network. By combining cross-modal feature fusion, denoising, and redundancy removal, the proposed method shows excellent performance on the selected data set.
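To make the two-branch pipeline concrete, the sketch below shows its shape in PyTorch. It is a minimal illustration only, not the authors' implementation: the class names (AudioNet, VisualNet, FusionNet), layer counts, kernel sizes, and feature dimensions are assumptions, and a plain sigmoid MLP stands in for the paper's DBN fusion network, whose layer-wise pretraining is omitted here.

# Minimal sketch of the audio-visual fusion pipeline described in the abstract.
# All architectural choices below are illustrative assumptions, not the
# authors' exact configuration.
import torch
import torch.nn as nn

class AudioNet(nn.Module):
    """2D CNN over image-based Mel-spectrograms (hypothetical shapes)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Linear(64 * 4 * 4, feat_dim)

    def forward(self, x):  # x: (B, 1, H, W) Mel-spectrogram image
        return self.fc(self.conv(x).flatten(1))

class VisualNet(nn.Module):
    """3D CNN over facial-expression frame sequences (hypothetical shapes)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((2, 4, 4)),
        )
        self.fc = nn.Linear(64 * 2 * 4 * 4, feat_dim)

    def forward(self, x):  # x: (B, 3, T, H, W) aligned face-frame sequence
        return self.fc(self.conv(x).flatten(1))

class FusionNet(nn.Module):
    """Highly non-linear fusion of the concatenated modality features.
    A sigmoid MLP stands in for the paper's DBN (no pretraining here)."""
    def __init__(self, feat_dim=256, fused_dim=128):
        super().__init__()
        self.audio_net = AudioNet(feat_dim)
        self.visual_net = VisualNet(feat_dim)
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.Sigmoid(),
            nn.Linear(512, fused_dim), nn.Sigmoid(),
        )

    def forward(self, mel, frames):
        a = self.audio_net(mel)          # audio emotion features
        v = self.visual_net(frames)      # facial expression features
        return self.fuse(torch.cat([a, v], dim=1))

# Per the abstract, the fused representation is then classified with an
# SVM, e.g. sklearn.svm.SVC(kernel="rbf").fit(fused_features, labels).

In the paper the two extractors are initialized by transfer learning and trained jointly with the fusion network; this sketch only mirrors the data flow from the two modalities through fusion to the final SVM classifier.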

