Article

Deep CNN with late fusion for real time multimodal emotion recognition

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 240, Issue -, Pages -

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2023.122579

Keywords

CNN; Cross dataset; Ensemble learning; FastText; Multimodal emotion recognition; Stacking

Abstract

Emotion recognition is a fundamental aspect of human communication and plays a crucial role in many domains. This project develops an efficient model for real-time multimodal emotion recognition in videos of human oration (opinion videos), in which speakers express their opinions on various topics. Four separate datasets are used: 20,000 samples for text, 1,440 for audio, 35,889 for images, and 3,879 videos for multimodal analysis. One model is trained for each modality: fastText for text analysis, chosen for its efficiency, robustness to noise, and pre-trained embeddings; a customized 1-D CNN for audio analysis, exploiting translation invariance, hierarchical feature extraction, scalability, and generalization; and a custom 2-D CNN for image analysis, for its ability to capture local features and handle variation in image content. The models are tested and combined on the CMU-MOSEI dataset using both bagging and stacking to find the most effective architecture, and are then used for real-time analysis of speeches. Each model is trained on 80% of its dataset; the remaining 20% is used to test individual and combined accuracies on CMU-MOSEI. The emotions predicted by the final architecture correspond to the six classes in the CMU-MOSEI dataset. This cross-dataset training and testing makes the models robust and efficient for general use, removes reliance on a specific domain or dataset, and adds more data points for model training. The proposed architecture achieved an accuracy of 85.85% and an F1-score of 83% on the CMU-MOSEI dataset.
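The late-fusion step the abstract describes, combining per-modality predictions by bagging-style averaging and by stacking, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the fastText, 1-D CNN, and 2-D CNN outputs are replaced by synthetic class-probability matrices over the six CMU-MOSEI emotion classes, and the stacking meta-learner is assumed to be a small logistic regression trained by gradient descent (the paper does not specify its meta-learner, and in practice the meta-learner would be fit on held-out predictions rather than the same data).

```python
import numpy as np

# Six emotion classes as in CMU-MOSEI.
EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"]

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n_samples, n_classes = 200, len(EMOTIONS)
y_true = rng.integers(0, n_classes, size=n_samples)

def fake_modality_probs(noise):
    """Stand-in for one modality head (fastText / 1-D CNN / 2-D CNN):
    noisy logits with extra mass on the true class, then softmax."""
    logits = rng.normal(0.0, noise, size=(n_samples, n_classes))
    logits[np.arange(n_samples), y_true] += 2.0
    return softmax(logits)

probs_text = fake_modality_probs(1.0)
probs_audio = fake_modality_probs(1.5)
probs_image = fake_modality_probs(1.2)

# Bagging-style late fusion: average the per-modality class probabilities.
avg_probs = (probs_text + probs_audio + probs_image) / 3.0
bagging_pred = avg_probs.argmax(axis=1)

# Stacking-style late fusion: a logistic-regression meta-learner over the
# concatenated per-modality probabilities, trained by gradient descent.
X_meta = np.concatenate([probs_text, probs_audio, probs_image], axis=1)  # (N, 18)
W = np.zeros((X_meta.shape[1], n_classes))
b = np.zeros(n_classes)
Y = np.eye(n_classes)[y_true]  # one-hot targets
for _ in range(500):
    P = softmax(X_meta @ W + b)
    W -= 0.5 * (X_meta.T @ (P - Y)) / n_samples
    b -= 0.5 * (P - Y).mean(axis=0)
stacking_pred = softmax(X_meta @ W + b).argmax(axis=1)

print("bagging fusion accuracy :", (bagging_pred == y_true).mean())
print("stacking fusion accuracy:", (stacking_pred == y_true).mean())
```

Because each modality carries independent noise, either fusion typically beats any single modality on this toy data; the stacking variant can additionally learn to weight more reliable modalities, which is the motivation for comparing both strategies on CMU-MOSEI.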
