Article

Domain Invariant Feature Learning for Speaker-Independent Speech Emotion Recognition

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TASLP.2022.3178232

Keywords

Training; Representation learning; Emotion recognition; Databases; Training data; Speech recognition; Feature extraction; Speech emotion recognition; speaker independent; adversarial learning; unsupervised domain adaptation; multi-source domain adaptation

Funding

  1. National Natural Science Foundation of China (NSFC) [U2003207, 61921004, 61902064, 62076195]
  2. Jiangsu Frontier Technology Basic Research Project [BK20192004]
  3. Zhishan Young Scholarship of Southeast University
  4. Scientific Research Foundation of Graduate School of Southeast University [YBPY1955]
  5. German Research Foundation (DFG) [442218748]

Abstract

In this paper, we propose a novel domain invariant feature learning (DIFL) method for speaker-independent speech emotion recognition (SER). The basic idea of DIFL is to learn speaker-invariant emotion features by eliminating the domain shifts between training and testing data caused by different speakers, from the perspective of multi-source unsupervised domain adaptation (UDA). Specifically, we embed a hierarchical alignment layer with a strong-weak distribution alignment strategy into the feature extraction block to first reduce, as much as possible, the discrepancy in the feature distributions of speech samples across different speakers. Furthermore, multiple discriminators in the discriminator block are used to confuse the speaker information of the emotion features, both within the training data and between the training and testing data. In this way, a multi-domain invariant representation of emotional speech can be gradually and adaptively achieved by updating the network parameters. We conduct extensive experiments on three public datasets, i.e., Emo-DB, eNTERFACE, and CASIA, to evaluate the SER performance of the proposed method. The experimental results show that the proposed method is superior to state-of-the-art methods.
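The core idea of reducing the discrepancy between speakers' feature distributions can be illustrated with a toy sketch. This is not the authors' implementation (which uses deep networks with adversarial discriminators); it is a minimal, assumption-laden illustration in plain Python of what a first-moment distribution-alignment loss measures: the function names (`feature_mean`, `mean_discrepancy`) and the synthetic "speakers" are hypothetical.

```python
import math
import random

def feature_mean(features):
    """Mean vector of a list of equal-length feature vectors."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]

def mean_discrepancy(feats_a, feats_b):
    """Euclidean distance between the mean features of two speakers.

    A greatly simplified stand-in for a distribution-alignment loss:
    driving this toward zero makes the two speakers' feature
    distributions agree in their first moment.
    """
    ma, mb = feature_mean(feats_a), feature_mean(feats_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ma, mb)))

# Toy example: two "speakers" whose 3-D features differ by a fixed offset,
# mimicking a speaker-induced domain shift.
random.seed(0)
spk_a = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
spk_b = [[x + 2.0 for x in f] for f in spk_a]      # shifted copy: domain shift
aligned = [[x - 2.0 for x in f] for f in spk_b]    # "alignment" undoes the shift

print(mean_discrepancy(spk_a, spk_b))    # large: distributions differ
print(mean_discrepancy(spk_a, aligned))  # near zero: distributions aligned
```

In the paper's setting, the subtraction above is replaced by a learned feature extractor whose parameters are updated so that the discrepancy (and the discriminators' ability to tell speakers apart) shrinks during training.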
