Article

MES-P: An Emotional Tonal Speech Dataset in Mandarin with Distal and Proximal Labels

Journal

IEEE TRANSACTIONS ON AFFECTIVE COMPUTING
Volume 13, Issue 1, Pages 408-425

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TAFFC.2019.2945322

Keywords

Emotional speech; Mandarin; dataset; distal labels; proximal labels; tonal speech; emotion intensities

Funding

  1. National Natural Science Foundation of China [61906128]
  2. Natural Science Foundation of Jiangsu Province [BK20180834]
  3. French National Research Agency (Agence Nationale de la Recherche, ANR) [ANR-13-CORD-0004-02]

Abstract

Emotion shapes all aspects of our interpersonal and intellectual experiences, and its automatic analysis therefore has many applications. In this paper, we propose an emotional tonal speech dataset, the Mandarin Chinese Emotional Speech Dataset-Portrayed (MES-P), with both distal and proximal labels. In contrast with state-of-the-art datasets, which focus only on perceived emotions, MES-P includes not only perceived emotions (proximal labels) but also intended emotions (distal labels), making it possible to study human emotional intelligence, i.e., the ability to express and understand emotion, as well as the emotional misunderstandings that arise in real life. Furthermore, MES-P captures a defining feature of tonal languages and provides emotional speech samples matching the tonal distribution of real-life Mandarin. The dataset also features emotion intensity variations, introducing both moderate and intense versions of joy, anger, and sadness in addition to neutral. The collected speech samples are rated in valence-arousal (VA) space through continuous coordinate locations, yielding an emotional distribution pattern in the 2D VA space. High consistency between the speakers' emotional intentions and the listeners' perceptions is demonstrated by Cohen's kappa coefficients. Finally, extensive experiments are carried out on MES-P as a baseline for automatic emotion recognition and compared with human emotional intelligence.
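The consistency claim above rests on Cohen's kappa, which measures agreement between intended (distal) and perceived (proximal) labels beyond what chance alone would produce. Below is a minimal sketch of that computation on hypothetical categorical labels; the toy data and the cohen_kappa helper are illustrative only and not taken from the paper, whose ratings are actually collected as continuous valence-arousal coordinates before any categorical comparison.

```python
from collections import Counter

def cohen_kappa(distal, proximal):
    """Cohen's kappa between intended (distal) and perceived (proximal) labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from the marginal distributions.
    """
    assert len(distal) == len(proximal)
    n = len(distal)
    # Observed agreement: fraction of samples where intention == perception.
    p_o = sum(d == p for d, p in zip(distal, proximal)) / n
    # Chance agreement from the two marginal label distributions.
    d_counts, p_counts = Counter(distal), Counter(proximal)
    p_e = sum(d_counts[k] * p_counts.get(k, 0) for k in d_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels over MES-P's emotion categories (joy, anger, sadness,
# neutral); values chosen for illustration, not drawn from the dataset.
intended  = ["joy", "joy", "anger", "sadness", "neutral", "anger"]
perceived = ["joy", "neutral", "anger", "sadness", "neutral", "anger"]
print(f"kappa = {cohen_kappa(intended, perceived):.3f}")  # 0.778 on this toy data
```

The same quantity is available as sklearn.metrics.cohen_kappa_score; the hand-rolled version is shown only to make the p_o and p_e terms explicit.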
