4.3 Article

A New Amharic Speech Emotion Dataset and Classification Benchmark

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/3529759

Keywords

Speech emotion recognition; Amharic dataset; classifiers; feature extraction

This article introduces the Amharic Speech Emotion Dataset (ASED), which covers four dialects and five emotions. It is the first dataset for Speech Emotion Recognition (SER) in Amharic. Sixty-five native Amharic speakers recorded its 2,474 sound samples. The resulting dataset is freely available for download.
In this article we present the Amharic Speech Emotion Dataset (ASED), which covers four dialects (Gojjam, Wollo, Shewa, and Gonder) and five different emotions (neutral, fearful, happy, sad, and angry). We believe it is the first Speech Emotion Recognition (SER) dataset for the Amharic language. Sixty-five volunteer participants, all native speakers of Amharic, recorded 2,474 sound samples, 2 to 4 seconds in length. Eight judges (two for each dialect) assigned emotions to the samples with a high level of agreement (Fleiss' kappa = 0.8). The resulting dataset is freely available for download. Next, we developed a four-layer variant of the well-known VGG model, which we call VGGb. Three experiments were then carried out using VGGb for SER on ASED. First, we investigated which features work best for Amharic: FilterBank, Mel Spectrogram, or Mel-frequency Cepstral Coefficient (MFCC). This was done by training three VGGb SER models on ASED, using FilterBank, Mel Spectrogram, and MFCC features, respectively. Four forms of training were tried: standard cross-validation and three variants based on sentences, dialects, and speaker groups. In these variants, a sentence used for training would not be used for testing, and likewise for a dialect or a speaker group. MFCC features were superior under all four training schemes. MFCC was therefore adopted for Experiment 2, where VGGb and three well-known existing models were compared on ASED: ResNet50, AlexNet, and LSTM. VGGb was found to have very good accuracy (90.73%) as well as the fastest training time. In Experiment 3, the performance of VGGb was compared when trained on two existing SER datasets, RAVDESS (English) and EMO-DB (German), as well as on ASED (Amharic). Results are comparable across these languages, with ASED being the highest. This suggests that VGGb can be successfully applied to other languages. We hope that ASED will encourage researchers to explore the Amharic language and to experiment with other models for Amharic SER.
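
The sentence-, dialect-, and speaker-group-based training schemes described in the abstract all rest on one idea: every sample that shares a given attribute value is kept on the same side of the train/test split. The following is a minimal sketch of that idea in plain Python, not the authors' actual procedure. The function name `grouped_folds`, the metadata fields (`speaker`, `sentence`, `clip`), the round-robin fold assignment, and the toy data are all assumptions made for illustration; the abstract does not specify how the folds were built.

```python
from collections import defaultdict


def grouped_folds(samples, key, n_folds):
    """Yield (train, test) lists in which no value of `key` appears on both sides.

    Hypothetical helper: whole groups are assigned to folds round-robin,
    which is one simple choice among many.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    if n_folds < 2 or n_folds > len(groups):
        raise ValueError("n_folds must be between 2 and the number of groups")

    # Sort group names so the assignment is deterministic.
    folds = [[] for _ in range(n_folds)]
    for index, group in enumerate(sorted(groups)):
        folds[index % n_folds].extend(groups[group])

    for held_out in range(n_folds):
        test = folds[held_out]
        train = [s for i, fold in enumerate(folds) if i != held_out for s in fold]
        yield train, test


# Toy metadata (assumed, not the real ASED layout): 6 speakers x 4 sentences.
samples = [
    {"speaker": f"spk{s}", "sentence": f"sent{t}", "clip": f"spk{s}_sent{t}.wav"}
    for s in range(6)
    for t in range(4)
]

# Speaker-independent folds: a held-out speaker never appears in training.
for train, test in grouped_folds(samples, key="speaker", n_folds=3):
    test_speakers = sorted({s["speaker"] for s in test})
    print(test_speakers, len(train), len(test))
```

Passing `key="sentence"` instead gives sentence-independent folds with the same code. Standard cross-validation, by contrast, shuffles individual samples, so clips from the same speaker or sentence can land in both the training and the test set.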
