4.6 Article

Using Hybrid HMM/DNN Embedding Extractor Models in Computational Paralinguistic Tasks

Journal

SENSORS
Volume 23, Issue 11, Pages -

Publisher

MDPI
DOI: 10.3390/s23115208

Keywords

hidden Markov model; deep neural network; embedding; hybrid acoustic model; computational paralinguistics

The field of computational paralinguistics deals with non-verbal content in human speech and its applications, such as emotion recognition and sleepiness detection. Two main technical challenges are handling varying-length utterances and training models on small corpora. This study presents a method that combines automatic speech recognition and paralinguistic approaches to address these challenges. Experimental results show that the proposed method outperforms the baseline approach and is competitive and resource-efficient for various paralinguistic tasks, depending on the aggregation techniques and neural network layers used.
The field of computational paralinguistics emerged from automatic speech processing, and it covers a wide range of tasks involving different phenomena present in human speech. It focuses on the non-verbal content of human speech, including tasks such as spoken emotion recognition, conflict intensity estimation and sleepiness detection from speech, which have straightforward applications in remote monitoring with acoustic sensors. The two main technical issues in computational paralinguistics are (1) handling varying-length utterances with traditional classifiers and (2) training models on relatively small corpora. In this study, we present a method that combines automatic speech recognition and paralinguistic approaches and is able to handle both of these issues. Specifically, we trained an HMM/DNN hybrid acoustic model on a general ASR corpus, and then used it as a source of embeddings that serve as features for several paralinguistic tasks. To convert the local (frame-level) embeddings into utterance-level features, we experimented with five aggregation methods: mean, standard deviation, skewness, kurtosis and the ratio of non-zero activations. Our results show that the proposed feature extraction technique consistently outperforms the widely used x-vector method, which served as the baseline, regardless of the paralinguistic task investigated. Furthermore, the aggregation techniques could also be combined effectively, leading to further improvements depending on the task and on the layer of the neural network serving as the source of the local embeddings. Overall, based on our experimental results, the proposed method can be considered a competitive and resource-efficient approach for a wide range of computational paralinguistic tasks.
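
The aggregation step described above can be illustrated with a minimal sketch. The abstract names the five statistics but does not give their exact formulas, so the choices below are assumptions: population (biased) moment estimates, excess kurtosis, and zero skewness and kurtosis for dimensions with no variance. The function name `aggregate_embeddings` is hypothetical, and the input is assumed to be a `(num_frames, dim)` matrix of activations taken from one layer of the acoustic model.

```python
import numpy as np


def aggregate_embeddings(frames: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Collapse a (num_frames, dim) matrix of frame-level embeddings
    into a single fixed-size utterance-level vector of length 5 * dim.

    Assumed definitions (not taken from the paper): biased moment
    estimates and excess kurtosis (normal distribution -> 0).
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, dim) array")

    mean = frames.mean(axis=0)
    centered = frames - mean
    std = np.sqrt((centered ** 2).mean(axis=0))

    # Guard against division by zero for constant dimensions.
    has_spread = std > eps
    safe_std = np.where(has_spread, std, 1.0)

    third_moment = (centered ** 3).mean(axis=0)
    fourth_moment = (centered ** 4).mean(axis=0)
    skewness = np.where(has_spread, third_moment / safe_std ** 3, 0.0)
    kurtosis = np.where(has_spread, fourth_moment / safe_std ** 4 - 3.0, 0.0)

    # Fraction of frames in which each unit is active (non-zero).
    nonzero_ratio = (frames != 0).mean(axis=0)

    # Concatenating the statistics is one simple way to combine them.
    return np.concatenate([mean, std, skewness, kurtosis, nonzero_ratio])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two utterances of different lengths, with ReLU-like activations.
    short_utt = np.maximum(rng.normal(size=(120, 8)), 0.0)
    long_utt = np.maximum(rng.normal(size=(950, 8)), 0.0)
    # Both map to vectors of the same length (5 statistics * 8 dims).
    print(aggregate_embeddings(short_utt).shape)
    print(aggregate_embeddings(long_utt).shape)
```

Because every statistic is computed over the time axis, utterances of any length map to a vector of the same size, which is what allows a traditional fixed-input classifier to be trained on top. Using only a subset of the statistics amounts to concatenating fewer of the five blocks.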
