☆ 4.6 Article

Maximal Figure-of-Merit Framework to Detect Multi-Label Phonetic Features for Spoken Language Recognition

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (2020)

期刊

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

卷 28, 期 -, 页码 682-695

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASLP.2020.2964953

关键词

Convolutional recurrent neural network; speech articulatory attributes; maximal figure-of-merit; deep bottleneck features; spoken language recognition

类别

Acoustics Engineering, Electrical & Electronic

资金

Academy of Finland [313970]
Finnish Scientific Advisory Board for Defence (MATINE) Project [2500M-0106]
ARAP grant from the Institute for Infocomm Research, A*STAR
Academy of Finland (AKA) [313970, 313970] Funding Source: Academy of Finland (AKA)

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

摘要

Bottleneck features (BNFs) generated with a deep neural network (DNN) have proven to boost spoken language recognition accuracy over basic spectral features significantly. However, BNFs are commonly extracted using language-dependent tied-context phone states as learning targets. Moreover, BNFs are less phonetically expressive than the output layer in a DNN, which is usually not used as a speech feature because of its very high dimensionality hindering further post-processing. In this article, we put forth a novel deep learning framework to overcome all of the above issues and evaluate it on the 2017 NIST Language Recognition Evaluation (LRE) challenge. We use manner and place of articulation as speech attributes, which lead to low-dimensional universal phonetic features that can be defined across all spoken languages. To model the asynchronous nature of the speech attributes while capturing their intrinsic relationships in a given speech segment, we introduce a new training scheme for deep architectures based on a Maximal Figure of Merit (MFoM) objective. MFoM introduces non-differentiable metrics into the backpropagation-based approach, which is elegantly solved in the proposed framework. The experimental evidence collected on the recent NIST LRE 2017 challenge demonstrates the effectiveness of our solution. In fact, the performance of speech language recognition (SLR) systems based on spectral features is improved for more than 5% absolute Cavg. Finally, the F1 metric can be brought from 77.6% up to 78.1% by combining the conventional baseline phonetic BNFs with the proposed articulatory attribute features.

Maximal Figure-of-Merit Framework to Detect Multi-Label Phonetic Features for Spoken Language Recognition

期刊

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Maximal Figure-of-Merit Framework to Detect Multi-Label Phonetic Features for Spoken Language Recognition

期刊

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文