☆ 4.6 Article

Zero-Shot Audio Classification Via Semantic Embeddings

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (2021)

期刊

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

卷 29, 期 -, 页码 1233-1242

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TASLP.2021.3065234

关键词

Semantics; Acoustics; Training; Bit error rate; Testing; Supervised learning; Speech processing; Audio classification; semantic embedding; zero-shot learning

类别

Acoustics Engineering, Electrical & Electronic

资金

European Research Council under the European Union [637422 EVERYSOUND]

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

智能总结 New
摘要

This paper investigates zero-shot learning in audio classification using semantic embeddings extracted from textual labels and sentence descriptions, demonstrating the effectiveness of a bilinear compatibility framework and deep acoustic embeddings in improving classification performance. By involving semantically close sound classes in training and concatenating label/sentence embeddings from different language models, the results are further enhanced.

In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an acoustic-semantic projection between intermediate-level representations of audio instances and sound classes, i.e., acoustic embeddings and semantic embeddings. We use VGGish to extract deep acoustic embeddings from audio clips, and pre-trained language models (Word2Vec, GloVe, BERT) to generate either label embeddings from textual labels or sentence embeddings from sentence descriptions of sound classes. Audio classification is performed by a linear compatibility function that measures how compatible an acoustic embedding and a semantic embedding are. We evaluate the proposed method on a small balanced dataset ESC-50 and a large-scale unbalanced audio subset of AudioSet. The experimental results show that classification performance is significantly improved by involving sound classes that are semantically close to the test classes in training. Meanwhile, we demonstrate that both label embeddings and sentence embeddings are useful for zero-shot learning. Classification performance is improved by concatenating label/sentence embeddings generated with different language models. With their hybrid concatenations, the results are improved further.

Zero-Shot Audio Classification Via Semantic Embeddings

期刊

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Zero-Shot Audio Classification Via Semantic Embeddings

期刊

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

出版社

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文