3.8 Proceedings Paper

AST: Audio Spectrogram Transformer

Venue

INTERSPEECH 2021
Pages 571-575

Publisher

International Speech Communication Association (ISCA)
DOI: 10.21437/Interspeech.2021-698

Keywords

audio classification; self-attention; Transformer

Funding

Signify

Abstract

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and whether neural networks based purely on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
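To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch of an AST-style model, not the authors' released code. It follows the abstract's description: the spectrogram is split into patches, each patch is linearly projected to an embedding, a learnable [CLS] token and positional embeddings are added, a stack of standard Transformer encoder layers processes the sequence, and a linear head on the [CLS] output produces class logits. The non-overlapping 16x16 patch split and the ViT-style hyperparameters below (embedding size 768, 12 layers, 12 heads, 527 AudioSet classes) are simplifying assumptions; the paper itself uses overlapping patches and initializes from pretrained Vision Transformer weights.

import torch
import torch.nn as nn

class ASTSketch(nn.Module):
    """Hypothetical AST-style classifier: patch embedding + Transformer encoder."""

    def __init__(self, n_mels=128, n_frames=1024, patch=16,
                 dim=768, depth=12, heads=12, n_classes=527):
        super().__init__()
        n_patches = (n_mels // patch) * (n_frames // patch)
        # A strided Conv2d is just a convenient way to apply one shared linear
        # projection to every 16x16 patch; there is no convolutional feature
        # extractor anywhere in the model.
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):                    # spec: (batch, n_mels, n_frames)
        x = self.embed(spec.unsqueeze(1))       # (batch, dim, n_mels/16, n_frames/16)
        x = x.flatten(2).transpose(1, 2)        # (batch, n_patches, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])               # classify from the [CLS] token

model = ASTSketch()
logits = model(torch.randn(2, 128, 1024))       # two 128-mel, 1024-frame inputs
print(logits.shape)                             # torch.Size([2, 527])

Note that the only convolution-shaped operation in this sketch is the strided Conv2d implementing the shared linear patch projection, a standard Vision Transformer implementation trick; all modeling capacity lives in the self-attention layers.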

Authors

Yuan Gong, Yu-An Chung, James Glass
