3.8 Proceedings Paper

AST: Audio Spectrogram Transformer

Venue

INTERSPEECH 2021
Pages 571-575

Publisher

International Speech Communication Association (ISCA)
DOI: 10.21437/Interspeech.2021-698

Keywords

audio classification; self-attention; Transformer

Funding

Signify

Abstract

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and whether neural networks based purely on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
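To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch of an AST-style model, not the authors' released code. It follows the abstract's description: the spectrogram is split into patches, each patch is linearly projected to an embedding, a learnable [CLS] token and positional embeddings are added, a stack of standard Transformer encoder layers processes the sequence, and a linear head on the [CLS] output produces class logits. The non-overlapping 16x16 patch split and the ViT-style hyperparameters below (embedding size 768, 12 layers, 12 heads, 527 AudioSet classes) are simplifying assumptions; the paper itself uses overlapping patches and initializes from pretrained Vision Transformer weights.

import torch
import torch.nn as nn

class ASTSketch(nn.Module):
    """Hypothetical AST-style classifier: patch embedding + Transformer encoder."""

    def __init__(self, n_mels=128, n_frames=1024, patch=16,
                 dim=768, depth=12, heads=12, n_classes=527):
        super().__init__()
        n_patches = (n_mels // patch) * (n_frames // patch)
        # A strided Conv2d is just a convenient way to apply one shared linear
        # projection to every 16x16 patch; there is no convolutional feature
        # extractor anywhere in the model.
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):                    # spec: (batch, n_mels, n_frames)
        x = self.embed(spec.unsqueeze(1))       # (batch, dim, n_mels/16, n_frames/16)
        x = x.flatten(2).transpose(1, 2)        # (batch, n_patches, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])               # classify from the [CLS] token

model = ASTSketch()
logits = model(torch.randn(2, 128, 1024))       # two 128-mel, 1024-frame inputs
print(logits.shape)                             # torch.Size([2, 527])

Note that the only convolution-shaped operation in this sketch is the strided Conv2d implementing the shared linear patch projection, a standard Vision Transformer implementation trick; all modeling capacity lives in the self-attention layers.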

Authors

Yuan Gong, Yu-An Chung, James Glass
