Article

Comparison and Analysis of SampleCNN Architectures for Audio Classification

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING (2019)

Journal

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING

Volume 13, Issue 2, Pages 285-297

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/JSTSP.2019.2909479

Keywords

Audio classification; end-to-end learning; convolutional neural networks; residual networks; squeeze-and-excitation networks; interpretability

Categories

Engineering, Electrical & Electronic

Funding

National Research Foundation of Korea [2015R1C1A1A02036962]
National Research Foundation of Korea [31Z20130012985] Funding Source: Korea Institute of Science & Technology Information (KISTI), National Science & Technology Information Service (NTIS)

Abstract

End-to-end learning with convolutional neural networks (CNNs) has become a standard approach in image classification. In audio classification, however, CNN-based models that take time-frequency representations as input are still popular. A recently proposed CNN architecture called SampleCNN takes raw waveforms directly as input and uses very small filters. The architecture has proven effective in music classification tasks. In this paper, we scrutinize SampleCNN further by comparing it with a spectrogram-based CNN and by changing the subsampling operation in three different audio domains: music, speech, and acoustic scene sound. We also extend SampleCNN to more advanced versions using components from residual networks and squeeze-and-excitation networks. The results show that the squeeze-and-excitation block is particularly effective among them. Furthermore, we analyze the trained models to provide a better understanding of the architectures. First, we visualize hierarchically learned features to see how the filters with small granularity adapt to audio signals from different domains. Second, we observe the squeeze-and-excitation block by plotting the distribution of excitation in several different ways. This analysis shows that the excitation tends to become increasingly class-specific with increasing depth, but that the first layer, which takes raw waveforms directly, can also be highly class-specific, particularly for music data. We examine this further and show that the excitation in the first layer is sensitive to loudness, an acoustic characteristic that distinguishes different genres of music.
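The squeeze-and-excitation block mentioned in the abstract can be illustrated with a minimal NumPy sketch. It follows the generic squeeze-and-excitation recipe (global average pooling over time, a two-layer bottleneck, a sigmoid gate, channel-wise rescaling) applied to a one-dimensional feature map. The abstract does not give the paper's exact layer sizes or placement, so the function name `se_block_1d`, the reduction ratio, and all shapes below are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np


def se_block_1d(features, w1, b1, w2, b2):
    """Generic squeeze-and-excitation on a (channels, time) feature map.

    Hypothetical helper for illustration; not the paper's implementation.
    Returns the rescaled features and the per-channel excitation vector.
    """
    # Squeeze: global average pooling over the time axis -> (channels,)
    squeezed = features.mean(axis=1)
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel
    hidden = np.maximum(0.0, w1 @ squeezed + b1)
    excitation = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))
    # Scale: multiply every time step of a channel by that channel's gate
    scaled = features * excitation[:, None]
    return scaled, excitation


# Illustrative sizes only (assumed, not taken from the paper).
channels, time_steps, reduction = 8, 16, 4
rng = np.random.default_rng(0)
features = rng.standard_normal((channels, time_steps))
w1 = rng.standard_normal((channels // reduction, channels)) * 0.1
b1 = np.zeros(channels // reduction)
w2 = rng.standard_normal((channels, channels // reduction)) * 0.1
b2 = np.zeros(channels)

scaled, excitation = se_block_1d(features, w1, b1, w2, b2)
```

The `excitation` vector holds one gate value per channel, each strictly between 0 and 1. Collecting such vectors per layer and per class is one plausible way to obtain the kind of excitation distributions the abstract describes analyzing.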
