Proceedings Paper

AUDIOCLIP: EXTENDING CLIP TO IMAGE, TEXT AND AUDIO

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) (2022)

Journal

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)

Volume -, Issue -, Pages 976-980

Publisher

IEEE

DOI: 10.1109/ICASSP43922.2022.9747631

Keywords

Audio; multimodal; zero-shot; classification

Categories

Acoustics; Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic

Funding

BMBF [01IS19074, 01IS19075, 01IW20005]
TU Kaiserslautern PhD program

Summary

Sound classification has benefited greatly from methods developed in other domains. The current trend is to combine domain-specific tasks and approaches, which has produced strong new models. AudioCLIP is an extension of the CLIP model that handles audio as well as text and images while retaining CLIP's zero-shot capabilities. It achieves state-of-the-art results on the Environmental Sound Classification (ESC) task and sets new baselines for zero-shot ESC.

Abstract

The rapidly evolving field of sound classification has greatly benefited from the methods of other domains. Today, the trend is to fuse domain-specific tasks and approaches, which provides the community with outstanding new models. We present AudioCLIP, an extension of the CLIP model that handles audio in addition to text and images. Utilizing the AudioSet dataset, our proposed model incorporates the ESResNeXt audio model into the CLIP framework, thus enabling it to perform multimodal classification while keeping CLIP's zero-shot capabilities. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task and outperforms other approaches, reaching accuracies of 97.15% on ESC-50 and 90.07% on UrbanSound8K. Further, it sets new baselines in the zero-shot ESC task on the same datasets (69.40% and 68.78%, respectively). We also assess the influence of different training setups on the final performance of the proposed model. For the sake of reproducibility, our code is published.
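To make the idea of zero-shot classification in a shared embedding space more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that an audio encoder and a text encoder have already mapped a clip and a set of class-label prompts into the same vector space. The function names, the toy 3-dimensional embeddings, and the `scale` value are all hypothetical and chosen purely for illustration. The predicted class is the label whose text embedding has the highest cosine similarity with the audio embedding.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(audio_embedding, label_embeddings, scale=100.0):
    """Pick the label whose text embedding is closest to the audio embedding.

    `scale` is an assumed temperature-like factor applied to the cosine
    similarities before the softmax; its value here is illustrative only.
    """
    audio = l2_normalize(audio_embedding)
    labels = list(label_embeddings)
    logits = []
    for label in labels:
        text = l2_normalize(label_embeddings[label])
        cosine = sum(a * t for a, t in zip(audio, text))
        logits.append(scale * cosine)
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Hypothetical, hand-made embeddings standing in for encoder outputs.
label_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.8, 0.6],
    "siren": [0.1, 0.0, 0.95],
}
audio_embedding = [0.85, 0.2, 0.05]

predicted, probabilities = zero_shot_classify(audio_embedding, label_embeddings)
print(predicted)  # prints "dog bark"
```

Because no classifier head is trained on the target labels, new classes can be added simply by supplying additional label embeddings, which is what makes the setting "zero-shot".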
