Proceedings Paper

AUDIOCLIP: EXTENDING CLIP TO IMAGE, TEXT AND AUDIO

Publisher

IEEE
DOI: 10.1109/ICASSP43922.2022.9747631

Keywords

Audio; multimodal; zero-shot; classification

Funding

  1. BMBF [01IS19074, 01IS19075, 01IW20005]
  2. TU Kaiserslautern PhD program

Abstract

Sound classification has benefited greatly from methods developed in other domains, and the current trend is to combine domain-specific tasks and approaches into stronger models. AudioCLIP extends the CLIP model to handle audio as well as text and images while retaining CLIP's zero-shot capabilities. It achieves state-of-the-art results on the Environmental Sound Classification (ESC) task and sets new baselines for zero-shot ESC.
The rapidly evolving field of sound classification has benefited greatly from the methods of other domains. Today, the trend is to fuse domain-specific tasks and approaches, which provides the community with outstanding new models. We present AudioCLIP, an extension of the CLIP model that handles audio in addition to text and images. Using the AudioSet dataset for training, our proposed model incorporates the ESResNeXt audio model into the CLIP framework, enabling it to perform multimodal classification while keeping CLIP's zero-shot capabilities. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches with accuracies of 97.15% on ESC-50 and 90.07% on UrbanSound8K. Further, it sets new baselines in the zero-shot ESC task on the same datasets (69.40% and 68.78%, respectively). We also assess the influence of different training setups on the final performance of the proposed model. For the sake of reproducibility, our code is published.
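
To make the zero-shot idea in the abstract concrete, the following is a minimal sketch of how a CLIP-style model can classify an audio clip against text labels it was never explicitly trained on: both are mapped into a shared embedding space, and the label whose embedding is most similar to the audio embedding wins. This is not the authors' published code. The function names and the three-dimensional embeddings are hypothetical placeholders; a real system would obtain the vectors from trained audio and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_classify(audio_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the audio embedding.

    audio_embedding: list of floats (assumed output of an audio encoder).
    label_embeddings: dict mapping label text to a list of floats
                      (assumed output of a text encoder).
    """
    scores = {
        label: cosine_similarity(audio_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical toy embeddings, chosen by hand purely for illustration.
label_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.8, 0.6],
    "siren": [0.1, 0.0, 0.95],
}
audio_embedding = [0.1, 0.7, 0.7]

predicted_label, similarity_scores = zero_shot_classify(
    audio_embedding, label_embeddings
)
print(predicted_label)
```

Because classification reduces to a similarity lookup, new classes can be added by embedding new label text, with no retraining of the classifier itself; this is the property the abstract refers to as zero-shot capability.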
