Journal
Publisher
IEEE
DOI: 10.1109/ICASSP43922.2022.9747631
Keywords
Audio; multimodal; zero-shot; classification
Funding
- BMBF [01IS19074, 01IS19075, 01IW20005]
- TU Kaiserslautern PhD program
The field of sound classification has benefited greatly from methods developed in other domains. The trend is now to combine domain-specific tasks and approaches, resulting in stronger models. AudioCLIP is an extension of the CLIP model that can handle audio as well as text and images, while maintaining zero-shot capabilities. It achieves state-of-the-art results in the Environmental Sound Classification (ESC) task and sets new baselines for the zero-shot ESC task.
The rapidly evolving field of sound classification has greatly benefited from the methods of other domains. Today, the trend is to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. We present AudioCLIP, an extension of the CLIP model that handles audio in addition to text and images. Utilizing the AudioSet dataset, our proposed model incorporates the ESResNeXt audio model into the CLIP framework, thus enabling it to perform multimodal classification while keeping CLIP's zero-shot capabilities. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task and outperforms other approaches, reaching accuracies of 97.15% on ESC-50 and 90.07% on UrbanSound8K. Further, it sets new baselines in the zero-shot ESC task on the same datasets (69.40% and 68.78%, respectively). We also assess the influence of different training setups on the final performance of the proposed model. For the sake of reproducibility, our code is published.
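The zero-shot capability mentioned in the abstract can be illustrated with a minimal sketch of CLIP-style classification: an audio embedding is compared with the text embeddings of candidate class labels by cosine similarity, and a softmax over the scaled similarities gives class probabilities. This is not the authors' code. The function names, the toy 3-dimensional embeddings, and the fixed `scale` value are assumptions made for illustration; in a real model the embeddings would come from the trained audio and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(audio_embedding, label_embeddings, scale=100.0):
    """Pick the label whose text embedding is most similar to the audio embedding.

    `scale` stands in for the temperature applied to similarities before the
    softmax; the value 100.0 is an assumption for this sketch.
    """
    audio = l2_normalize(audio_embedding)

    # Scaled cosine similarity between the audio clip and every candidate label.
    logits = {}
    for label, embedding in label_embeddings.items():
        text = l2_normalize(embedding)
        logits[label] = scale * sum(a * t for a, t in zip(audio, text))

    # Numerically stable softmax over the labels.
    max_logit = max(logits.values())
    exps = {label: math.exp(v - max_logit) for label, v in logits.items()}
    total = sum(exps.values())
    probs = {label: v / total for label, v in exps.items()}

    best_label = max(probs, key=probs.get)
    return best_label, probs


# Hypothetical toy embeddings; real ones would be produced by the encoders.
label_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.8, 0.6],
    "siren": [0.1, 0.0, 0.95],
}
audio_embedding = [0.1, 0.7, 0.7]

best_label, probs = zero_shot_classify(audio_embedding, label_embeddings)
print(best_label)
```

Because the labels are supplied only at inference time as text, the same procedure applies to classes that were never seen as training targets, which is what makes the setting zero-shot.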