Journal
NEUROCOMPUTING
Volume 482, Issue -, Pages 154-162
Publisher
ELSEVIER
DOI: 10.1016/j.neucom.2021.11.031
Keywords
Contrastive Learning; Self-Attention; Video Representation
Funding
- Education Department of Jiangxi Province of China [GJJ204912]
- Science and Technology Bureau of Ganzhou City of China [[2020]60]
Abstract
This paper presents a novel self-supervised learning framework for video representation. Inspired by Contrastive Predictive Coding and self-attention, we make the following contributions. First, we propose the Contrastive Predictive Coding with Transformer (CPCTR) framework for self-supervised video representation learning. Second, we introduce the Transformer architecture into CPCTR to capture long-range spatio-temporal dependencies and thereby facilitate the learning of slow features in video, and we analyze the Transformer in our model to demonstrate its effectiveness. Finally, we evaluate our model by first training on the UCF101 dataset with self-supervised learning and then fine-tuning on downstream video classification tasks. Using RGB-only video data, we achieve state-of-the-art self-supervised performance on both UCF101 (Top-1 accuracy of 99.3%) and HMDB51 (Top-1 accuracy of 82.4%); CPCTR even outperforms fully supervised methods on the two datasets. The code is available at https://github.com/yliu1229/CPCTR. (C) 2021 Elsevier B.V. All rights reserved.
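Contrastive-predictive-coding frameworks such as the one described here are typically trained with the InfoNCE objective: a predicted future feature is scored against one true (positive) feature and several negatives, and the loss is the negative log-softmax of the positive's score. The abstract does not give the paper's exact loss details, so the following is a minimal pure-Python sketch under assumed choices (cosine similarity, temperature 0.1), not the authors' implementation:

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def info_nce(pred, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log softmax score of the positive among all candidates.

    `pred` is the predicted feature, `positive` the matching true feature,
    `negatives` a list of distractor features. Temperature is an assumed
    hyperparameter, not taken from the paper.
    """
    logits = [cosine(pred, positive) / temperature]
    logits += [cosine(pred, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp for the softmax denominator.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)


# When the prediction aligns with the positive, the loss is near zero;
# when it aligns with a negative instead, the loss is large.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bad = info_nce([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
```

Minimizing this loss pushes the predicted feature toward the true future feature and away from the distractors, which is what drives the representation learning; in the paper's setting the predictor capturing long-range dependencies is the Transformer.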