Article

Contrastive predictive coding with transformer for video representation learning

Journal

NEUROCOMPUTING
Volume 482, Issue -, Pages 154-162

Publisher

ELSEVIER
DOI: 10.1016/j.neucom.2021.11.031

Keywords

Contrastive Learning; Self-Attention; Video Representation

Funding

  1. Education Department of Jiangxi Province of China [GJJ204912]
  2. Science and Technology Bureau of Ganzhou City of China [[2020]60]

Abstract

This paper presents a novel framework of self-supervised learning for video representation. Inspired by Contrastive Predictive Coding and Self-attention, we make the following contributions. First, we propose the Contrastive Predictive Coding with Transformer (CPCTR) framework for video representation learning in a self-supervised fashion. Second, we introduce the Transformer architecture to CPCTR to capture long-range spatio-temporal dependencies and thereby facilitate the learning of slow features in video, and we analyze the Transformer in our model to show its effectiveness. Finally, we evaluate our model by first training on the UCF101 dataset with self-supervised learning and then fine-tuning on downstream video classification tasks. Using RGB-only video data, we achieve state-of-the-art self-supervised performance on both UCF101 (Top-1 accuracy of 99.3%) and HMDB51 (Top-1 accuracy of 82.4%), and we show that CPCTR even outperforms fully supervised methods on the two datasets. The code is available at https://github.com/yliu1229/CPCTR. (C) 2021 Elsevier B.V. All rights reserved.
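The abstract describes the core recipe: encode each video clip into a latent embedding, aggregate the observed embeddings with a Transformer to exploit long-range temporal context, predict the embeddings of future clips, and train with a contrastive (InfoNCE-style) loss. The sketch below illustrates that recipe only; it is not the authors' CPCTR implementation (see the GitHub link in the abstract), and the module shapes, the linear predictor heads, and the use of pre-extracted clip features are all simplifying assumptions.

# Minimal PyTorch sketch of contrastive predictive coding with a Transformer
# aggregator, as described in the abstract. Hypothetical, not the CPCTR code:
# see https://github.com/yliu1229/CPCTR for the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCTransformerSketch(nn.Module):
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2, pred_steps=3):
        super().__init__()
        # Stand-in clip encoder; a real model would use a 3D-CNN backbone.
        self.encoder = nn.Linear(feat_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        # Transformer aggregator: self-attention over the observed clips
        # captures long-range spatio-temporal dependencies.
        self.aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One predictor head per future step (hypothetical design choice).
        self.predictors = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(pred_steps)])
        self.pred_steps = pred_steps

    def forward(self, clips):
        # clips: (batch, time, feat_dim) pre-extracted clip features.
        z = self.encoder(clips)                            # latents z_t
        B, T, _ = z.shape
        ctx = self.aggregator(z[:, :T - self.pred_steps])  # context over past
        c_t = ctx[:, -1]                                   # last context vector
        loss = 0.0
        for k, head in enumerate(self.predictors):
            pred = head(c_t)                               # predicted z_{t+k+1}
            target = z[:, T - self.pred_steps + k]         # true future latent
            # InfoNCE: the matching pair is the positive; the other samples
            # in the batch act as negatives.
            logits = pred @ target.t()                     # (B, B) similarities
            labels = torch.arange(B, device=z.device)
            loss = loss + F.cross_entropy(logits, labels)
        return loss / self.pred_steps

# Usage: one self-supervised step on a batch of 8 videos, 10 clips each.
model = CPCTransformerSketch()
loss = model(torch.randn(8, 10, 256))
loss.backward()

Under the evaluation protocol the abstract describes, the predictor heads would then be discarded and the encoder plus aggregator fine-tuned with a classification head on UCF101 and HMDB51.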
