Article

An Improved Inter-Intra Contrastive Learning Framework on Self-Supervised Video Representation

Publisher

IEEE (Institute of Electrical and Electronics Engineers, Inc.)
DOI: 10.1109/TCSVT.2022.3141051

Keywords

Task analysis; Learning systems; Data models; Optical imaging; Feature extraction; Representation learning; Optical sensors; Self-supervised learning; video representation; video recognition; video retrieval; spatio-temporal convolution

Funding

  1. Japan Society for the Promotion of Science (JSPS) [JP19K20289, JP18H03339]

Abstract

This paper proposes a self-supervised contrastive learning method for learning video feature representations. By introducing intra-negative samples and utilizing strong data augmentations, the proposed method achieves significant improvements in video retrieval and video recognition tasks.
In this paper, we propose a self-supervised contrastive learning method for learning video feature representations. Traditional self-supervised contrastive learning methods train the model with constraints from anchor, positive, and negative data pairs: different samplings of the same video are treated as positives, while clips from other videos are treated as negatives. Because spatio-temporal information is important for video representation, we impose stricter temporal constraints by introducing intra-negative samples: in addition to clips from different videos, the negative set is extended with clips from the same anchor video whose temporal relations have been broken. With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn feature representations from videos. Strong data augmentations, residual clips, and a projection head are used to construct an improved version (IICv2). Three kinds of intra-negative generation functions are proposed, and extensive experiments with different network backbones are conducted on benchmark datasets. Without using pre-computed optical flow, the improved version outperforms the previous IIC by a large margin, e.g., top-1 video retrieval accuracy improves by 19.4 points (from 36.8% to 56.2%) on UCF101 and by 5.2 points (from 15.5% to 20.7%) on HMDB51. For video recognition, improvements of over 3 points are also obtained on these two benchmarks. Discussions and visualizations validate that IICv2 captures better temporal clues and indicate its underlying mechanism.
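
To make the ideas in the abstract concrete, below is a minimal sketch in PyTorch (an assumption; the framework is not specified here) of how an intra-negative could be generated by breaking the temporal order of a clip from the anchor video, how a residual clip can be formed by frame differencing, and how an InfoNCE-style loss can treat both inter-video clips and the intra-negative as negatives. The names shuffle_frames, residual_clip, and iic_loss are illustrative, not the authors' code, and frame shuffling is only one plausible instance of the three intra-negative generation functions the abstract mentions.

```python
# Minimal sketch of the inter-intra contrastive idea described in the abstract.
# All function and variable names are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def residual_clip(clip):
    """Frame-difference ("residual") clip, emphasising motion over static appearance.

    clip: (C, T, H, W) float tensor.
    """
    return clip[:, 1:] - clip[:, :-1]

def shuffle_frames(clip):
    """One possible intra-negative: break temporal order by shuffling frames."""
    t = clip.shape[1]
    perm = torch.randperm(t)
    return clip[:, perm]

def iic_loss(anchor, positive, inter_negatives, intra_negative, temperature=0.1):
    """InfoNCE-style loss with both inter-video and intra-video negatives.

    anchor, positive, intra_negative: (D,) clip embeddings, where the positive is
        another sampling of the anchor video and the intra-negative is a
        temporally perturbed clip of the same anchor video.
    inter_negatives: (N, D) embeddings of clips from other videos.
    """
    anchor = F.normalize(anchor, dim=0)
    positive = F.normalize(positive, dim=0)
    negatives = F.normalize(
        torch.cat([inter_negatives, intra_negative.unsqueeze(0)]), dim=1
    )

    pos_logit = (anchor @ positive) / temperature        # scalar similarity
    neg_logits = (negatives @ anchor) / temperature      # (N + 1,) similarities
    logits = torch.cat([pos_logit.unsqueeze(0), neg_logits])
    # The positive pair sits at index 0; the loss pushes it above all negatives.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```

In practice the embeddings would come from a spatio-temporal convolutional backbone followed by the projection head, and the loss would be averaged over a batch; those parts are omitted in this sketch.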

