Article

Order-Constrained Representation Learning for Instructional Video Prediction

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCSVT.2022.3149329

Keywords

Task analysis; Semantics; Representation learning; Visualization; Market research; Feature extraction; Automobiles; Instructional video; video prediction; weakly-supervised learning; representation learning

Funding

  1. National Natural Science Foundation of China [62125603, U1813218]
  2. Beijing Academy of Artificial Intelligence (BAAI)

Abstract

In this paper, we propose a weakly-supervised approach called Order-Constrained Representation Learning (OCRL) to predict future actions in instructional videos from an incomplete observation of the action steps. Most conventional methods predict actions from partially observed video frames and therefore mainly study low-level semantics such as motion consistency. Unlike performing a single action, completing a task in an instructional video usually requires several action steps and a longer time span. Motivated by the fact that the order of action steps is key to learning task semantics, we develop a new form of contrastive loss, called StepNCE, which integrates the shared semantic information between step order and task semantics under a memory bank-based momentum-updating framework. Specifically, we learn video representations from step order-rearranged trimmed video clips based on the proposed task-consistency and order-consistency rules. The StepNCE loss is used to pre-train a video feature encoder, which is then fine-tuned to carry out the instructional video prediction task. Our approach digs deeper into the sequential logic between the action steps of a given task, which can raise video understanding methods to a higher semantic level. We evaluate our method on five popular instructional video and action prediction datasets: COIN, CrossTask, UT-Interaction, BIT-Interaction, and ActivityNet v1.2; the results show that our approach improves over conventional prediction methods.
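The abstract describes StepNCE only at a high level: an InfoNCE-style contrastive objective over order-rearranged clips, trained with a memory bank and a momentum-updated key encoder. As a rough illustration of that general scaffolding (not the authors' implementation; function names such as step_nce_loss and the exact positive/negative construction below are our assumptions), a minimal PyTorch sketch might look like this:

```python
# Minimal sketch of an InfoNCE-style loss over a memory bank with a
# momentum-updated key encoder (MoCo-style scaffolding). This illustrates
# the general recipe the abstract alludes to, not the paper's StepNCE;
# all names and details here are assumptions.
import torch
import torch.nn.functional as F

def step_nce_loss(query, positive_key, memory_bank, temperature=0.07):
    """Pull `query` toward `positive_key` (e.g., an order-consistent view
    of the same task) and push it away from `memory_bank` entries
    (e.g., order-shuffled or cross-task clips).

    query:        (B, D) features from the query encoder
    positive_key: (B, D) features from the momentum (key) encoder
    memory_bank:  (K, D) queue of key features from past batches
    """
    q = F.normalize(query, dim=1)
    k = F.normalize(positive_key, dim=1)
    bank = F.normalize(memory_bank, dim=1)

    l_pos = torch.einsum("bd,bd->b", q, k).unsqueeze(1)   # (B, 1)
    l_neg = torch.einsum("bd,kd->bk", q, bank)            # (B, K)

    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # After concatenation the positive always sits at index 0.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Slowly drag the key encoder's weights toward the query encoder's."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```

In the paper's setting, the task-consistency and order-consistency rules would presumably decide which rearranged clips count as positives and which populate the bank as negatives; after pre-training with such an objective, the query encoder is fine-tuned for the prediction task. The published loss may differ from this sketch in its exact form.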
