Article

Synergic learning for noise-insensitive webly-supervised temporal action localization

Journal

IMAGE AND VISION COMPUTING
Volume 113

Publisher

ELSEVIER
DOI: 10.1016/j.imavis.2021.104247

Keywords

Temporal action localization; Web supervision; Spatio-temporal representation

Funding

  1. IER foundation [HT-JD-CXY-201904]
  2. Shenzhen Municipal Development and Reform Commission (Disciplinary Development Program for Data Science and Intelligent Computing)


Webly-supervised temporal action localization (WebTAL) trains localization models on web videos without manual temporal annotations. The proposed preprocessing-free framework and synergic learning paradigm mitigate the interference caused by noisy web video labels and outperform existing WebTAL methods on two public benchmarks. A Spatio-Temporal Order Prediction task, combined with Warm-up Synergic Training, improves spatio-temporal representation learning and action localization results.
Webly-supervised temporal action localization (WebTAL) leverages web videos to train localization models without requiring manual temporal annotations. WebTAL is extremely challenging because video-level labels on the web are always noisy, which seriously damages overall performance. Most state-of-the-art methods filter out noise before training, which inevitably reduces the number of training samples. In contrast, we propose a preprocessing-free WebTAL framework along with a new synergic learning paradigm to alleviate the noise interference. Specifically, we introduce a synergic task called Spatio-Temporal Order Prediction (STOP) for spatio-temporal representation learning. This task requires a network to arrange permuted spatial crops and temporal clips, thereby learning the inherent spatial semantics and temporal interactions in videos. Instead of pre-extracting features with a well-trained STOP network, we design a novel synergic learning paradigm called Warm-up Synergic Training (WST) to iteratively generate better spatio-temporal representations and improve action localization results. Experimental results show that this synergic approach largely mitigates the interference caused by label noise. We demonstrate that our method outperforms all other WebTAL methods on two public benchmarks, THUMOS'14 and ActivityNet v1.2. © 2021 Elsevier B.V. All rights reserved.
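
The abstract describes STOP as a task in which a network must arrange permuted temporal clips (and spatial crops). As a rough illustration of how a training sample for such an order-prediction task could be built, the following is a minimal sketch, not the authors' implementation. It assumes the ordering is encoded as a class label indexing one of the possible permutations; the function names `make_order_prediction_sample` and `restore_order` are hypothetical, and the "video" is a toy list of frame indices rather than real pixel data. The same shuffling could in principle be applied to spatial crops.

```python
import itertools
import random


def make_order_prediction_sample(clips, rng):
    """Shuffle clips and return (shuffled_clips, label).

    Assumption for illustration: the label is the index of the applied
    permutation in the lexicographically ordered list of all
    permutations, so a classifier could be trained to predict it.
    """
    perms = list(itertools.permutations(range(len(clips))))
    label = rng.randrange(len(perms))
    order = perms[label]
    shuffled = [clips[i] for i in order]
    return shuffled, label


def restore_order(shuffled, label):
    """Invert the permutation identified by `label`."""
    perms = list(itertools.permutations(range(len(shuffled))))
    order = perms[label]
    restored = [None] * len(shuffled)
    for position, source_index in enumerate(order):
        restored[source_index] = shuffled[position]
    return restored


# Toy "video": 12 frame indices split into 3 temporal clips of 4 frames.
frames = list(range(12))
clips = [frames[i:i + 4] for i in range(0, len(frames), 4)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
shuffled, label = make_order_prediction_sample(clips, rng)
recovered = restore_order(shuffled, label)
```

With three clips there are 3! = 6 possible orderings, so the label here is one of six classes. A model that predicts the label correctly has, in effect, recovered the original temporal order.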
