Article

Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding

Journal

IEEE TRANSACTIONS ON IMAGE PROCESSING
Volume 30, Pages 3252-3262

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TIP.2021.3058614

Keywords

Grounding; Annotations; Two dimensional displays; Training; Feature extraction; Computational modeling; Task analysis; Weakly supervised; temporal sentence grounding

Funding

  1. National Key Research and Development Program [2018YFB0804204]
  2. Strategic Priority Research Program of Chinese Academy of Sciences [XDC02050500]
  3. National Natural Science Foundation of China [62022078, 62021001]
  4. Youth Innovation Promotion Association CAS [2018166]
  5. Open Project Program of the National Laboratory of Pattern Recognition (NLPR) [202000019]

LCNet utilizes hierarchical representation of video and text features and introduces a self-supervised cycle-consistent loss to effectively learn the matching relationships between video and text, achieving superior performance compared to existing weakly supervised methods.
Weakly supervised temporal sentence grounding has better scalability and practicability than fully supervised methods in real-world application scenarios. However, most existing methods cannot model the fine-grained local correspondences between video and text well and lack effective supervision for correspondence learning, and therefore yield unsatisfactory performance. To address these issues, we propose an end-to-end Local Correspondence Network (LCNet) for weakly supervised temporal sentence grounding. The proposed LCNet has several merits. First, we represent video and text features in a hierarchical manner to model the fine-grained video-text correspondences. Second, we design a self-supervised cycle-consistent loss as learning guidance for video and text matching. To the best of our knowledge, this is the first work to fully explore the fine-grained correspondences between video and text for temporal sentence grounding by using self-supervised learning. Extensive experimental results on two benchmark datasets demonstrate that the proposed LCNet significantly outperforms existing weakly supervised methods.
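
The abstract does not spell out the form of the cycle-consistent loss, so the following is only a generic illustration of the idea, not LCNet's actual formulation. It assumes dot-product similarity, a soft nearest neighbor over clips, and a cross-entropy penalty on returning to the starting word; the names `soft_nearest_neighbor` and `cycle_consistency_loss` are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_nearest_neighbor(query, candidates):
    """Attention-weighted average of candidates (assumed similarity: dot product)."""
    weights = softmax([dot(query, c) for c in candidates])
    dim = len(candidates[0])
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]


def cycle_consistency_loss(words, clips):
    """Word -> soft clip -> word cycle; penalize failing to return to the start.

    Illustrative assumption only: the loss is the mean cross-entropy between
    the return distribution over words and the index of the starting word.
    """
    total = 0.0
    for i, word in enumerate(words):
        clip_proxy = soft_nearest_neighbor(word, clips)            # forward step
        back = softmax([dot(clip_proxy, w) for w in words])        # backward step
        total += -math.log(back[i])
    return total / len(words)


# Toy features: each word has one clearly matching clip.
words = [[4.0, 0.0], [0.0, 4.0]]
aligned_clips = [[4.0, 0.0], [0.0, 4.0]]
# Clips that carry no information about which word they match.
flat_clips = [[1.0, 1.0], [1.0, 1.0]]

aligned_loss = cycle_consistency_loss(words, aligned_clips)
flat_loss = cycle_consistency_loss(words, flat_clips)
```

In this toy setting the loss is close to zero when every word has a distinct matching clip and rises when the clips are indistinguishable. That is why such a loss can act as a training signal without temporal annotations: it needs only the paired video and sentence.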
