Article

Self-Supervised Progressive Network for High-Performance Video Object Segmentation

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TNNLS.2022.3219936

Keywords

Task analysis; Customer relationship management; Semantics; Object segmentation; Collaboration; Visualization; Decoding; Cycle consistency; self-supervised; similarity learning; video object segmentation (VOS)

Funding

  1. Italy-China Collaboration Project TALENT [2018YFE0118400]
  2. National Natural Science Foundation of China [62272438, 61931008, 61772494, 61836002, 61976069]
  3. Fundamental Research Funds for Central Universities

This paper proposes a self-supervised progressive network (SSPNet) for video object segmentation. SSPNet combines a memory retrieval module (MRM) with a collaborative refinement module (CRM). The MRM generates propagated coarse masks through self-supervised pixel-level and frame-level similarity learning, while the CRM refines those masks and is trained through cycle consistency region tracking. Two novel mask-generation strategies supply the CRM's training data with meaningful semantic information. Experiments on DAVIS-17, YouTube-VOS and SegTrack v2 show that SSPNet surpasses state-of-the-art self-supervised methods and narrows the gap with fully supervised methods.
Recently, self-supervised video object segmentation (VOS) has attracted much interest. However, most proxy tasks are designed to train only a single backbone, which relies on a point-to-point correspondence strategy to propagate masks through a video sequence. Because this pipeline is so simple, the performance of the single-backbone paradigm remains unsatisfactory. Instead of following the previous literature, we propose a self-supervised progressive network (SSPNet), which consists of a memory retrieval module (MRM) and a collaborative refinement module (CRM). The MRM performs point-to-point correspondence and produces a propagated coarse mask for a query frame through self-supervised pixel-level and frame-level similarity learning. The CRM, which is trained via cycle consistency region tracking, aggregates the reference and query information and implicitly learns the collaborative relationship between them to refine the coarse mask. Furthermore, to learn semantic knowledge from unlabeled data, we design two novel mask-generation strategies that provide the CRM's training data with meaningful semantic information. Extensive experiments on DAVIS-17, YouTube-VOS and SegTrack v2 demonstrate that our method surpasses state-of-the-art self-supervised methods and narrows the gap with fully supervised methods.
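
The point-to-point correspondence strategy mentioned in the abstract is often realized as affinity-based label propagation: each query pixel compares its feature with every reference pixel, and the reference mask labels are averaged with softmax weights. The following is a minimal sketch of that general idea only, not the MRM described in the paper. The function name `propagate_mask`, the flat list-of-vectors layout, the cosine similarity and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math


def propagate_mask(ref_feats, ref_labels, query_feats, temperature=0.07):
    """Propagate soft labels from reference pixels to query pixels.

    ref_feats:   N feature vectors, one per reference pixel
    ref_labels:  N label vectors of length C (one-hot or soft)
    query_feats: M feature vectors, one per query pixel
    Returns M label distributions of length C.
    All names and the temperature default are illustrative assumptions.
    """

    def normalize(vec):
        # L2-normalize so the dot product below is a cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    ref = [normalize(v) for v in ref_feats]
    num_classes = len(ref_labels[0])
    propagated = []
    for q in (normalize(v) for v in query_feats):
        # Similarity of this query pixel to every reference pixel.
        sims = [sum(a * b for a, b in zip(q, r)) / temperature for r in ref]
        # Numerically stable softmax over the reference pixels.
        peak = max(sims)
        weights = [math.exp(s - peak) for s in sims]
        total = sum(weights)
        # Weighted average of the reference labels.
        propagated.append([
            sum(w * lab[c] for w, lab in zip(weights, ref_labels)) / total
            for c in range(num_classes)
        ])
    return propagated


# Toy example: two reference pixels with orthogonal features and one-hot labels.
coarse = propagate_mask(
    ref_feats=[[1.0, 0.0], [0.0, 1.0]],
    ref_labels=[[1.0, 0.0], [0.0, 1.0]],
    query_feats=[[0.9, 0.1], [0.2, 0.8]],
)
hard = [row.index(max(row)) for row in coarse]  # per-pixel argmax: [0, 1]
```

A mask obtained this way is coarse because every query pixel is labeled independently of its neighbors, which is consistent with the abstract's motivation for adding a separate refinement stage.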
