Article

MARN: Multi-level Attentional Reconstruction Networks for Weakly Supervised Video Temporal Grounding

Journal

NEUROCOMPUTING
Volume 554, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.neucom.2023.126625

Keywords

Multimodal learning; Weakly-supervised learning; Deep neural networks; Video temporal grounding

Abstract
Video temporal grounding is a challenging task in computer vision that involves localizing, within a video, the segment semantically related to a given query. In this paper, we propose a novel weakly-supervised model, Multi-level Attentional Reconstruction Networks (MARN), which is trained on video-sentence pairs. During the training phase, we leverage the idea of attentional reconstruction to train an attention map that can reconstruct the given query. At inference time, proposals are ranked based on their attention scores to localize the most suitable segment. In contrast to previous methods, MARN effectively aligns video-level supervision with proposal scoring, thereby reducing the training-inference discrepancy. In addition, we incorporate a multi-level framework that encompasses both proposal-level and clip-level processes. The proposal-level process generates and scores variable-length time sequences, while the clip-level process generates and scores fixed-length time sequences to refine the predicted proposal scores in both training and testing. To improve the feature representation of the video, we propose a novel representation mechanism that exploits intra-proposal information and adopts 2D convolution to extract inter-proposal clues for learning reliable attention maps. By accurately representing these proposals, we can better align them with the textual modality and thus facilitate the learning of the model. Our proposed MARN is evaluated on two benchmark datasets, and extensive experiments demonstrate its superiority over existing methods.
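The core idea described in the abstract can be illustrated with a minimal sketch: score each clip's attention against the query, reconstruct the query from attention-weighted clip features (the training signal), and rank candidate proposals by their aggregate attention (the inference step). This is not the authors' implementation; all function names, shapes, and the dot-product attention are illustrative assumptions.

```python
# Illustrative sketch of attentional reconstruction for temporal grounding.
# NOT the paper's code: names, shapes, and dot-product attention are assumed.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_scores(clip_feats, query_feat):
    # Per-clip relevance to the query, normalized into an attention map.
    return softmax(clip_feats @ query_feat)

def reconstruct_query(clip_feats, attn):
    # Attention-weighted pooling of clip features; training would minimize
    # the distance between this reconstruction and the query embedding.
    return attn @ clip_feats

def rank_proposals(attn, proposals):
    # Score each (start, end) proposal by its mean clip attention,
    # returning proposals sorted from most to least relevant.
    scored = [(attn[s:e].mean(), (s, e)) for s, e in proposals]
    return [p for _, p in sorted(scored, reverse=True)]

# Toy video: 8 clips with 4-d features; clips 2..4 match the query exactly.
clips = np.tile(np.array([0.0, 1.0, 0.0, 0.0]), (8, 1))
clips[2:5] = np.array([1.0, 0.0, 0.0, 0.0])
query = np.array([1.0, 0.0, 0.0, 0.0])

attn = attention_scores(clips, query)
recon = reconstruct_query(clips, attn)
best = rank_proposals(attn, [(0, 3), (2, 5), (5, 8)])[0]
```

Here the proposal covering the query-aligned clips, `(2, 5)`, receives the highest mean attention, and the reconstruction is dominated by the query direction, mirroring how video-level supervision and proposal scoring are aligned.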

